It Turns Out We Live Inside an Encoder

Preface

It will likely be a long time before AI can produce fully trustworthy output, so for now we have to work on the premise that its results are only partially reliable.

Automating information input and output therefore means wiring the GPT API into the browser (where information comes in) and the editor (where information goes out). The technical route is a classic autoencoder structure:

browser -> scrape information -> AI encode (summarize) -> human -> edit compressed information (summary) -> AI decode -> publish
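
The roundtrip above can be sketched as two lossy functions with a human editing the compressed representation in the middle. The `encode`/`decode` stubs below are hypothetical stand-ins for the real GPT API calls, just to make the shape of the pipeline concrete:

```python
# Sketch of the autoencoder-style pipeline. encode/decode are hypothetical
# placeholders for GPT summarization and expansion calls.

def encode(page_text: str, max_bullets: int = 10) -> list:
    """AI encode: compress a scraped page into at most `max_bullets` bullets."""
    sentences = [s.strip() for s in page_text.split(".") if s.strip()]
    return sentences[:max_bullets]  # stand-in for a GPT summarization call

def decode(bullets: list) -> str:
    """AI decode: expand human-edited bullets back into publishable prose."""
    return ". ".join(bullets) + "."  # stand-in for a GPT expansion call

bullets = encode("First point. Second point. Noise. More noise.")
edited = bullets[:2]          # the human edits the compressed representation
article = decode(edited)      # article == "First point. Second point."
```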

Main text

The Emacs openai layer bundles the openai API package. That enables a very clean architecture: in a local private layer, add the openai dependency only where GPT is actually needed (in practice just two applications, information capture and information publishing). For example, in the org-roam capture templates

  (setq org-roam-capture-ref-templates
        '(("p" "Protocol" plain
           "* [[${ref}][%(transform-square-brackets-to-round-ones \"${title}\")]]\n - Source: %u\n#+BEGIN_QUOTE\n%(format-i-string)\n#+END_QUOTE"
           :target (file+head "${slug}.org" "#+title: ${title}")
           :unnarrowed t)
          ("L" "Protocol Link" plain
           "* [[${ref}][%(transform-square-brackets-to-round-ones \"${title}\")]]\n - Source: %u\n%(get-text-from-url)"
           :target (file+head "${slug}.org" "#+title: ${title}")
           :unnarrowed t)))

the text-capture function calls the OpenAI API directly:

  (defvar my-callback nil
    "Holds the summary produced by the asynchronous `openai-chat' callback.")

  (defun get-text-from-url ()
    "Summarize the page behind the captured link with GPT; return org text."
    (if (yes-or-no-p (format "Summarize the web page %s by GPT?"
                             (plist-get org-store-link-plist :link)))
        (let ((web-text
               (subset-string-to-max-length
                (shell-command-to-string
                 (format
                  "/home/dustin/miniconda3/bin/python3 ~/github/data_extractor/html2text.py --url \"%s\""
                  (plist-get org-store-link-plist :link)))
                13000)))
          (setq my-callback nil)
          (openai-chat
           `[(("role" . "system")
              ("content" . "I will provide you a web page inner text. You should ignore the noise text in it, and summarize in less than 10 bullets in org format in Chinese language."))
             (("role" . "user") ("content" . ,web-text))]
           (lambda (data)
             ;; Join the trimmed content of every returned choice.
             (let-alist data
               (setq my-callback
                     (mapconcat (lambda (choice)
                                  (let-alist choice
                                    (string-trim .message.content)))
                                .choices "\n"))))
           :max-tokens 2000
           :temperature 1.0
           :model "gpt-3.5-turbo-16k-0613")
          ;; `openai-chat' returns immediately; poll until the callback fires.
          (cl-loop until my-callback do (sit-for 0.3))
          my-callback)
      ""))
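
The `subset-string-to-max-length` helper is a private function not shown here; its job is presumably just to clamp the scraped text to a character budget so the prompt fits the 16k-token context of `gpt-3.5-turbo-16k-0613` (13000 characters leaves headroom for the system message and the reply). A minimal Python equivalent of that assumed behavior:

```python
def subset_string_to_max_length(text: str, max_length: int) -> str:
    """Clamp `text` to at most `max_length` characters.

    Assumed equivalent of the Elisp helper `subset-string-to-max-length`:
    a hard truncation that keeps the prompt within the model's context
    window, at the cost of dropping the tail of very long pages.
    """
    return text if len(text) <= max_length else text[:max_length]
```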

where html2text.py extracts the article body from the HTML source:

import argparse
from goose3 import Goose
from goose3.text import StopWordsChinese
import re
import requests

class extractor:
    def __init__(self):
        self.g_cn = Goose({'stopwords_class': StopWordsChinese})
        self.g_en = Goose()

    def __contains_chinese(self, text):
        pattern = re.compile(r'[\u4e00-\u9fff]')  # Range for Chinese characters
        return bool(pattern.search(text))

    def __get_html(self, url):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # Raise an exception for bad status codes
            return response.text  # Return the raw HTML content
        except requests.RequestException as e:
            print(f"Failed to retrieve HTML: {e}")
            return None

    def url(self, url):
        html_string = self.__get_html(url)
        if html_string is None:  # Network failure: degrade to empty output
            return ""
        return self.html(html_string)

    def html(self, html_string):
        html_string = html_string.strip(" \n")
        if html_string == "":
            return ""

        if self.__contains_chinese(html_string):
            g = self.g_cn
        else:
            g = self.g_en

        article = g.extract(raw_html = html_string)
        return article.cleaned_text.strip(" \n")

    def html_file(self, html_file):
        with open(html_file, 'r') as file:
            html_string = file.read()
            return self.html(html_string = html_string)

def main():
    parser = argparse.ArgumentParser(description = 'format string vector')
    parser.add_argument('--string', dest='inString', default = "", help='input string')
    parser.add_argument('--file', dest='inFile', default = "", help='input string File')
    parser.add_argument('--url', dest='inURL', default = "", help='input URL')
    args = parser.parse_args()

    inString = args.inString.strip(" \n")
    inFile = args.inFile.strip(" \n")
    inURL = args.inURL.strip(" \n")

    reString = ""
    e = extractor()
    if inString != "":
        reString = e.html(inString)
    elif inFile != "":
        reString = e.html_file(inFile)
    elif inURL != "":
        reString = e.url(inURL)

    print(reString)

if __name__ == '__main__':
    main()
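
The extractor chooses between the Chinese and English Goose configurations by probing for CJK characters. That heuristic is self-contained and easy to check in isolation:

```python
import re

# The same heuristic html2text.py uses to pick a Goose configuration:
# any character in the CJK Unified Ideographs block (U+4E00..U+9FFF)
# marks the page as Chinese, so the Chinese stop-word list is used.
CHINESE_CHAR = re.compile(r'[\u4e00-\u9fff]')

def contains_chinese(text: str) -> bool:
    return bool(CHINESE_CHAR.search(text))
```

Note the limits of the heuristic: it fires on a single ideograph, so a mostly-English page quoting one Chinese word is treated as Chinese.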

Passing only the extracted body text to the API saves tokens and delivers the substance of the page directly, with the noise stripped out.

石见 石页 /
Published under a Creative Commons (CC) license
Category: technology
Tags: gpt  emacs  encoder  中