前言
估计AI还有很长一段时间才能达到输出完全可信赖的结果。 在部分可信赖的前提下,有以下几个事实:
- 在工业化时代,阶级差的实质是信息差
- 工业化的发展让大众用上了相同的量产产品
- 工业化产品固化了信息差
- 根据能量守恒和系统熵增原理,在被AI升级的工业化时代,AI使用率越高,那么其绩效越好
- 对于个体,自动化、智能化的信息输入和输出对改变信息差至关重要
- AI部分可信赖要持续很长时间
- 人不得不对AI生成的结果进行二次验证
所以自动化、智能化的信息输入和输出,需要将GPT api应用到用于信息输入的浏览器、以及用于信息输出的编辑器中。 技术路线是一个典型的自动编码器结构如下:
浏览器 -> 抓取信息 -> AI编码(摘要)-> 人 -> 写出压缩信息(摘要)-> AI解码 -> 发布
正文
Emacs openai layer 中包括了openai api 包。 这样可以构建一个非常合理的架构: 在本地私有layer中,所有需要用GPT的地方(其实就只有信息发布和信息采集两个应用),加上openai依赖。 例如在org-roam的信息采集模板
(setq org-roam-capture-ref-templates
'(("p" "Protocol" plain
"* [[${ref}][%(transform-square-brackets-to-round-ones \"${title}\")]]\n - Source: %u\n#+BEGIN_QUOTE\n%(format-i-string)\n#+END_QUOTE"
:target (file+head "${slug}.org" "#+title: ${title}")
:unnarrowed t)
("L" "Protocol Link" plain
"* [[${ref}][%(transform-square-brackets-to-round-ones \"${title}\")]]\n - Source: %u\n%(get-text-from-url)"
:target (file+head "${slug}.org" "#+title: ${title}")
:unnarrowed t)))
中的采集文本函数直接使用openai的api
(defun get-text-from-url ()
(if (yes-or-no-p "Summarize the web page %s by GPT?")
(let ((web-text
(subset-string-to-max-length
(shell-command-to-string
(format
"/home/dustin/miniconda3/bin/python3 ~/github/data_extractor/html2text.py --url \"%s\""
(plist-get org-store-link-plist :link)))
13000)))
(progn
(defvar my-callback nil)
(setq my-callback nil)
;; (message web-text)
(openai-chat
`[(("role" . "system") ("content" . "I will provide you a web page inner text. You should ignore the noise text in it, and summarize in less than 10 bullets in org format in Chinese language."))
(("role" . "user") ("content" . ,web-text))]
(lambda (data)
(let ((choices (let-alist data .choices)))
(setq my-callback (mapconcat
(lambda (choice)
(let-alist choice
(let-alist .message
(progn
;; (message (string-trim .content))
(string-trim .content)
)
)
)
)
choices "\n"))))
:max-tokens 2000
:temperature 1.0
:model"gpt-3.5-turbo-16k-0613")
(cl-loop until my-callback do (sit-for 0.3))
my-callback
))
""))
其中html2text.py
用来从html源码中提取正文信息
import argparse
from goose3 import Goose
from goose3.text import StopWordsChinese
import re
import requests
class extractor:
def __init__(self):
self.g_cn = Goose({'stopwords_class': StopWordsChinese})
self.g_en = Goose()
pass
def __contains_chinese(self, text):
pattern = re.compile(r'[\u4e00-\u9fff]') # Range for Chinese characters
return bool(pattern.search(text))
def __get_html(self, url):
try:
response = requests.get(url)
response.raise_for_status() # Raise an exception for bad status codes
return response.text # Return the raw HTML content
except requests.RequestException as e:
print(f"Failed to retrieve HTML: {e}")
return None
def url(self, url):
html_string = self.__get_html(url)
return self.html(html_string)
pass
def html(self, html_string):
html_string = html_string.strip(" \n")
if html_string == "":
return ""
if self.__contains_chinese(html_string):
g = self.g_cn
else:
g = self.g_en
article = g.extract(raw_html = html_string)
return article.cleaned_text.strip(" \n")
def html_file(self, html_file):
with open(html_file, 'r') as file:
html_string = file.read()
return self.html(html_string = html_string)
def main():
parser = argparse.ArgumentParser(description = 'format string vector')
parser.add_argument('--string', dest='inString', default = "", help='input string')
parser.add_argument('--file', dest='inFile', default = "", help='input string File')
parser.add_argument('--url', dest='inURL', default = "", help='input URL')
args = parser.parse_args()
inString = args.inString.strip(" \n")
inFile = args.inFile.strip(" \n")
inURL = args.inURL.strip(" \n")
reString = ""
e = extractor()
if inString != "":
reString = e.html(inString)
elif inFile != "":
reString = e.html_file(inFile)
elif inURL != "":
reString = e.url(inURL)
print(reString)
if __name__ == '__main__':
main()
这样直接将正文作为api的输入,可以节省token且能够直接传达正文信息,去除了噪声。