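I recently came across a very interesting challenge: using AI to digitize a large number of documents and letting users ask complex, data-related questions about them.
These are not typical questions that you can answer with RAG alone. Instead, we will use LangChain's SQLAgent to generate complex database queries from human text.
The documents should contain data with many specifications, along with more fluent, natural-language descriptions.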
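We will go through the following steps so that, in the end, we can ask complex questions about a large collection of documents:
1. Read the PDF documents.
2. Use GPT to analyze the content of each document and parse it into JSON objects.
3. Write these objects to SQLite or another database, spread across multiple tables.
4. Use the LangChain SQL agent to ask questions by automatically generating SQL statements.
Note: this article covers concepts from AI and data processing. To get the most out of it, you should have a basic working knowledge of Python programming, access to GPT models, embeddings, vector search, and SQL databases.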
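The step that stores the parsed objects in SQLite, spread across several tables, might look like the following minimal sketch. The table names, column names and the `store_product` helper are assumptions made for illustration and are not taken from the original project; `sqlite3` is Python's built-in module.

```python
import sqlite3

# Hypothetical schema: one row per product plus a child table for list values.
SCHEMA = """
CREATE TABLE IF NOT EXISTS product (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    nominal_wattage REAL
);
CREATE TABLE IF NOT EXISTS product_application (
    product_id INTEGER NOT NULL REFERENCES product(id),
    application TEXT NOT NULL
);
"""


def store_product(conn, product):
    """Insert one parsed product dict and return its new row id."""
    technical = product.get("technical_data", {})
    cursor = conn.execute(
        "INSERT INTO product (name, nominal_wattage) VALUES (?, ?)",
        (product["name"], technical.get("nominal_wattage")),
    )
    product_id = cursor.lastrowid
    # Each list entry becomes its own row in the child table.
    conn.executemany(
        "INSERT INTO product_application (product_id, application) VALUES (?, ?)",
        [(product_id, app) for app in product.get("applications", [])],
    )
    conn.commit()
    return product_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
example = {
    "name": "XBO 1000 W/HS OFR",
    "applications": ["Sunlight simulation", "Digital film and video projection"],
    "technical_data": {"nominal_wattage": 1000.0},
}
product_id = store_product(conn, example)
```

Splitting list-valued properties into a child table keeps them queryable with plain SQL joins, which is the kind of statement an SQL agent would later have to generate.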
We will use Python and LangChain to read and analyze the PDF documents. I am using Python 3.11.
First, we install the dependencies our environment needs:
%pip install pypdf
%pip install langchain
%pip install langchain_openai
# sqlite3 is part of the Python standard library, so it needs no pip install
# import the PDF reader
from pypdf import PdfReader
# import the LangChain message types
from langchain_core.messages import HumanMessage, SystemMessage
# import the OpenAI chat model
from langchain_openai import ChatOpenAI
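Now let's dive into PDF parsing. Our goal is to use visitor_text to extract the meaningful content while ignoring less useful information such as empty lines, headers and footers.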
document_content = None
page_contents = []

def visitor_body(text, cm, tm, fontDict, fontSize):
    # tm[5] is the vertical position of the text on the page
    y = tm[5]
    # keep only text in the page body, skipping the header and footer bands
    if text and 35 < y < 770:
        page_contents.append(text)

with open('./documents/ZMP_55852_XBO_1000_W_HS_OFR.pdf', 'rb') as file:
    pdf_reader = PdfReader(file)
    for page in pdf_reader.pages:
        # extract the text content of each PDF page
        page.extract_text(visitor_text=visitor_body)

document_content = "\n".join(page_contents)
print(document_content)
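To see the filtering rule in isolation, here is a small sketch that needs no PDF file. The `make_body_visitor` factory is a hypothetical helper of my own; it keeps the 35 and 770 thresholds used above and, as a variation, also drops whitespace-only fragments. The calls at the bottom only simulate the five arguments that a `visitor_text` callback receives.

```python
def make_body_visitor(y_min=35, y_max=770):
    """Return (visitor, collected); the visitor keeps only body text."""
    collected = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm[5] is the vertical translation of the text matrix
        y = tm[5]
        # skip whitespace-only fragments and anything outside the body band
        if text.strip() and y_min < y < y_max:
            collected.append(text)

    return visitor, collected


visitor, collected = make_body_visitor()

# Simulated callback invocations standing in for a real page extraction.
identity = [1, 0, 0, 1, 0, 0]
visitor("Page 1 of 5", identity, [1, 0, 0, 1, 50, 20], None, 8)  # footer band
visitor("Nominal wattage 1000.00 W", identity, [1, 0, 0, 1, 50, 400], None, 10)
visitor("   ", identity, [1, 0, 0, 1, 50, 410], None, 10)  # blank fragment
visitor("Product datasheet", identity, [1, 0, 0, 1, 50, 800], None, 12)  # header band
```

With a real document, the returned `visitor` would be passed as `page.extract_text(visitor_text=visitor)`. The right thresholds depend on the page layout, so treat 35 and 770 as values to tune per document family.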
Let's look at the parsed document:
Product family benefits _ Short arc with very high luminance for brighter screen illumination _ Constant color temperature of 6,000 K throughout the entire lamp lifetime _ Easy to maintain _ High arc stability _ Instant light on screen thanks to hot restart function _ Wide dimming range Product family features _ Color temperature: approx. 6,000 K (Daylight) _ Wattage: 450…10,000 W _ Very good color rendering index: Ra > Product datasheet XBO 1000 W/HS OFR XBO for cinema projection | Xenon short-arc lamps 450…10,000 W [..] Packaging unit (Pieces/Unit) Dimensions (length x width x height) Volume Gross weight 4008321082114 XBO 1000 W/HS OFR Shipping carton box 1 410 mm x 184 mm x 180 mm 13.58 dm³ 819.00 g [..]
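The parsed content clearly lacks structure: the tables are disjointed and related entities are scattered.
So we let GPT understand and reorganize the document for us.
Using the OpenAI Chat API, we will ask GPT to generate a JSON object from a new set of parsed product data. Let's build a carefully thought-out system message to start the process. We begin with clear instructions for GPT, then present parsed data as context, interspersed with targeted hints that refine the output.
Look closely at how the different hints are combined to shape exactly the JSON output we need.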
system_message = """
You analyze product descriptions to export them into a JSON format. I will present you with a product data sheet and describe the individual JSON objects and properties with <<<. You then create a JSON object from another product data sheet.

>>> Example product:

Product family benefits <<< benefits (string[])
_ Short arc with very high luminance for brighter screen illumination <<< benefits.[*]
_ Constant color temperature of 6,000 K throughout the entire lamp lifetime <<< benefits.[*]
[..]
_ Wide dimming range <<< benefits.[*]
Product family features <<< product_family (object)
_ Color temperature: approx. 6,000 K (Daylight) <<< product_family.temperature = 6000
_ Wattage: 450…10,000 W <<< product_family.watts_min = 450, product_family.watts_max = 10000
_ Very good color rendering index: Ra >
Product datasheet
XBO 1000 W/HS OFR <<< name
XBO for cinema projection | Xenon short-arc lamps 450…10,000 W <<< description
[..]
Technical data
Electrical data <<< technical_data (object)
Nominal current 50 A <<< technical_data.nominal_current = 50.00
Current control range 30…55 A <<< technical_data.control_range_min = 30, technical_data.control_range_max = 55
Nominal wattage 1000.00 W <<< technical_data.nominal_wattage = 1000.00
Nominal voltage 19.0 V <<< technical_data.nominal_voltage = 19.0
Dimensions & weight <<< dimensions (object)
[..]
Safe Use Instruction
The identification of the Candidate List substance is <<< environmental_information.safe_use (beginning of string)
sufficient to allow safe use of the article. <<< environmental_information.safe_use (end of string)
Declaration No. in SCIP database
22b5c075-11fc-41b0-ad60-dec034d8f30c <<< environmental_information.scip_declaration_number (single string!)
Country specific information
[..]
Shipping carton box
1
410 mm x 184 mm x <<< packaging_unit.length = 410, packaging_unit.width = 184
180 mm <<< packaging_unit.height = 180
[..]
"""
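My prompt combines several different techniques.
You can be fully creative here and try whatever makes sense to you. Expect to iterate on the prompt several times until it fits your use case.
Note: it is best to write the prompt entirely in English rather than mixing Chinese and English.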
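If you iterate on the hints often, it can help to assemble the annotated example programmatically instead of editing one long string. The following is only a sketch of that idea: the `annotate` helper and the `example_lines` structure are my own assumptions, while the instruction text and the `<<<` convention come from the prompt itself.

```python
INSTRUCTIONS = (
    "You analyze product descriptions to export them into a JSON format. "
    "I will present you with a product data sheet and describe the individual "
    "JSON objects and properties with <<<. "
    "You then create a JSON object from another product data sheet."
)


def annotate(lines):
    """Join (text, hint) pairs into 'text <<< hint' lines; hint may be None."""
    rendered = []
    for text, hint in lines:
        rendered.append(f"{text} <<< {hint}" if hint else text)
    return "\n".join(rendered)


# A few lines of the parsed data sheet, each paired with its hint (or None).
example_lines = [
    ("Product family benefits", "benefits (string[])"),
    ("_ Wide dimming range", "benefits.[*]"),
    ("Nominal voltage 19.0 V", "technical_data.nominal_voltage = 19.0"),
    ("Country specific information", None),
]

prompt = INSTRUCTIONS + "\n>>> Example product:\n" + annotate(example_lines)
print(prompt)
```

Keeping text and hint as separate values makes it easy to add, remove or reword a single hint while testing how the model reacts.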
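Now it is time to put GPT to the test and see whether it can turn our messy PDF text into a tidy JSON object.
The 0125 version of GPT-3.5-Turbo responds more accurately in requested formats such as JSON, which suits our case well. We already have system_message ready and pair it with document_content as input: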
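# initialize the OpenAI model
chat = ChatOpenAI(model_name='gpt-3.5-turbo-0125', temperature=0)

def convert_to_json(document_content):
    messages = [
        # the system message sets the role and the instructions
        SystemMessage(content=system_message),
        # the parsed document text is our input
        HumanMessage(content=document_content),
    ]
    # send the messages to the model and return the text of its reply
    answer = chat.invoke(messages)
    return answer.content

# named json_output so that it does not shadow Python's json module
json_output = convert_to_json(document_content)
# json_output holds the content returned by the OpenAI model
print(json_output)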
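The model's reply is plain text, so before using it as data we still have to parse it. Below is a minimal sketch of a defensive parser; `parse_json_reply` is a hypothetical helper, and the assumption that a reply may arrive wrapped in a Markdown code fence is mine, not something the article's run showed.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that may surround the JSON


def parse_json_reply(reply):
    """Parse a model reply into Python data, tolerating a surrounding code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # drop the opening fence line (for example the marker followed by "json")
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # drop the closing fence and anything after it
        text = text.rsplit(FENCE, 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError as error:
        raise ValueError(f"Model reply is not valid JSON: {error}") from error


fenced_reply = (
    FENCE
    + 'json\n{"name": "XBO 1000 W/HS OFR", "technical_data": {"nominal_voltage": 19.0}}\n'
    + FENCE
)
product = parse_json_reply(fenced_reply)
print(product["technical_data"]["nominal_voltage"])
```

Raising a `ValueError` with the decoder's message makes it straightforward to log the failing document and retry it later.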
With everything in place, let's look at the JSON output we got:
{
  "name": "XBO 1000 W/HS OFR",
  "description": "XBO for cinema projection | Xenon short-arc lamps 450…10,000 W",
  "applications": [
    "Classic 35 mm film projection",
    "Digital film and video projection",
    "Architectural and effect light (“Light Finger”)",
    "Sunlight simulation"
  ],
  "technical_data": {
    "nominal_current": 50.00,
    "control_range_min": 30,
    "control_range_max": 55,
    "nominal_wattage": 1000.00,
    "nominal_voltage": 19.0
  },
  "dimensions": {
    "diameter": 40.0,
    "length": 235.0,
    "length_base": 205.00,
    "light_center_length": 95.0,
    "electrode_gap": 3.6,
    "weight": 255.00
  },
  "operating_conditions": {
    "max_temp": 230,
    "lifespan": 2000,
    "service_lifetime": 3000
  },
  "additional_data": {
    "base_anode": "SFa27-11",
    "base_cathode": "SFcX27-8",
    "product_remark": "OFR = Ozone-free version/H = Suitable for horizontal burning position/S = Short"
  },
  "capabilities": {
    "cooling": "Forced",
    "burning_position": "s20/p20"
  },
  "environmental_information": {
    "declaration_date": "10-03-2023",
    "primary_product_number": "4008321082114 | 4050300933566",
    "candidate_list_substance": "Lead",
    "cas_number": "7439-92-1",
    "safe_use": "The identification of the Candidate List substance is sufficient to allow safe use of the article.",
    "scip_declaration_number": "22b5c075-11fc-41b0-ad60-dec034d8f30c"
  },
  "logistical_data": {
    "product_code": "4008321082114",
    "product_name": "XBO 1000 W/HS OFR",
    "packaging_unit": {
      "product_code": "4008321082114",
      "product_name": "XBO 1000 W/HS OFR",
      "length": 410,
      "width": 184,
      "height": 180,
      "volume": 13.58,
      "weight": 819.00
    }
  }
}
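The result is quite good. The objects and properties it found are accurate.
However, there is one obvious flaw: GPT ignored some key elements, such as benefits and product_family.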
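Omissions like this are easy to miss by eye once many documents are processed, so a quick automated check helps. This is only a sketch: the `EXPECTED_KEYS` set is an assumed list of top-level properties based on the hints in the prompt, and `first_attempt` is an abbreviated copy of the output above.

```python
# Assumed set of top-level properties we want in every product object.
EXPECTED_KEYS = {
    "name",
    "description",
    "applications",
    "benefits",
    "product_family",
    "technical_data",
}


def missing_keys(product, expected=EXPECTED_KEYS):
    """Return the expected top-level keys that are absent from product, sorted."""
    return sorted(expected - set(product))


# Abbreviated version of the first model output.
first_attempt = {
    "name": "XBO 1000 W/HS OFR",
    "description": "XBO for cinema projection | Xenon short-arc lamps 450…10,000 W",
    "applications": ["Sunlight simulation"],
    "technical_data": {"nominal_current": 50.0},
}

missing = missing_keys(first_attempt)
print(missing)
```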
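So what should we do? Switch to GPT-4, which offers stronger capabilities but costs more and responds more slowly? Or adjust our strategy to include function calling, which makes better use of resources while staying efficient?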
When working with GPT, function calling is my favorite feature. It lets us specify not only the functions GPT may call, but also the JSON parameters that our own functions require.
Here is an example of a function definition for a function call:
"function": {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. beijing",
            },
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}
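The latest models, gpt-3.5-turbo-0125 and gpt-4-turbo-preview, have been trained to detect when a function call should be made and to generate JSON output that conforms to the specified function signature.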
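The arguments of a function call come back from the model as a JSON string, so it is worth checking them against the declared parameters before calling our own code. The sketch below does this for the weather example; `check_arguments` is a hypothetical helper that only covers `required` and `enum`, not the full JSON Schema vocabulary.

```python
import json

# The parameter declaration from the weather example, as a Python dict.
WEATHER_FUNCTION = {
    "name": "get_current_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}


def check_arguments(function_spec, arguments_json):
    """Return a list of problems found in model-generated arguments."""
    arguments = json.loads(arguments_json)
    parameters = function_spec["parameters"]
    problems = []
    for name in parameters.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = parameters["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems


ok = check_arguments(WEATHER_FUNCTION, '{"location": "beijing", "unit": "celsius"}')
bad = check_arguments(WEATHER_FUNCTION, '{"unit": "kelvin"}')
print(ok)
print(bad)
```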
To take full advantage of this, we refine our prompt so that it includes the JSON schema we expect back.
You analyze product descriptions to export them into a JSON format. I will present you with a product data sheet and describe the individual JSON objects and properties with <<<. You then create a JSON object from another product data sheet.

>>> Example product:

Product family benefits <<< benefits (string[])
[..]

-----
Provide your JSON in the following schema:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "description": { "type": "string" },
    "applications": { "type": "array", "items": { "type": "string" } },
    "benefits": { "type": "array", "items": { "type": "string" } },
    "product_family": {
      "type": "object",
      "properties": {
        "temperature": { "type": "number" },
        "watts_min": { "type": "number" },
        "watts_max": { "type": "number" }
      }
    },
    "technical_data": {
      "type": "object",
      "properties": {
        "nominal_current": { "type": "number" },
        "control_range_min": { "type": "number" },
        "control_range_max": { "type": "number" },
        "nominal_wattage": { "type": "number" },
        "nominal_voltage": { "type": "number" }
      }
    },
    "dimensions": {
      "type": "object",
      "properties": {
        "diameter": { "type": "number" },
        "length": { "type": "number" },
        "length_base": { "type": "number" },
        "light_center_length": { "type": "number" },
        "electrode_gap": { "type": "number" },
        "weight": { "type": "number" }
      }
    },
    "operating_conditions": {
      "type": "object",
      "properties": {
        "max_temp": { "type": "string" },
        "lifespan": { "type": "number" },
        "service_lifetime": { "type": "number" }
      }
    },
    "logistical_data": {
      "type": "object",
      "properties": {
        "product_code": { "type": "string" },
        "product_name": { "type": "string" },
        "packaging_unit": {
          "type": "object",
          "properties": {
            "product_code": { "type": "string" },
            "product_name": { "type": "string" },
            "length": { "type": "number" },
            "width": { "type": "number" },
            "height": { "type": "number" },
            "volume": { "type": "number" },
            "weight": { "type": "number" }
          }
        }
      }
    }
  }
}
After adjusting our approach, let's look at the new output:
{
  "name": "XBO 1000 W/HS OFR",
  "description": "XBO for cinema projection | Xenon short-arc lamps 450…10,000 W",
  "applications": [
    "Classic 35 mm film projection",
    "Digital film and video projection",
    "Architectural and effect light (“Light Finger”)",
    "Sunlight simulation"
  ],
  "benefits": [
    "Short arc with very high luminance for brighter screen illumination",
    "Constant color temperature of 6,000 K throughout the entire lamp lifetime",
    "Easy to maintain",
    "High arc stability",
    "Instant light on screen thanks to hot restart function",
    "Wide dimming range"
  ],
  "product_family": {
    "temperature": 6000,
    "watts_min": 450,
    "watts_max": 10000
  },
  "technical_data": {
    "nominal_current": 50,
    "control_range_min": 30,
    "control_range_max": 55,
    "nominal_wattage": 1000.00,
    "nominal_voltage": 19.0
  },
  "dimensions": {
    "diameter": 40.0,
    "length": 235.0,
    "length_base": 205.00,
    "light_center_length": 95.0,
    "electrode_gap": 3.6,
    "weight": 255.00
  },
  "operating_conditions": {
    "max_temp": "230 °C",
    "lifespan": 2000,
    "service_lifetime": 3000
  },
  "logistical_data": {
    "product_code": "4008321082114",
    "product_name": "XBO 1000 W/HS OFR",
    "packaging_unit": {
      "product_code": "4008321082114",
      "product_name": "XBO 1000 W/HS OFR",
      "length": 410,
      "width": 184,
      "height": 180,
      "volume": 13.58,
      "weight": 819.00
    }
  }
}
This result looks great.
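Since the schema is now explicit, each reply can also be checked against it automatically. Here is a small sketch of such a check using only the standard library. The `validate` function is my own simplification: it covers just the `object`, `array`, `string` and `number` types used in the schema, and it treats every listed property as required, which is stricter than standard JSON Schema.

```python
def validate(value, schema, path="$"):
    """Return a list of mismatches between value and a small JSON Schema subset."""
    expected = schema.get("type")
    errors = []
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub_schema, f"{path}.{key}")
            else:
                # assumption: every listed property must be present
                errors.append(f"{path}.{key}: missing")
    elif expected == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        for index, item in enumerate(value):
            errors += validate(item, schema.get("items", {}), f"{path}[{index}]")
    elif expected == "string":
        if not isinstance(value, str):
            errors.append(f"{path}: expected string")
    elif expected == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{path}: expected number")
    return errors


# Excerpt of the schema from the prompt.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "benefits": {"type": "array", "items": {"type": "string"}},
        "operating_conditions": {
            "type": "object",
            "properties": {
                "max_temp": {"type": "string"},
                "lifespan": {"type": "number"},
            },
        },
    },
}

good = {
    "name": "XBO 1000 W/HS OFR",
    "benefits": ["Easy to maintain"],
    "operating_conditions": {"max_temp": "230 °C", "lifespan": 2000},
}
bad = {
    "name": "XBO 1000 W/HS OFR",
    "operating_conditions": {"max_temp": 230, "lifespan": 2000},
}

good_errors = validate(good, schema)
bad_errors = validate(bad, schema)
print(good_errors)
print(bad_errors)
```

The second example mirrors the difference between the two runs: the first output had no benefits and a numeric max_temp, while the schema asks for a string.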