chatting-with-the-knowledge-graph
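Free Neo4j & LLMs Courses (official free course on knowledge graphs for large language models)
We extracted some information from the raw data, enriched it with embeddings, and then expanded it into a graph:
Next, as in the process in section 1.1, we keep applying the same pattern.
You can keep repeating this process as long as it is relevant to the kinds of questions you want to answer:
Both companies and managers have address strings.
Managers who own company stock filed forms, which were split into chunks for processing. Now both managers and companies are connected to addresses.
With addresses in the graph, you can ask some interesting questions:
This is the schema of the graph used in this notebook, where we explore the knowledge graph further. We first explore the graph directly with Cypher queries, then use LangChain to build a question-answering chat, and finally use an LLM to combine the two techniques.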
from dotenv import load_dotenv
import os
import textwrap

# Langchain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

# Warning control
import warnings
warnings.filterwarnings("ignore")
Define some global variables needed by the queries
# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE') or 'neo4j'
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

# Note the code below is unique to this course environment, and not a
# standard part of Neo4j's integration with OpenAI. Remove if running
# in your own environment.
OPENAI_ENDPOINT = os.getenv('OPENAI_BASE_URL') + '/embeddings'

# Global constants
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'
Create a Neo4j graph instance
kg = Neo4jGraph(
url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)
You will use an updated graph that also includes the address information discussed in the video
kg.refresh_schema()
print(textwrap.fill(kg.schema, 60))
# Match the pattern of a manager located at an address
# then return the manager and the address
kg.query("""
MATCH (mgr:Manager)-[:LOCATED_AT]->(addr:Address)
RETURN mgr, addr
LIMIT 1
""")
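The result shows that LAKEWOOD CAPITAL MANAGEMENT, LP is located in New York. The address node carries some added information: besides the city and state, it has a property called location that stores a POINT value holding the latitude and longitude. With latitude and longitude available, you can run geospatial operations such as distance and nearby searches.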
# Full-text search for a manager whose name matches "royal bank"
kg.query("""
CALL db.index.fulltext.queryNodes(
"fullTextManagerNames",
"royal bank") YIELD node, score
RETURN node.managerName, score LIMIT 1
""")
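- So the full name of the manager matched by `royal bank` is `Royal Bank of Canada`
- The score is the full-text search score. Its value range differs from that of vector search scores, but the idea is the same: the higher the score, the better the match.
The previous two queries can be combined:
# Find Royal Bank: locate the bank via full-text search
# Then find its location: follow the manager to its address and return those values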
kg.query("""
CALL db.index.fulltext.queryNodes(
"fullTextManagerNames",
"royal bank"
) YIELD node, score
WITH node as mgr LIMIT 1
MATCH (mgr:Manager)-[:LOCATED_AT]->(addr:Address)
RETURN mgr.managerName, addr
""")
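# First match the pattern of a manager located at an address
# Then return the state name in the results
# and count through aggregation:
# count how many times each state appears, which gives the number of managers there
# Then sort in descending order and limit to 10 results: the top 10 states and the number of managers in each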
kg.query("""
MATCH p=(:Manager)-[:LOCATED_AT]->(address:Address)
RETURN address.state as state, count(address.state) as numManagers
ORDER BY numManagers DESC
LIMIT 10
""")
The same approach as above, this time for companies:
kg.query("""
MATCH p=(:Company)-[:LOCATED_AT]->(address:Address)
RETURN address.state as state, count(address.state) as numCompanies
ORDER BY numCompanies DESC
""")
Both managers and companies are most numerous in California, so let's dig further into California.
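To make the aggregation in the state queries concrete, here is a minimal pure-Python sketch of what counting per state, sorting in descending order, and limiting the result amounts to. The rows and the `top_states` helper are made-up placeholders for illustration, not data or code from the course graph.

```python
from collections import Counter

# Hypothetical (manager, state) rows standing in for the matched
# (:Manager)-[:LOCATED_AT]->(:Address) patterns; not real graph data.
rows = [
    ("manager_a", "StateX"),
    ("manager_b", "StateY"),
    ("manager_c", "StateX"),
    ("manager_d", None),      # an address without a state
    ("manager_e", "StateX"),
    ("manager_f", "StateY"),
]

def top_states(rows, limit=10):
    """Count rows per state and return the `limit` most frequent, descending.

    Rows whose state is None are skipped, mirroring how Cypher's
    count(expression) ignores null values.
    """
    counts = Counter(state for _, state in rows if state is not None)
    return counts.most_common(limit)

top = top_states(rows, limit=2)
print(top)
```

The `limit` argument plays the role of `LIMIT` in the Cypher query, and `most_common` combines the `ORDER BY ... DESC` step with the cut-off.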
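# Match the key pattern: a manager located at an address
# while restricting the address's state to California
# Then return the address's city (a city in California) and count the managers per city
# Sort descending and take the top 10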
kg.query("""
MATCH p=(:Manager)-[:LOCATED_AT]->(address:Address)
WHERE address.state = 'California'
RETURN address.city as city, count(address.city) as numManagers
ORDER BY numManagers DESC
LIMIT 10
""")
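The results show something of a rivalry between Northern and Southern California: managers cluster in the north and in the south, and the rest are scattered across other places.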
kg.query("""
MATCH p=(:Company)-[:LOCATED_AT]->(address:Address)
WHERE address.state = 'California'
RETURN address.city as city, count(address.city) as numCompanies
ORDER BY numCompanies DESC
""")
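Comparing the results of sections 2.2.5 and 2.2.6 shows that companies and managers are concentrated in different cities. Note that San Francisco has many companies.
# For managers located at an address, the pattern is the same as before
# Restrict the city to San Francisco
# Besides the manager name (mgr.managerName), sum the value property of every OWNS_STOCK_IN relationship, whatever companies the manager invested in, and call it totalInvestmentValue: sum(owns.value)
# Then return the top 10 results in descending order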
kg.query("""
MATCH p=(mgr:Manager)-[:LOCATED_AT]->(address:Address),
(mgr)-[owns:OWNS_STOCK_IN]->(:Company)
WHERE address.city = "San Francisco"
RETURN mgr.managerName, sum(owns.value) as totalInvestmentValue
ORDER BY totalInvestmentValue DESC
LIMIT 10
""")
Many of these results turn out to be in Santa Clara.
# Match companies located at an address whose city is Santa Clara
kg.query("""
MATCH (com:Company)-[:LOCATED_AT]->(address:Address)
WHERE address.city = "Santa Clara"
RETURN com.companyName
""")
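Everything above explored the graph through explicit relationships. We can also find information based on location coordinates (latitude and longitude), because we added a geospatial index. This is a lot like vector search in two-dimensional space, except that it uses the distance between two points (for latitude/longitude points, a distance in meters) rather than cosine similarity.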
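As a rough illustration of how a distance between two latitude/longitude points can be computed, here is a minimal haversine (great-circle) sketch. The function names, the Earth radius constant, and the sample coordinates are assumptions chosen for illustration; the database's own distance function may differ in detail.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def is_near(p, q, max_meters=10_000):
    """True when points p and q, given as (lat, lon), are closer than max_meters."""
    return haversine_m(*p, *q) < max_meters

# Abstract sample points on the equator, not real addresses.
print(is_near((0.0, 0.0), (0.0, 0.05)))  # about 5.6 km apart
print(is_near((0.0, 0.0), (0.0, 1.0)))   # about 111 km apart
```

The `max_meters` threshold corresponds to the `< 10000` comparison used in the nearby-search queries.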
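# First match an address, which we call sc
# We want sc's city to be Santa Clara (the WHERE condition)
# Then the pattern: companies located at a company address
# The second WHERE takes the two locations, sc's location (Santa Clara) and the company address location (from the MATCH above), and passes them to point.distance, a distance function built into Cypher
# "Nearby" here means the distance is under 10000; the unit is meters
# Then return the name and address of each company that satisfies all the conditions (the address is the full one stored on the Company node itself, as listed in the filing)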
kg.query("""
MATCH (sc:Address)
WHERE sc.city = "Santa Clara"
MATCH (com:Company)-[:LOCATED_AT]->(comAddr:Address)
WHERE point.distance(sc.location, comAddr.location) < 10000
RETURN com.companyName, com.companyAddress
""")
kg.query("""
MATCH (address:Address)
WHERE address.city = "Santa Clara"
MATCH (mgr:Manager)-[:LOCATED_AT]->(managerAddress:Address)
WHERE point.distance(address.location,
managerAddress.location) < 10000
RETURN mgr.managerName, mgr.managerAddress
""")
Increasing the distance above, for example to 25 km, should bring up more companies.
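Rather than editing the literal distance each time, the radius could be passed as a Cypher parameter. The sketch below only builds the query text and its parameters; the names `NEARBY_COMPANIES_QUERY` and `nearby_companies_params` are hypothetical, and it assumes `kg.query` accepts a `params` dictionary.

```python
# Same pattern as the Santa Clara query, with the city and the radius
# replaced by the Cypher parameters $city and $maxMeters.
NEARBY_COMPANIES_QUERY = """
MATCH (sc:Address)
  WHERE sc.city = $city
MATCH (com:Company)-[:LOCATED_AT]->(comAddr:Address)
  WHERE point.distance(sc.location, comAddr.location) < $maxMeters
RETURN com.companyName, com.companyAddress
"""

def nearby_companies_params(city: str, radius_km: float) -> dict:
    """Build the parameter map, converting kilometers to meters."""
    if radius_km <= 0:
        raise ValueError("radius_km must be positive")
    return {"city": city, "maxMeters": radius_km * 1000}

params = nearby_companies_params("Santa Clara", 25)
print(params)

# With a live connection this would presumably be run as:
# kg.query(NEARBY_COMPANIES_QUERY, params=params)
```

Keeping the radius in kilometers on the Python side and converting once avoids mixing up meters and kilometers in the query text.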
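# Which investment firms are near Palo Aalto Networks?
# (the name is misspelled in the search string; full-text search still finds the company)
# First run a full-text search for the company, returning the node and its score
# Then use that node as the company and, via the LOCATED_AT pattern, retrieve the company's and the managers' addresses
# Then use WHERE with point.distance to keep only pairs of addresses less than 10 km apart, and return the matching manager nodes
# together with the distance (divided by 1000 and truncated to an integer, i.e. meters converted to km)
# Then sort by distance in ascending order and take the first 10
kg.query("""
  CALL db.index.fulltext.queryNodes(
         "fullTextCompanyNames",
         "Palo Aalto Networks"
         ) YIELD node, score
  WITH node as com
  MATCH (com)-[:LOCATED_AT]->(comAddress:Address),
    (mgr:Manager)-[:LOCATED_AT]->(mgrAddress:Address)
    WHERE point.distance(comAddress.location,
      mgrAddress.location) < 10000
  RETURN mgr,
    toInteger(point.distance(comAddress.location,
      mgrAddress.location) / 1000) as distanceKm
    ORDER BY distanceKm ASC
    LIMIT 10
""")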
# Have GPT-3.5 write the Cypher, using few-shot learning in the prompt.
# The template reads as follows:
# - Task: generate a Cypher statement to query a graph database.
# - Instructions: use only the relationship types and properties provided in
#   the schema; do not use any others.
# - Schema: the {schema} placeholder is where the knowledge graph's schema is
#   passed into the LLM prompt.
# - Note: as standard practice, give the LLM plenty of guidance: no
#   explanations or apologies, no answers to anything other than a request to
#   construct a Cypher statement, and no text besides the generated Cypher.
# - Examples: a few sample Cypher statements for particular questions (here
#   only one). The line after the # sign is the question in natural language,
#   "What investment firms are in San Francisco?", followed by the matching
#   Cypher: a manager located at an address, a WHERE clause restricting the
#   city string, and a RETURN of the manager's name.
# - The prompt ends with the question itself.
CYPHER_GENERATION_TEMPLATE = """
Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(mgrAddress:Address)
  WHERE mgrAddress.city = 'San Francisco'
RETURN mgr.managerName
The question is:
{question}"""
First we take the Cypher generation template and turn it into a Cypher generation prompt using the PromptTemplate class.
CYPHER_GENERATION_PROMPT = PromptTemplate(
input_variables=["schema", "question"],
template=CYPHER_GENERATION_TEMPLATE
)
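To show what filling the two placeholders amounts to, here is a minimal sketch that uses plain `str.format` as a stand-in for the prompt template class. The shortened `MINI_TEMPLATE`, the `fill_template` helper, and the abbreviated schema string are illustrative assumptions, not the course's actual prompt or schema.

```python
# A shortened stand-in for the full generation template (illustrative only).
MINI_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
Schema:
{schema}
The question is:
{question}"""

def fill_template(template: str, schema: str, question: str) -> str:
    """Substitute the {schema} and {question} placeholders."""
    return template.format(schema=schema, question=question)

filled = fill_template(
    MINI_TEMPLATE,
    schema="(:Manager)-[:LOCATED_AT]->(:Address)",  # abbreviated, made-up schema text
    question="What investment firms are in San Francisco?",
)
print(filled)
```

One thing to watch with this style of formatting: literal curly braces inside the template, such as a Cypher map literal in an example, would need to be doubled (`{{` and `}}`) so they are not mistaken for placeholders.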
Next, we create a new type of chain
cypherChain = GraphCypherQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=kg,
verbose=True,
cypher_prompt=CYPHER_GENERATION_PROMPT,
)
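For the LLM we use ChatOpenAI; the graph is the same knowledge graph we have been querying directly in Neo4j. Setting verbose to True makes the chain report what it is doing at each step. Finally, the Cypher prompt is the Cypher generation prompt we created above.
Then wrap the chain in a small helper function (textwrap.fill() just makes the printed output tidier; pprint() would also work):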
def prettyCypherChain(question: str) -> None:
    # Run the chain on the question and print the answer wrapped at 60 columns
    response = cypherChain.run(question)
    print(textwrap.fill(response, 60))
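If the wrapped text is needed as a value rather than printed, the wrapping step can be separated out. This is a small sketch; `wrap_answer` is a hypothetical helper name, and the sample text is a placeholder rather than a real chain response.

```python
import textwrap

def wrap_answer(text: str, width: int = 60) -> str:
    """Reflow text so that no line is longer than `width` characters."""
    return textwrap.fill(text, width=width)

sample = "word " * 40  # placeholder text, not an actual answer from the chain
wrapped = wrap_answer(sample)
print(wrapped)
```

Wrapping changes only the line breaks; the words themselves are left untouched.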
Next, try it out by asking the LLM about something we already showed it:
prettyCypherChain("What investment firms are in San Francisco?")
The chain first generated the Cypher statement, then sent that statement straight to Neo4j, and finally returned the result we wanted.
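The three steps can be sketched as a small pipeline with the LLM and the database replaced by stand-in functions. This is an illustration of the flow only, not the library's internal implementation; every function name and the stub data below are hypothetical.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]

def answer_question(
    question: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], Rows],
    summarize: Callable[[str, Rows], str],
) -> str:
    """Question -> Cypher -> query results -> final answer."""
    cypher = generate_cypher(question)   # step 1: the LLM writes a Cypher statement
    rows = run_query(cypher)             # step 2: the statement is run against the graph
    return summarize(question, rows)     # step 3: the rows are turned into an answer

# Stand-ins so the sketch runs without an LLM or a database.
def fake_generate(question: str) -> str:
    return "MATCH (n) RETURN n.name AS name LIMIT 1"

def fake_run(cypher: str) -> Rows:
    return [{"name": "placeholder"}]

def fake_summarize(question: str, rows: Rows) -> str:
    names = ", ".join(str(row["name"]) for row in rows)
    return f"{len(rows)} row(s): {names}"

result = answer_question("any question", fake_generate, fake_run, fake_summarize)
print(result)
```

Passing the three steps in as callables keeps the flow testable without network access.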
prettyCypherChain("What investment firms are in Menlo Park?")
Now try something we did not teach it: the few-shot example covered only investment firms (managers), not companies.
prettyCypherChain("What companies are in Santa Clara?")
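Although the few-shot example does not mention companies, the LLM picked up the graph's pattern from that example and the schema, and generated Cypher that matches a company located at an address whose city is Santa Clara, which is exactly what we asked for.
Earlier we did something different: rather than finding things located in a particular city, we used a distance calculation to find things near a city. Can the LLM work out how to write that on its own?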
prettyCypherChain("What investment firms are near Santa Clara?")
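Clearly not! All we taught it was a WHERE clause and how to return values from the match, so it needs to learn a bit more before it can answer this kind of question.
We can do that simply by changing the prompt and giving it more examples.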
# Improve on the previous prompt template
# by adding a new example for the LLM
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(mgrAddress:Address)
  WHERE mgrAddress.city = 'San Francisco'
RETURN mgr.managerName

# What investment firms are near Santa Clara?
MATCH (address:Address)
  WHERE address.city = "Santa Clara"
MATCH (mgr:Manager)-[:LOCATED_AT]->(managerAddress:Address)
  WHERE point.distance(address.location,
    managerAddress.location) < 10000
RETURN mgr.managerName, mgr.managerAddress

The question is:
{question}"""
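Once the Cypher generation template is updated, everything built from it must be updated too: rebuild the Cypher generation prompt from the new template, and reinitialize the Cypher chain so that it uses the new prompt.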
CYPHER_GENERATION_PROMPT = PromptTemplate(
input_variables=["schema", "question"],
template=CYPHER_GENERATION_TEMPLATE
)
cypherChain = GraphCypherQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=kg,
verbose=True,
cypher_prompt=CYPHER_GENERATION_PROMPT,
)
With everything updated, retry the question it could not answer a moment ago:
prettyCypherChain("What investment firms are near Santa Clara?")
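Nice, it looks like that worked.
We taught it how to compute a point-to-point distance in the WHERE clause. We want investment firms near Santa Clara, so the query first matches an address whose city is Santa Clara, then matches the pattern of a manager located at an address, and finally computes the distance between the two; anything under 10 km counts as nearby.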
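Since every new example means editing the whole template again, one option is to assemble the template from a list of (question, Cypher) pairs. The sketch below reuses the instruction text from the template; the `HEADER`, `FOOTER`, `EXAMPLES`, and `build_template` names are hypothetical helpers, not part of the course code.

```python
from typing import List, Tuple

HEADER = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

"""

FOOTER = """

The question is:
{question}"""

EXAMPLES: List[Tuple[str, str]] = [
    (
        "What investment firms are in San Francisco?",
        """MATCH (mgr:Manager)-[:LOCATED_AT]->(mgrAddress:Address)
  WHERE mgrAddress.city = 'San Francisco'
RETURN mgr.managerName""",
    ),
    (
        "What investment firms are near Santa Clara?",
        """MATCH (address:Address)
  WHERE address.city = "Santa Clara"
MATCH (mgr:Manager)-[:LOCATED_AT]->(managerAddress:Address)
  WHERE point.distance(address.location,
    managerAddress.location) < 10000
RETURN mgr.managerName, mgr.managerAddress""",
    ),
]

def build_template(examples: List[Tuple[str, str]]) -> str:
    """Join the header, one '# question' + Cypher block per example, and the footer."""
    blocks = [f"# {question}\n{cypher.strip()}" for question, cypher in examples]
    return HEADER + "\n\n".join(blocks) + FOOTER

template = build_template(EXAMPLES)
print(template)
```

Adding a further example then only requires appending a pair to `EXAMPLES` and rebuilding the prompt and the chain from the new template. Example Cypher that contains literal curly braces would need them doubled before being used in a format-style template.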
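The first data we loaded came from Item 1 of the Form 10-K filings, the section that describes what a company's business actually does. Can the LLM use it to tell us what a company does?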
# Keep extending the prompt template above
# The new Cypher example uses full-text search to find the company by name
# (full-text search still works when the name is not spelled exactly right)
# It then matches from that company, whose node we have renamed com:
# from the company that FILED some form, on to the Form,
# then along the SECTION relationship to a Chunk
# The WHERE clause restricts the section to Item 1 of the Form 10-K ('item1'), which leads to the first chunk of the set of chunks belonging to Item 1
# Finally it returns the chunk's text, which is what the LLM is given to actually answer the question
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(mgrAddress:Address)
  WHERE mgrAddress.city = 'San Francisco'
RETURN mgr.managerName

# What investment firms are near Santa Clara?
MATCH (address:Address)
  WHERE address.city = "Santa Clara"
MATCH (mgr:Manager)-[:LOCATED_AT]->(managerAddress:Address)
  WHERE point.distance(address.location,
    managerAddress.location) < 10000
RETURN mgr.managerName, mgr.managerAddress

# What does Palo Alto Networks do?
CALL db.index.fulltext.queryNodes(
       "fullTextCompanyNames",
       "Palo Alto Networks"
       ) YIELD node, score
WITH node as com
MATCH (com)-[:FILED]->(f:Form),
  (f)-[s:SECTION]->(c:Chunk)
WHERE s.f10kItem = "item1"
RETURN c.text

The question is:
{question}"""
With the prompt template rebuilt, recreate the chain and ask the question:
CYPHER_GENERATION_PROMPT = PromptTemplate(
input_variables=["schema", "question"],
template=CYPHER_GENERATION_TEMPLATE
)
cypherChain = GraphCypherQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=kg,
verbose=True,
cypher_prompt=CYPHER_GENERATION_PROMPT,
)
prettyCypherChain("What does Palo Alto Networks do?")
# Check the graph schema
kg.refresh_schema()
print(textwrap.fill(kg.schema, 60))
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(mgrAddress:Address)
  WHERE mgrAddress.city = 'San Francisco'
RETURN mgr.managerName

# What investment firms are near Santa Clara?
MATCH (address:Address)
  WHERE address.city = "Santa Clara"
MATCH (mgr:Manager)-[:LOCATED_AT]->(managerAddress:Address)
  WHERE point.distance(address.location,
    managerAddress.location) < 10000
RETURN mgr.managerName, mgr.managerAddress

# What does Palo Alto Networks do?
CALL db.index.fulltext.queryNodes(
       "fullTextCompanyNames",
       "Palo Alto Networks"
       ) YIELD node, score
WITH node as com
MATCH (com)-[:FILED]->(f:Form),
  (f)-[s:SECTION]->(c:Chunk)
WHERE s.f10kItem = "item1"
RETURN c.text

The question is:
{question}"""
# Update the prompt and reset the QA chain
CYPHER_GENERATION_PROMPT = PromptTemplate(
input_variables=["schema", "question"],
template=CYPHER_GENERATION_TEMPLATE
)
cypherChain = GraphCypherQAChain.from_llm(
ChatOpenAI(temperature=0),
graph=kg,
verbose=True,
cypher_prompt=CYPHER_GENERATION_PROMPT,
)
prettyCypherChain("<<REPLACE WITH YOUR QUESTION>>")