For the model provider you can use Ollama, Xinference, or similar, or point GraphRAG at a locally hosted OpenAI-compatible API of your own.
GraphRAG can be set up quickly without cloning the official project; installing the package is enough:

```shell
pip install graphrag
```
Then create a working folder locally and initialize a workspace:

```shell
mkdir my_graphrag
cd my_graphrag
python -m graphrag.index --init --root ./ragtest
```
After it runs successfully, the directory layout looks like this:
Next you will see a settings.yaml file; open it and edit the configuration directly. Note that we are using Xinference, so remember to capitalize the first letter: many blog posts write it lowercase, which can be a pitfall, so be careful here.
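For reference, the two sections we touch are `llm` and `embeddings`. A minimal sketch of what they might look like (the `api_base` assumes Xinference's default port 9997, and the model names are placeholders for whatever you deployed, not values from this tutorial):

```yaml
llm:
  api_key: ${GRAPHRAG_API_KEY}   # any non-empty value works for a local server
  type: openai_chat
  model: your-chat-model         # the model name registered in Xinference
  model_supports_json: true
  api_base: http://localhost:9997/v1

embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: your-embedding-model
    api_base: http://localhost:9997/v1
```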
We only need to modify these two sections; everything else can be customized to taste.
Next, drop a txt file into the input folder (we don't use the official sample; just create the folder ourselves):

```shell
mkdir ./ragtest/input
```
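Any plain-text file works. For example (the path follows the `--root` used above; the file name and contents here are arbitrary):

```python
from pathlib import Path

# Write a small sample text into the workspace's input folder.
input_dir = Path("./ragtest/input")
input_dir.mkdir(parents=True, exist_ok=True)
(input_dir / "book.txt").write_text("Marley was dead: to begin with.", encoding="utf-8")
```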
Now run the indexing step:

```shell
python -m graphrag.index --root ./ragtest
```
Entity extraction can be slow, depending on your network, so wait a while until you see output like the screenshot below showing a normal run:
If you hit an error at create_base_entity_graph, you need to patch a source file:
① anaconda -> envs -> bzp_graphrag (whatever you named your environment)
② /lib/python3.11/site-packages/graphrag/llm/openai/openai_chat_llm.py
Find the _invoke_json() method under the OpenAIChatLLM class, at around line 60.
Then locate the line below, which is where things go wrong. It means: if kwargs.get("is_response_valid") returns something truthy, use it as the validator; otherwise fall back to (lambda _x: True). Since we are not using OpenAI, kwargs always does contain "is_response_valid", but the validator it supplies returns False for our model's responses every time.

```python
is_response_valid = kwargs.get("is_response_valid") or (lambda _x: True)
```
So we simply drop the kwargs['is_response_valid'] entry. Change the code above as follows, so the fallback lambda (which always returns True) is used every time:

```python
if 'is_response_valid' in kwargs:
    del kwargs['is_response_valid']
is_response_valid = kwargs.get("is_response_valid") or (lambda _x: True)
```
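To see why deleting the key works, here is a minimal standalone sketch of the same pattern (`invoke_json` is an illustrative stand-in, not GraphRAG's actual function):

```python
def invoke_json(result, **kwargs):
    # Same pattern as in openai_chat_llm.py: fall back to an always-true
    # validator only when none was passed in.
    is_response_valid = kwargs.get("is_response_valid") or (lambda _x: True)
    return is_response_valid(result)

# A strict validator rejects a non-JSON reply from a local model:
strict = {"is_response_valid": lambda r: r.startswith("{")}
rejected = invoke_json("not json", **strict)

# After removing the key (the patch above), the default accepts it:
strict.pop("is_response_valid", None)
accepted = invoke_json("not json", **strict)
```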
After the patch, re-run the indexing command; once it succeeds we can query the model.
Run a global query:

```shell
python -m graphrag.query \
--root ./ragtest \
--method global \
"What are the top themes in this story?"
```
Run a local query:

```shell
python -m graphrag.query \
--root ./ragtest \
--method local \
"Who is Scrooge, and what are his main relationships?"
```
These normally run and return an answer, so no demo here.
Next we prepare a script that converts every .parquet file under output/xxxx/artifacts/ to CSV:
```python
import csv
import os

import pandas as pd

parquet_dir = 'graphrag_cs_glm/ragtest/output/20240729-141251/artifacts'
csv_dir = 'neo4j-community-4.3.5/import'


def clean_quotes(value):
    """Strip stray quotes, then re-quote values that contain commas."""
    if isinstance(value, str):
        value = value.strip().replace('""', '"').replace('"', '')
        if ',' in value or '"' in value:
            value = f'"{value}"'
    return value


for file_name in os.listdir(parquet_dir):
    if file_name.endswith('.parquet'):
        parquet_file = os.path.join(parquet_dir, file_name)
        csv_file = os.path.join(csv_dir, file_name.replace('.parquet', '.csv'))

        df = pd.read_parquet(parquet_file)

        # Clean up string columns so the CSV quoting stays consistent.
        for column in df.select_dtypes(include=['object']).columns:
            df[column] = df[column].apply(clean_quotes)

        df.to_csv(csv_file, index=False, quoting=csv.QUOTE_NONNUMERIC)
        print(f'Converted {parquet_file} to {csv_file}')
print('All parquet files have been converted to CSV.')
```
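To make the quoting behavior concrete, here is the clean_quotes() helper on its own with a few typical inputs (a standalone sketch of the function above):

```python
def clean_quotes(value):
    # Strip stray quotes, then re-quote values that contain commas.
    if isinstance(value, str):
        value = value.strip().replace('""', '"').replace('"', '')
        if ',' in value or '"' in value:
            value = f'"{value}"'
    return value

examples = ['say ""hello""', 'a, b', 42]
cleaned = [clean_quotes(v) for v in examples]
```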
Replace the paths with your own Neo4j import directory as needed.
Next comes the Neo4j side.
In the conf directory, uncomment the import line in neo4j.conf:

```
dbms.directories.import=import
```
Now open Neo4j and run the following commands:
```cypher
// 1. Import Documents
LOAD CSV WITH HEADERS FROM 'file:///create_final_documents.csv' AS row
CREATE (d:Document {
  id: row.id,
  title: row.title,
  raw_content: row.raw_content,
  text_unit_ids: row.text_unit_ids
});

// 2. Import Text Units
LOAD CSV WITH HEADERS FROM 'file:///create_final_text_units.csv' AS row
CREATE (t:TextUnit {
  id: row.id,
  text: row.text,
  n_tokens: toFloat(row.n_tokens),
  document_ids: row.document_ids,
  entity_ids: row.entity_ids,
  relationship_ids: row.relationship_ids
});

// 3. Import Entities
LOAD CSV WITH HEADERS FROM 'file:///create_final_entities.csv' AS row
CREATE (e:Entity {
  id: row.id,
  name: row.name,
  type: row.type,
  description: row.description,
  human_readable_id: toInteger(row.human_readable_id),
  text_unit_ids: row.text_unit_ids
});

// 4. Import Relationships
LOAD CSV WITH HEADERS FROM 'file:///create_final_relationships.csv' AS row
CREATE (r:Relationship {
  source: row.source,
  target: row.target,
  weight: toFloat(row.weight),
  description: row.description,
  id: row.id,
  human_readable_id: row.human_readable_id,
  source_degree: toInteger(row.source_degree),
  target_degree: toInteger(row.target_degree),
  rank: toInteger(row.rank),
  text_unit_ids: row.text_unit_ids
});

// 5. Import Nodes
LOAD CSV WITH HEADERS FROM 'file:///create_final_nodes.csv' AS row
CREATE (n:Node {
  id: row.id,
  level: toInteger(row.level),
  title: row.title,
  type: row.type,
  description: row.description,
  source_id: row.source_id,
  community: row.community,
  degree: toInteger(row.degree),
  human_readable_id: toInteger(row.human_readable_id),
  size: toInteger(row.size),
  entity_type: row.entity_type,
  top_level_node_id: row.top_level_node_id,
  x: toInteger(row.x),
  y: toInteger(row.y)
});

// 6. Import Communities
LOAD CSV WITH HEADERS FROM 'file:///create_final_communities.csv' AS row
CREATE (c:Community {
  id: row.id,
  title: row.title,
  level: toInteger(row.level),
  raw_community: row.raw_community,
  relationship_ids: row.relationship_ids,
  text_unit_ids: row.text_unit_ids
});

// 7. Import Community Reports
LOAD CSV WITH HEADERS FROM 'file:///create_final_community_reports.csv' AS row
CREATE (cr:CommunityReport {
  id: row.id,
  community: row.community,
  full_content: row.full_content,
  level: toInteger(row.level),
  rank: toFloat(row.rank),
  title: row.title,
  rank_explanation: row.rank_explanation,
  summary: row.summary,
  findings: row.findings,
  full_content_json: row.full_content_json
});

// 8. Create indexes for better performance
CREATE INDEX FOR (d:Document) ON (d.id);
CREATE INDEX FOR (t:TextUnit) ON (t.id);
CREATE INDEX FOR (e:Entity) ON (e.id);
CREATE INDEX FOR (r:Relationship) ON (r.id);
CREATE INDEX FOR (n:Node) ON (n.id);
CREATE INDEX FOR (c:Community) ON (c.id);
CREATE INDEX FOR (cr:CommunityReport) ON (cr.id);

// 9. Create relationships after all nodes are imported
MATCH (d:Document)
UNWIND split(d.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (d)-[:HAS_TEXT_UNIT]->(t);

MATCH (t:TextUnit)
UNWIND split(t.entity_ids, ',') AS entityId
MATCH (e:Entity {id: trim(entityId)})
CREATE (t)-[:HAS_ENTITY]->(e);

MATCH (t:TextUnit)
UNWIND split(t.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (t)-[:HAS_RELATIONSHIP]->(r);

MATCH (e:Entity)
UNWIND split(e.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (e)-[:MENTIONED_IN]->(t);

MATCH (r:Relationship)
MATCH (source:Entity {name: r.source})
MATCH (target:Entity {name: r.target})
CREATE (source)-[:RELATES_TO]->(target);

MATCH (r:Relationship)
UNWIND split(r.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (r)-[:MENTIONED_IN]->(t);

MATCH (c:Community)
UNWIND split(c.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (c)-[:HAS_RELATIONSHIP]->(r);

MATCH (c:Community)
UNWIND split(c.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (c)-[:HAS_TEXT_UNIT]->(t);

MATCH (cr:CommunityReport)
MATCH (c:Community {id: cr.community})
CREATE (cr)-[:REPORTS_ON]->(c);
```
You can also have a large model generate one of these for you; paste the statements into the Neo4j browser page and run them:
I have already run it here, so I won't run it again. If an id error appears, you can delete section 8 (the index statements) and re-run.
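Errors like that can also come from a column-name mismatch between the CSV files and the Cypher statements. Before re-running LOAD CSV you can sanity-check the CSV headers against the columns the statements reference (a sketch; the expected map here only covers two of the seven files):

```python
import csv
from pathlib import Path

# Columns referenced by the Cypher statements (subset shown for illustration).
expected = {
    "create_final_documents.csv": {"id", "title", "raw_content", "text_unit_ids"},
    "create_final_entities.csv": {"id", "name", "type", "description",
                                  "human_readable_id", "text_unit_ids"},
}

def missing_columns(import_dir):
    """Return {file: missing columns} for files absent or lacking columns."""
    problems = {}
    for name, cols in expected.items():
        path = Path(import_dir) / name
        if not path.exists():
            problems[name] = cols
            continue
        with path.open(newline='', encoding='utf-8') as f:
            header = set(next(csv.reader(f)))
        if cols - header:
            problems[name] = cols - header
    return problems
```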
That's all for this walkthrough. I have also been experimenting with prompt adjustments recently; discussion is welcome.