This article walks through installing and configuring GraphRAG, covering installation both via pip and from source. The source-install section demonstrates cloning the repository, setting up a Python environment, and installing dependencies with poetry. It then shows how to use the Doubao (豆包) models to build an index and run local and global searches, how to install Neo4j via Docker to visualize the knowledge graph, and finally how to customize the prompts for a specific domain so GraphRAG adapts better to different applications.
GraphRAG can be installed directly with pip:
pip install graphrag
However, using the prebuilt pip package makes it inconvenient to modify the source code, so this guide installs from source; everything below assumes a source installation.
Clone the source:
git clone https://github.com/microsoft/graphrag.git
Install a Python environment (skip if you already have one; pyenv is used here to manage versions):
# pyenv install command for macOS; look up the equivalent for other systems
brew install pyenv
# configure environment variables
export PATH="$HOME/.pyenv/bin:$PATH"
eval "$(pyenv init --path)"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
# Python 3.12 failed when building the poetry environment; 3.10-3.11 is recommended
pyenv install 3.11.9
pyenv global 3.11.9
Install poetry:
# poetry install command for macOS; look up the equivalent for other systems
brew install poetry
# enter the repository
cd graphrag
# if you use pyenv, point poetry at the current pyenv Python version
poetry env use $(pyenv which python)
# install dependencies
poetry install
poetry shell
Create a project folder:
mkdir Q
Initialize folder Q:
poetry run poe index --init --root Q
# non-source install equivalent: python -m graphrag.index --init --root Q
The contents of Q after initialization:
Q
├── .env
├── output
│ └── 20240726-142724
│ └── reports
│ └── indexing-engine.log
├── prompts
│ ├── claim_extraction.txt
│ ├── community_report.txt
│ ├── entity_extraction.txt
│ └── summarize_descriptions.txt
└── settings.yaml
Next, create cache and input folders, and place the documents to be indexed into input; files in input must be plain txt.
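For example (a minimal sketch, assuming the novel 阿Q正传.txt sits in the current directory):
mkdir Q/cache Q/input
cp 阿Q正传.txt Q/input/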
Q
├── .env
├── cache
├── input
│ └── 阿Q正传.txt
├── output
│ └── 20240726-142724
│ └── reports
│ └── indexing-engine.log
├── prompts
│ ├── claim_extraction.txt
│ ├── community_report.txt
│ ├── entity_extraction.txt
│ └── summarize_descriptions.txt
└── settings.yaml
Next, set the API key in .env and edit the settings.yaml configuration file.
I use the Doubao (豆包) model family on Volcano Engine (火山引擎); both its chat model and its embedding model expose an OpenAI-compatible API.
In settings.yaml, only the api_base and model fields under llm and embeddings need to change; adjust tokens_per_minute and requests_per_minute to match your actual quota.
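The .env file created during initialization only needs the key that settings.yaml references via ${GRAPHRAG_API_KEY}; a one-line sketch with a placeholder value:
GRAPHRAG_API_KEY=<your Volcano Ark API key>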
A reference settings.yaml:
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: <Doubao model id>
  model_supports_json: false # recommended if this is available for your model.
  max_tokens: 4000
  request_timeout: 180.0
  api_base: https://ark.cn-beijing.volces.com/api/v3/
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 800_000 # set a leaky bucket throttle
  requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: <Doubao model id>
    api_base: https://ark.cn-beijing.volces.com/api/v3/
    encoding_format: float
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 1 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
Modify the code
graphrag/llm/openai/openai_embeddings_llm.py
In _execute_llm, add "encoding_format": "float" to the args dict; Doubao reports an error when this parameter is missing:
async def _execute_llm(
    self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
) -> EmbeddingOutput | None:
    args = {
        "model": self.configuration.model,
        # Doubao's embedding endpoint requires encoding_format to be set explicitly
        "encoding_format": "float",
        **(kwargs.get("model_parameters") or {}),
    }
    embedding = await self.client.embeddings.create(
        input=input,
        **args,
    )
    return [d.embedding for d in embedding.data]
Build the index:
poetry run poe index --root Q
# non-source install equivalent: python -m graphrag.index --root Q
This step can be slow, and large documents consume a lot of tokens, so keep an eye on your quota.
When the run finishes with "All workflows completed successfully.", the index has been built.
Because Doubao embeddings are used, local search also requires modifying _embed_with_retry in graphrag/query/llm/oai/embedding.py (around line 121):
def _embed_with_retry(
    self, text: str | tuple, **kwargs: Any
) -> tuple[list[float], int]:
    try:
        retryer = Retrying(
            stop=stop_after_attempt(self.max_retries),
            wait=wait_exponential_jitter(max=10),
            reraise=True,
            retry=retry_if_exception_type(self.retry_error_types),
        )
        for attempt in retryer:
            if isinstance(text, tuple):
                text = [str(i) for i in text]
            with attempt:
                embedding = (
                    self.sync_client.embeddings.create(  # type: ignore
                        input=text,
                        model=self.model,
                        encoding_format="float",
                        **kwargs,  # type: ignore
                    )
                    .data[0]
                    .embedding
                    or []
                )
                return (embedding, len(text))
    except RetryError as e:
        self._reporter.error(
            message="Error at embed_with_retry()",
            details={self.__class__.__name__: str(e)},
        )
        return ([], 0)
    else:
        # TODO: why not just throw in this case?
        return ([], 0)
poetry run poe query --root Q --method local "阿Q的主要经历有哪些"
# non-source install equivalent: python -m graphrag.query --root Q --method local '阿Q的主要经历有哪些'
SUCCESS: Local Search Response:
**1. Entanglement with Master Zhao** Ah Q claims to be a kinsman of Master Zhao, yet is scolded and beaten by him. Master Zhao's shifting attitude toward Ah Q reflects his status and authority. [Data: Relationships (0)]
**2. Conflict with the Fake Foreign Devil** The Fake Foreign Devil restricts Ah Q's movements and strikes him, and Ah Q harbors resentment toward him. [Data: Relationships (3)]
**3. The uproar over his advance on Amah Wu** Ah Q abruptly asks Amah Wu to sleep with him, which provokes a strong reaction from her and causes quite a stir in Weizhuang. [Data: Relationships (2)]
**4. Time in the Zhao household** Ah Q hulls rice for the Zhao family and witnesses their house being robbed, after which his view of the family changes. [Data: Relationships (8)]
**5. Thoughts on the revolution and his fate** Ah Q has ideas about the revolution and declares he will take part; his fate is also shaped by the decisions of others, and his revolutionary aspirations are never realized. [Data: Relationships (22)]
poetry run poe query --root Q --method global "这篇文章主要揭示了什么"
# non-source install equivalent: python -m graphrag.query --root Q --method global '这篇文章主要揭示了什么'
SUCCESS: Global Search Response:
**1. Character relationships**
The text mainly reveals the complex relationships among the people of Weizhuang, for example the interactions between Ah Q and Master Zhao, the Fake Foreign Devil, Amah Wu, and others [Data: (5, 6, 1, +more)].
**2. Influence of place**
It shows how Weizhuang, as the story's central setting, shapes the characters' fates and behavior through its environment and atmosphere [Data: (6, +more)].
**3. Participation in the revolution**
It touches on the roles certain characters play in the revolution, such as the provincial graduate (举人老爷) and the budding scholar (秀才) [Data: (1, +more)].
Install Neo4j with Docker:
docker run \
-p 7474:7474 -p 7687:7687 \
--name neo4j-apoc \
-e NEO4J_apoc_export_file_enabled=true \
-e NEO4J_apoc_import_file_enabled=true \
-e NEO4J_apoc_import_file_use__neo4j__config=true \
-e NEO4J_PLUGINS=\[\"apoc\"\] \
neo4j:latest
Visit http://localhost:7474/browser/ and change the password.
Initial username: neo4j
Initial password: neo4j
Update the database connection parameters and the output path, then run the following code:
import pandas as pd
from neo4j import GraphDatabase
import time

NEO4J_URI = "neo4j://localhost"  # or neo4j+s://xxxx.databases.neo4j.io
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "12345678"
NEO4J_DATABASE = "neo4j"
GRAPHRAG_FOLDER = "./output/20240724-151213/artifacts"

# Create a Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Create uniqueness constraints for the node labels used below
statements = """
create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique;
create constraint document_id if not exists for (d:__Document__) require d.id is unique;
create constraint entity_id if not exists for (c:__Community__) require c.community is unique;
create constraint entity_id if not exists for (e:__Entity__) require e.id is unique;
create constraint entity_title if not exists for (e:__Entity__) require e.name is unique;
create constraint entity_title if not exists for (e:__Covariate__) require e.title is unique;
create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique;
""".split(";")

for statement in statements:
    if len((statement or "").strip()) > 0:
        print(statement)
        driver.execute_query(statement)


def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.

    Parameters: statement is the Cypher query to execute, df is the dataframe to import,
    and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start: min(start + batch_size, total)]
        result = driver.execute_query("UNWIND $rows AS value " + statement,
                                      rows=batch.to_dict('records'),
                                      database_=NEO4J_DATABASE)
        print(result.summary.counters)
    print(f'{total} rows in {time.time() - start_s} s.')
    return total


doc_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_documents.parquet',
                         columns=["id", "title"])
doc_df.head(2)

# import documents
statement = """
MERGE (d:__Document__ {id:value.id})
SET d += value {.title}
"""
batched_import(statement, doc_df)

text_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_text_units.parquet',
                          columns=["id", "text", "n_tokens", "document_ids"])
text_df.head(2)

# import text units (chunks) and link them to their documents
statement = """
MERGE (c:__Chunk__ {id:value.id})
SET c += value {.text, .n_tokens}
WITH c, value
UNWIND value.document_ids AS document
MATCH (d:__Document__ {id:document})
MERGE (c)-[:PART_OF]->(d)
"""
batched_import(statement, text_df)

entity_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_entities.parquet',
                            columns=["name", "type", "description", "human_readable_id", "id",
                                     "description_embedding", "text_unit_ids"])
entity_df.head(2)

# import entities, their embeddings, and links to the chunks they appear in
entity_statement = """
MERGE (e:__Entity__ {id:value.id})
SET e += value {.human_readable_id, .description, name:replace(value.name,'"','')}
WITH e, value
CALL db.create.setNodeVectorProperty(e, "description_embedding", value.description_embedding)
CALL apoc.create.addLabels(e, case when coalesce(value.type,"") = "" then []
    else [apoc.text.upperCamelCase(replace(value.type,'"',''))] end) yield node
UNWIND value.text_unit_ids AS text_unit
MATCH (c:__Chunk__ {id:text_unit})
MERGE (c)-[:HAS_ENTITY]->(e)
"""
batched_import(entity_statement, entity_df)

rel_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_relationships.parquet',
                         columns=["source", "target", "id", "rank", "weight", "human_readable_id",
                                  "description", "text_unit_ids"])
rel_df.head(2)

# import relationships between entities
rel_statement = """
MATCH (source:__Entity__ {name:replace(value.source,'"','')})
MATCH (target:__Entity__ {name:replace(value.target,'"','')})
// not necessary to merge on id as there is only one relationship per pair
MERGE (source)-[rel:RELATED {id: value.id}]->(target)
SET rel += value {.rank, .weight, .human_readable_id, .description, .text_unit_ids}
RETURN count(*) as createdRels
"""
batched_import(rel_statement, rel_df)

community_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_communities.parquet',
                               columns=["id", "level", "title", "text_unit_ids", "relationship_ids"])
community_df.head(2)

# import communities and attach their member entities
statement = """
MERGE (c:__Community__ {community:value.id})
SET c += value {.level, .title}
/*
UNWIND value.text_unit_ids as text_unit_id
MATCH (t:__Chunk__ {id:text_unit_id})
MERGE (c)-[:HAS_CHUNK]->(t)
WITH distinct c, value
*/
WITH *
UNWIND value.relationship_ids as rel_id
MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)
MERGE (start)-[:IN_COMMUNITY]->(c)
MERGE (end)-[:IN_COMMUNITY]->(c)
RETURN count(distinct c) as createdCommunities
"""
batched_import(statement, community_df)

community_report_df = pd.read_parquet(f'{GRAPHRAG_FOLDER}/create_final_community_reports.parquet',
                                      columns=["id", "community", "level", "title", "summary", "findings",
                                               "rank", "rank_explanation", "full_content"])
community_report_df.head(2)

# import community reports and their findings
community_statement = """
MATCH (c:__Community__ {community: value.community})
SET c += value {.level, .title, .rank, .rank_explanation, .full_content, .summary}
WITH c, value
UNWIND range(0, size(value.findings)-1) AS finding_idx
WITH c, value, finding_idx, value.findings[finding_idx] as finding
MERGE (c)-[:HAS_FINDING]->(f:Finding {id: finding_idx})
SET f += finding
"""
batched_import(community_statement, community_report_df)
Once the import succeeds, go back to http://localhost:7474/browser/ to explore the graph.
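For example, an illustrative Cypher query over the labels created by the import script above (run it in the browser) displays part of the entity graph:
MATCH (n:__Entity__)-[r:RELATED]->(m:__Entity__) RETURN n, r, m LIMIT 50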
You can use the built-in prompt tuning command to generate prompts for the domain you care about:
poetry run poe prompt_tune --root aQ --domain "a software engineering code" --method random --limit 2 --chunk-size 500 --output prompt-project
# non-source install equivalent: python -m graphrag.prompt_tune --root aQ --domain 'a software engineering code' --method random --limit 2 --chunk-size 500 --output prompt-project
root - the project root containing the settings yaml and the input files
domain - the domain to adapt the prompts to
method - how documents are selected as tuning references; one of all, random, or top
limit - the number of files to load when method is random or top
max-tokens - the maximum number of tokens to use when generating prompts
chunk-size - the chunk size
language - the language to adapt the prompts to
no-entity-type - use untyped entity extraction
output - where to write the generated prompts; without it, the default prompts are overwritten
It mainly generates three prompt files:
community_report.txt
entity_extraction.txt
summarize_descriptions.txt
Edit the settings file so it points to the generated prompts, and update the entity types to extract:
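A minimal sketch of the corresponding settings.yaml changes, assuming the tuned prompts sit under prompt-project inside the project root (the entity types are illustrative, for a software-engineering domain):
entity_extraction:
  prompt: "prompt-project/entity_extraction.txt"
  entity_types: [module, class, function, library]
  max_gleanings: 1
summarize_descriptions:
  prompt: "prompt-project/summarize_descriptions.txt"
  max_length: 500
community_reports:
  prompt: "prompt-project/community_report.txt"
  max_length: 2000
  max_input_length: 8000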
Possibly because of the model, the generated prompts did not quite match what I wanted, so I hand-tuned the default prompts instead.
The main file to edit is entity_extraction.txt: rewrite the -Goal- section for your own domain, then update the entity_type list in -Steps-. Feed the -Steps- prompt and the reference examples from the default file to GPT, have it generate a few examples that fit your domain, and add them in.
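As a rough illustrative sketch only (not the verbatim default file; the bracketed pieces are placeholders), the edits amount to something like:
-Goal-
Given a text document about <your domain, e.g. software engineering> and a list of entity types, identify all entities of those types and all relationships among the identified entities.
-Steps-
1. Identify all entities. For each, extract entity_name, entity_type (one of: [module, class, function, library]) and entity_description.
...
<followed by a few GPT-generated examples from your own domain, in the same format as the original examples>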
The other two prompts can be reworked with GPT in the same way.