赞
踩
微软在今年4月份的时候提出了GraphRAG的概念,然后在上周开源了GraphRAG,Github链接见https://github.com/microsoft/graphrag,截止当前,已有6900+Star。
官方推荐使用Python3.10-3.12版本,我使用Python3.10版本安装时,在初始化项目过程中会报错,切换到Python3.11版本后运行正常,推测是Python3.10与微软的一些最新的SDK不兼容。所以建议使用Python3.11的环境,安装GraphRAG比较简单,直接下面一行代码即可安装成功。
pip install graphrag
在这个教程中,我们使用马伯庸的《太白金星有点烦》这个短篇小说为例,测试下使用微软开源的GraphRAG的处理效果。
注意,GraphRAG是使用LLM来提取文本片段中的实体关系,因此耗费Token数较多,如果是个人调研使用,不建议使用GPT4级别的模型(费用太高,不差钱的大佬请忽略此条建议)。综合成本和效果,我这里使用的是DeepSeek-Chat模型。
我这边先创建了一个临时测试目录myTest,然后按照官方教程,在myTest目录下创建了input目录,并把《太白金星有点烦》这本书的txt版本重命名为book.txt后放到input目录下。然后调用python -m graphrag.index --init
进行初始化工作,生成一些配置文件。
mkdir ./myTest/input
curl https://www.xxx.com/太白金星有点烦.txt > ./myTest/input/book.txt // 这里是示例代码,大家在测试时根据实际情况放入自己要测试的txt文本即可。
cd ./myTest
python -m graphrag.index --init
执行完成后,会在当前目录(即MyTest)目录下生成几个新的文件夹:output-后续执行生成的中间结果会保存到这个目录中;prompts-处理过程中用到的一些Prompt内容;.env-大模型API配置文件,里面默认就一个GRAPHRAG_API_KEY
用于配置大模型的apiKey;settings.yaml-该文件是整体的配置信息,如果我们使用的非OPENAI的官方模型和官方API,我们需要修改此配置文件来让GraphRAG按照我们指定的配置文件执行。
先在.env文件中配置大模型API的Key,这个配置是全局生效的。我们在.env文件中配置完成后,不需要在settings.yaml文件中重复配置。settings.yaml中使用的默认模型为gpt-4-turbo-preview
,如果不需要修改模型以及调用的API地址,那现在就已经配置完成了,后续的配置内容可以执行忽略并直接到执行阶段。
我这里使用的是agicto 提供的APIkey(主要是新用户注册可以免费获取到10块钱的调用额度,白嫖还是挺爽的)。我在这里主要就修改了API地址和调用模型的名称,修改完成后的settings文件完整内容如下:
encoding_model: cl100k_base skip_workflows: [] llm: api_key: ${GRAPHRAG_API_KEY} type: openai_chat # or azure_openai_chat model: deepseek-chat model_supports_json: false # recommended if this is available for your model. api_base: https://api.agicto.cn/v1 # max_tokens: 4000 # request_timeout: 180.0 # api_version: 2024-02-15-preview # organization: <organization_id> # deployment_name: <azure_model_deployment_name> # tokens_per_minute: 150_000 # set a leaky bucket throttle # requests_per_minute: 10_000 # set a leaky bucket throttle # max_retries: 10 # max_retry_wait: 10.0 # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times # concurrent_requests: 25 # the number of parallel inflight requests that may be made parallelization: stagger: 0.3 # num_threads: 50 # the number of threads to use for parallel processing async_mode: threaded # or asyncio embeddings: ## parallelization: override the global parallelization settings for embeddings async_mode: threaded # or asyncio llm: api_key: ${GRAPHRAG_API_KEY} type: openai_embedding # or azure_openai_embedding model: text-embedding-3-small api_base: https://api.agicto.cn/v1 # api_base: https://<instance>.openai.azure.com # api_version: 2024-02-15-preview # organization: <organization_id> # deployment_name: <azure_model_deployment_name> # tokens_per_minute: 150_000 # set a leaky bucket throttle # requests_per_minute: 10_000 # set a leaky bucket throttle # max_retries: 10 # max_retry_wait: 10.0 # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times # concurrent_requests: 25 # the number of parallel inflight requests that may be made # batch_size: 16 # the number of documents to send in a single request # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request # target: required # or optional chunks: size: 300 overlap: 100 group_by_columns: [id] # by default, we don't allow chunks to cross documents input: type: file # or blob file_type: text # or csv base_dir: "input" file_encoding: utf-8 file_pattern: ".*\\.txt$" cache: type: file # or blob base_dir: "cache" # connection_string: <azure_blob_storage_connection_string> # container_name: <azure_blob_storage_container_name> storage: type: file # or blob base_dir: "output/${timestamp}/artifacts" # connection_string: <azure_blob_storage_connection_string> # container_name: <azure_blob_storage_container_name> reporting: type: file # or console, blob base_dir: "output/${timestamp}/reports" # connection_string: <azure_blob_storage_connection_string> # container_name: <azure_blob_storage_container_name> entity_extraction: ## llm: override the global llm settings for this task ## parallelization: override the global parallelization settings for this task ## async_mode: override the global async_mode settings for this task prompt: "prompts/entity_extraction.txt" entity_types: [organization,person,geo,event] max_gleanings: 0 summarize_descriptions: ## llm: override the global llm settings for this task ## parallelization: override the global parallelization settings for this task ## async_mode: override the global async_mode settings for this task prompt: "prompts/summarize_descriptions.txt" max_length: 500 claim_extraction: ## llm: override the global llm settings for this task ## parallelization: override the global parallelization settings for this task ## async_mode: override the global async_mode settings for this task # enabled: true prompt: "prompts/claim_extraction.txt" description: "Any claims or facts that could be relevant to information discovery." max_gleanings: 0 community_report: ## llm: override the global llm settings for this task ## parallelization: override the global parallelization settings for this task ## async_mode: override the global async_mode settings for this task prompt: "prompts/community_report.txt" max_length: 2000 max_input_length: 8000 cluster_graph: max_cluster_size: 10 embed_graph: enabled: false # if true, will generate node2vec embeddings for nodes # num_walks: 10 # walk_length: 40 # window_size: 2 # iterations: 3 # random_seed: 597832 umap: enabled: false # if true, will generate UMAP embeddings for nodes snapshots: graphml: false raw_entities: false top_level_nodes: false local_search: # text_unit_prop: 0.5 # community_prop: 0.1 # conversation_history_max_turns: 5 # top_k_mapped_entities: 10 # top_k_relationships: 10 # max_tokens: 12000 global_search: # max_tokens: 12000 # data_max_tokens: 12000 # map_max_tokens: 1000 # reduce_max_tokens: 2000 # concurrency: 32
此流程是GraphRAG的核心流程,即构建基于图的知识库用于后续的问答环节,通过以下代码即可触发执行。
python -m graphrag.index
基于微软在论文中提到的实现思路,执行过程GraphRAG主要实现了如下功能:
整体执行耗时与具体的文本大小有关。我这个例子整体耗时大概20分钟,耗费人民币大约4块钱。执行过程中的输出如下:
声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/正经夜光杯/article/detail/818481
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。