赞
踩
目前微软已经开源了GraphRAG的完整项目代码。对于某一些LLM的下游任务则可以使用GraphRAG去增强自己业务的RAG的表现。项目给出了两种使用方式:
(1)创建虚拟环境(已安装好anaconda),此处建议使用python3.11:
conda create -n GraphRAG python=3.11
conda activate GraphRAG
git clone https://github.com/microsoft/graphrag.git
cd graphrag
(1)安装poetry资源包管理工具及相关依赖:
pip install poetry
poetry install
(2)初始化
poetry run poe index --init --root .
正确运行后,此处会在graphrag目录下生成output、prompts、.env、settings.yaml文件
mkdir ./input
curl https://www.xxx.com/xxx.txt > ./input/book.txt #示例,可以替换为任何的txt文件
(1)修改.env文件(默认是隐藏的)中的api_key
vi .env #进入.env文件,并修改为自己的api_key
修改后是全局配置,后续不需要再次修改了
(2)修改settings.yaml文件,修改其中的使用的llm模型和对应的api_base
提前说明,因为GraphRAG需要多次调用大模型和Embedding,默认使用的是openai的GPT-4,花费及其昂贵(
土豪当我没说,配置也不需要改),建议大家可以使用其他模型或国产大模型的api
我这里使用的是agicto提供的APIkey(主要是新用户注册可以免费获取到10块钱的调用额度,白嫖还是挺爽的)。我在这里主要就修改了API地址和调用模型的名称,修改完成后的settings文件完整内容如下:
(代码行后有标记的为需要修改的地方),如果用的是agicto则则不用修改settings.yaml
encoding_model: cl100k_base skip_workflows: [] llm: api_key: ${GRAPHRAG_API_KEY} type: openai_chat # or azure_openai_chat model: deepseek-chat #修改 model_supports_json: false # recommended if this is available for your model. api_base: https://api.agicto.cn/v1 #修改 # max_tokens: 4000 # request_timeout: 180.0 # api_version: 2024-02-15-preview # organization: <organization_id> # deployment_name: <azure_model_deployment_name> # tokens_per_minute: 150_000 # set a leaky bucket throttle # requests_per_minute: 10_000 # set a leaky bucket throttle # max_retries: 10 # max_retry_wait: 10.0 # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times # concurrent_requests: 25 # the number of parallel inflight requests that may be made parallelization: stagger: 0.3 # num_threads: 50 # the number of threads to use for parallel processing async_mode: threaded # or asyncio embeddings: ## parallelization: override the global parallelization settings for embeddings async_mode: threaded # or asyncio llm: api_key: ${GRAPHRAG_API_KEY} type: openai_embedding # or azure_openai_embedding model: text-embedding-3-small #修改 api_base: https://api.agicto.cn/v1 #修改 # api_base: https://<instance>.openai.azure.com # api_version: 2024-02-15-preview # organization: <organization_id> # deployment_name: <azure_model_deployment_name> # tokens_per_minute: 150_000 # set a leaky bucket throttle # requests_per_minute: 10_000 # set a leaky bucket throttle # max_retries: 10 # max_retry_wait: 10.0 # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times # concurrent_requests: 25 # the number of parallel inflight requests that may be made # batch_size: 16 # the number of documents to send in a single request # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request # target: required # or optional chunks: size: 300 overlap: 100 group_by_columns: [id] # by default, we don't allow chunks to cross documents input: type: file # or blob file_type: text # or csv base_dir: "input" file_encoding: utf-8 file_pattern: ".*\\.txt$" cache: type: file # or blob base_dir: "cache" # connection_string: <azure_blob_storage_connection_string> # container_name: <azure_blob_storage_container_name> storage: type: file # or blob base_dir: "output/${timestamp}/artifacts" # connection_string: <azure_blob_storage_connection_string> # container_name: <azure_blob_storage_container_name> reporting: type: file # or console, blob base_dir: "output/${timestamp}/reports" # connection_string: <azure_blob_storage_connection_string> # container_name: <azure_blob_storage_container_name> entity_extraction: ## llm: override the global llm settings for this task ## parallelization: override the global parallelization settings for this task ## async_mode: override the global async_mode settings for this task prompt: "prompts/entity_extraction.txt" entity_types: [organization,person,geo,event] max_gleanings: 0 summarize_descriptions: ## llm: override the global llm settings for this task ## parallelization: override the global parallelization settings for this task ## async_mode: override the global async_mode settings for this task prompt: "prompts/summarize_descriptions.txt" max_length: 500 claim_extraction: ## llm: override the global llm settings for this task ## parallelization: override the global parallelization settings for this task ## async_mode: override the global async_mode settings for this task # enabled: true prompt: "prompts/claim_extraction.txt" description: "Any claims or facts that could be relevant to information discovery." max_gleanings: 0 community_report: ## llm: override the global llm settings for this task ## parallelization: override the global parallelization settings for this task ## async_mode: override the global async_mode settings for this task prompt: "prompts/community_report.txt" max_length: 2000 max_input_length: 8000 cluster_graph: max_cluster_size: 10 embed_graph: enabled: false # if true, will generate node2vec embeddings for nodes # num_walks: 10 # walk_length: 40 # window_size: 2 # iterations: 3 # random_seed: 597832 umap: enabled: false # if true, will generate UMAP embeddings for nodes snapshots: graphml: false raw_entities: false top_level_nodes: false local_search: # text_unit_prop: 0.5 # community_prop: 0.1 # conversation_history_max_turns: 5 # top_k_mapped_entities: 10 # top_k_relationships: 10 # max_tokens: 12000 global_search: # max_tokens: 12000 # data_max_tokens: 12000 # map_max_tokens: 1000 # reduce_max_tokens: 2000 # concurrency: 32
poetry run poe index --root .
成功后如下:
⠋ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
├── create_final_community_reports
├── create_final_text_units
├── create_base_documents
└── create_final_documents
声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/运维做开发/article/detail/892350
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。