微软开源GraphRAG的使用教程-使用自定义数据测试GraphRAG_graphrag 安装

作者：小舞很执着 | 2024-07-19 01:48:18

踩

graphrag 安装

在这里插入图片描述

微软在今年4月份的时候提出了GraphRAG的概念，然后在上周开源了GraphRAG,Github链接见https://github.com/microsoft/graphrag,截止当前，已有6900+Star。

安装教程

官方推荐使用Python3.10-3.12版本，我使用Python3.10版本安装时，在初始化项目过程中会报错，切换到Python3.11版本后运行正常，推测是Python3.10与微软的一些最新的SDK不兼容。所以建议使用Python3.11的环境，安装GraphRAG比较简单，直接下面一行代码即可安装成功。

pip install graphrag
1

使用教程

在这个教程中，我们使用马伯庸的《太白金星有点烦》这个短篇小说为例，测试下使用微软开源的GraphRAG的处理效果。

注意，GraphRAG是使用LLM来提取文本片段中的实体关系，因此耗费Token数较多，如果是个人调研使用，不建议使用GPT4级别的模型（费用太高，不差钱的大佬请忽略此条建议）。综合成本和效果，我这里使用的是DeepSeek-Chat模型。

初始化项目

我这边先创建了一个临时测试目录myTest，然后按照官方教程，在myTest目录下创建了input目录，并把《太白金星有点烦》这本书的txt版本重命名为book.txt后放到input目录下。然后调用python -m graphrag.index --init 进行初始化工作，生成一些配置文件。

mkdir ./myTest/input
curl https://www.xxx.com/太白金星有点烦.txt > ./myTest/input/book.txt  // 这里是示例代码，大家在测试时根据实际情况放入自己要测试的txt文本即可。
cd ./myTest
python -m graphrag.index --init
1
2
3
4

执行完成后，会在当前目录（即MyTest）目录下生成几个新的文件夹：output-后续执行生成的中间结果会保存到这个目录中；prompts-处理过程中用到的一些Prompt内容；.env-大模型API配置文件，里面默认就一个GRAPHRAG_API_KEY 用于配置大模型的apiKey；settings.yaml-该文件是整体的配置信息，如果我们使用的非OPENAI的官方模型和官方API，我们需要修改此配置文件来让GraphRAG按照我们指定的配置文件执行。

配置相关文件

先在.env文件中配置大模型API的Key，这个配置是全局生效的。我们在.env文件中配置完成后，不需要在settings.yaml文件中重复配置。settings.yaml中使用的默认模型为gpt-4-turbo-preview ，如果不需要修改模型以及调用的API地址，那现在就已经配置完成了，后续的配置内容可以执行忽略并直接到执行阶段。

我这里使用的是agicto 提供的APIkey(主要是新用户注册可以免费获取到10块钱的调用额度，白嫖还是挺爽的)。我在这里主要就修改了API地址和调用模型的名称，修改完成后的settings文件完整内容如下：

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.agicto.cn/v1
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    api_base: https://api.agicto.cn/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
  

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145

执行并构建图索引

此流程是GraphRAG的核心流程，即构建基于图的知识库用于后续的问答环节，通过以下代码即可触发执行。

python -m graphrag.index
1

基于微软在论文中提到的实现思路，执行过程GraphRAG主要实现了如下功能：

Source Documents → Text Chunks：将源文档分割成文本块。
Text Chunks → Element Instances：从每个文本块中提取图节点和边的实例。
Element Instances → Element Summaries：为每个图元素生成摘要。
Element Summaries → Graph Communities：使用社区检测算法将图划分为社区。
Graph Communities → Community Summaries：为每个社区生成摘要。
Community Summaries → Community Answers → Global Answer：使用社区摘要生成局部答案，然后汇总这些局部答案以生成全局答案。

整体执行耗时与具体的文本大小有关。我这个例子整体耗时大概20分钟，耗费人民币大约4块钱。执行过程中的输出如下：


声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/小舞很执着/article/detail/848792

推荐阅读

相关标签