Note: a record of learning to use LangChain.
Note: these are summary study notes based mainly on a video course.
Original video: has code and video, and the explanations are quite good.
Bilibili video link: lower resolution, but with Chinese narration.
Langchain中文网: parts of the video (the Agent material, for example) may be outdated, so it is better to follow the documentation for those.
Environment you may need:
pip install langchain
pip install langchain_community
pip install sentence-transformers
pip install chromadb
Here we need to subclass LangChain's base LLM class:
from langchain.llms.base import LLM
Then override the _call method. The exact inference code depends on the model you are using, but as long as the method returns the generated text, it will work.
from typing import Any, List, Optional
from langchain.llms.base import LLM
from langchain_core.callbacks import CallbackManagerForLLMRun
from pydantic import Field
from modelscope import AutoModelForCausalLM, AutoTokenizer

class QwenLLM(LLM):
    # These field declarations are required so the pydantic-based LLM class accepts the attributes
    model: Any = Field(description="Qwen2-Model")
    tokenizer: Any = Field(description="Qwen2-Tokenizer")

    def __init__(self):
        # Load the base model and tokenizer
        super().__init__()
        model_name = "qwen/Qwen2-7B-Instruct"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="models")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, cache_dir="models", torch_dtype="auto", device_map="cuda"
        )

    @property
    def _llm_type(self) -> str:
        return "QwenLLM"

    def _call(self,
              prompt: str,
              stop: Optional[List[str]] = None,
              run_manager: Optional[CallbackManagerForLLMRun] = None,
              **kwargs: Any) -> str:
        model_inputs = self.tokenizer([prompt], return_tensors="pt").to("cuda")
        generated_ids = self.model.generate(model_inputs.input_ids, max_new_tokens=512, temperature=0.01)
        # Strip the prompt tokens so only the newly generated text is kept
        generated_ids = [
            output_ids[len(input_ids):]
            for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        return response
Instantiate it as an llm; later we will use this llm wherever a LangChain component expects a language model.
llm = QwenLLM()
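You can sanity-check the wrapper with a direct call before wiring it into any chain (the prompt below is just an arbitrary example):

# Quick smoke test of the custom LLM wrapper
print(llm("请用一句话介绍一下你自己。"))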
To get output that matches your requirements, you usually need to wrap your request in a carefully written prompt. LangChain provides many prompt templates; here is a quick introduction to the most basic one, PromptTemplate.
## Create a PromptTemplate
from langchain import PromptTemplate
template = "请翻译下面这段话:{english}。"
prompt_template = PromptTemplate(input_variables=["english"], template=template)
word = "hello world!"
prompt = prompt_template.format(english = word)
# Print the prompt you created
print(prompt)
## Use the PromptTemplate
prompt = prompt_template.format(english = word)
## Run inference with the model instantiated above
response = llm(prompt)
print(response)
Sometimes we want the model's output to come back already organized. That is what output parsers are for: with an appropriate prompt they get the model to return content in an expected shape, and then parse it into a Python data type that is easy to work with.
Below is a short example of using StructuredOutputParser and ResponseSchema to return and parse JSON.
# Using the output parser
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

# Define the fields we want back
response_schemas = [
    ResponseSchema(name="age", description="The age of the writer"),
    ResponseSchema(name="sex", description="Whether the writer is male or female"),
    ResponseSchema(name="hair", description="The color of the writer's hair")
]

# Build a StructuredOutputParser from the response schemas
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
# Format instructions that tell the model how to return the information
format_instruction = output_parser.get_format_instructions()

passage = "I am a young lady, I am twenty years old, I have long red hair"

# The extraction prompt template
passage_template = '''For the following text, extract the following information:
age: How old is the writer?
sex: Is the writer a man or a woman?
hair: What is the writer's hair color?

text: {text}

{format_instruction}
'''

# Create the prompt template
prompt_template = PromptTemplate(input_variables=["text", "format_instruction"], template=passage_template)

# Fill in the template and run inference
prompt = prompt_template.format(text=passage, format_instruction=format_instruction)
response = llm.invoke(prompt)
# At this point the response is still just a JSON-looking string
print(response)

# Parse it into a real Python object
ans_json = output_parser.parse(response)
print(ans_json)
How do you make the model remember the context of the conversation? The simplest way is to feed the earlier dialogue back in with the next question. LangChain implements this through its Memory components:
### Memory
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory, ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

# Two memory types:
# ConversationBufferWindowMemory keeps only the last k turns
# ConversationBufferMemory simply keeps everything
# memory = ConversationBufferWindowMemory(k=10)
memory = ConversationBufferMemory()

# The memory has to be loaded into a conversation chain
# verbose=True prints the prompt details
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

template = "问题:{question} ,只针对问题回答,不要发散问题!"
question_template = PromptTemplate(input_variables=["question"], template=template)

print(conversation.predict(input=question_template.format(question="你好,我是徐!请介绍下你自己!")))
print("------------")
print(conversation.predict(input=question_template.format(question="请问1+1等于几")))
print("------------")
print(conversation.predict(input="我的名字是什么?"))
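Under the hood the buffer is just the accumulated dialogue text that gets prepended to the next prompt. A minimal standalone sketch (assuming the classic ConversationBufferMemory API; the example dialogue is made up) that shows this directly:

from langchain.memory import ConversationBufferMemory

# Seed the memory manually and inspect what it stores
memory = ConversationBufferMemory()
memory.save_context({"input": "你好,我是徐!"}, {"output": "你好,徐!有什么可以帮你?"})
memory.save_context({"input": "请问1+1等于几?"}, {"output": "1+1等于2。"})

# The stored history is plain text that will be injected into the next prompt
print(memory.load_memory_variables({}))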
For more detail, see the Memory section of the LangChain documentation.
LangChain builds its different capabilities on top of the LLM through chains, such as the ConversationChain above; chains are the core of LangChain.
Let's start with the most basic one, LLMChain, whose job is simply to run inference:
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
question_prompt = PromptTemplate(
input_variables=["question"],
template="Question: {question}\nAnswer:"
)
# Create the LLMChain
qa_chain = LLMChain(llm=llm, prompt=question_prompt)
# Run the chain
answer = qa_chain.run(question="What is the capital of France?")
print(answer)
Of course, you can also hook several chains together in a simple sequence with SimpleSequentialChain:
from langchain.chains import SimpleSequentialChain
translate_prompt = PromptTemplate(
input_variables=["english"],
template="请将以下句子翻译成中文。\n\n{english}"
)
# explain_prompt takes the translated Chinese sentence as its input
explain_prompt = ChatPromptTemplate.from_template("请告诉我更多关于这句话的信息:{translated_text}")
# Create the LLMChain instances; note that chain_two's input variable is translated_text
chain_one = LLMChain(llm=llm, prompt=translate_prompt)
chain_two = LLMChain(llm=llm, prompt=explain_prompt)
# Create the SimpleSequentialChain, making sure the chains are listed in the right order
over_all_chain = SimpleSequentialChain(chains=[chain_one, chain_two], verbose=True)
# Run the whole chain, passing in the original English sentence
result = over_all_chain.run("Hello World!")
print(result)
Or you can use SequentialChain to specify more freely how the different chains connect:
## SequentialChain: more flexible wiring between chains
from langchain.chains import SequentialChain
from langchain.prompts import PromptTemplate, ChatPromptTemplate

first_prompt = PromptTemplate(
    input_variables=["Input"],
    template="Please just translate the following sentence into English.\n\nsentence:{Input}"
)
second_prompt = ChatPromptTemplate.from_template(
    "Please tell what {English_output} is used for in computer science."
)
third_prompt = ChatPromptTemplate.from_template(
    "Please tell me the language name of the following sentence. \n\nsentence:{Input} \nLanguage_name:"
)
forth_prompt = ChatPromptTemplate.from_template(
    '''You are a translation expert; you can translate English into any language.
Translate the following whole sentence into Chinese, don't cut it off.
sentence:{Explain_output}
translated:'''
)

chain_one = LLMChain(llm=llm, prompt=first_prompt, output_key="English_output")
chain_two = LLMChain(llm=llm, prompt=second_prompt, output_key="Explain_output")
chain_three = LLMChain(llm=llm, prompt=third_prompt, output_key="language")
chain_four = LLMChain(llm=llm, prompt=forth_prompt, output_key="follow_message")

# Declare the overall input and output keys explicitly
over_all_chain = SequentialChain(
    chains=[chain_one, chain_two, chain_three, chain_four],
    input_variables=["Input"],
    output_variables=["English_output", "Explain_output", "language", "follow_message"],
    verbose=True
)

# Note the different way of running it: pass the input, then read the named outputs
answer = over_all_chain("你好,世界!")
print(answer["Explain_output"])
print(answer["follow_message"])
Or you may want a RouterChain that intelligently decides which chain to use:
## Router Chain: routes the input to different sub-chains
## LLMRouterChain uses an LLM to route between the sub-chains
## RouterOutputParser parses the LLM output to decide which sub-chain to use
from langchain.chains.router import MultiPromptChain
from langchain.chains.router.llm_router import LLMRouterChain, RouterOutputParser
from langchain.prompts import ChatPromptTemplate, PromptTemplate

physics_template = """You are a very smart physics professor. \
You are great at answering questions about physics in a concise \
and easy to understand manner. \
When you don't know the answer to a question you admit \
that you don't know.

Here is a question:
{input}

Assistant:"""

computerscience_template = """You are a successful computer scientist. \
You have a passion for creativity, collaboration, \
forward-thinking, confidence, strong problem-solving capabilities, \
understanding of theories and algorithms, and excellent communication \
skills. \
You are great at answering coding questions. \
You are so good because you know how to solve a problem by \
describing the solution in imperative steps \
that a machine can easily interpret and you know how to \
choose a solution that has a good balance between \
time complexity and space complexity.

Here is a question:
{input}

Assistant:"""

# The description decides which prompt gets chosen
prompt_info = [
    {
        "name": "physics",
        "description": "Good for answering questions about physics",
        "prompt_template": physics_template
    },
    {
        "name": "computer science",
        "description": "Good for answering questions about computer science",
        "prompt_template": computerscience_template
    },
]

# Build the set of destination chains the Router Chain can choose from
destination_chains = {}
for p_info in prompt_info:
    name = p_info["name"]
    prompt_template = p_info["prompt_template"]
    prompt = ChatPromptTemplate.from_template(prompt_template)
    chain = LLMChain(llm=llm, prompt=prompt)
    destination_chains[name] = chain

destinations = [f"{p['name']}: {p['description']}" for p in prompt_info]
destinations_str = "\n".join(destinations)

default_prompt = ChatPromptTemplate.from_template("{input}")
default_chain = LLMChain(llm=llm, prompt=default_prompt)

MULTI_PROMPT_ROUTER_TEMPLATE = """Given a raw text input to a \
language model select the model prompt best suited for the input. \
You will be given the names of the available prompts and a \
description of what the prompt is best suited for. \
You may also revise the original input if you think that revising \
it will ultimately lead to a better response from the language model.

<< FORMATTING >>
Return a markdown code snippet with a JSON object formatted to look like:
```json
{{{{
    "destination": string \ name of the prompt to use or "DEFAULT"
    "next_inputs": string \ a potentially modified version of the original input
}}}}
```

REMEMBER: "destination" MUST be one of the candidate prompt \
names specified below OR it can be "DEFAULT" if the input is not \
well suited for any of the candidate prompts.
REMEMBER: "next_inputs" can just be the original input \
if you don't think any modifications are needed.

<< CANDIDATE PROMPTS >>
{destinations}

<< INPUT >>
{{input}}

<< OUTPUT (remember to include the ```json)>>"""

router_template = MULTI_PROMPT_ROUTER_TEMPLATE.format(
    destinations=destinations_str
)
router_prompt = PromptTemplate(
    template=router_template,
    input_variables=["input"],
    output_parser=RouterOutputParser(),
)

# The LLM that performs the routing
router_chain = LLMRouterChain.from_llm(llm, router_prompt)

# Wire up the router chain, the destination chains and the default chain
chain = MultiPromptChain(router_chain=router_chain,
                         destination_chains=destination_chains,
                         default_chain=default_chain,
                         verbose=True
                         )

answer = chain.run("什么是print(\"helloworld\")?")
print(answer)
Simply put: you upload a document, and the model answers questions based on its content. The usual flow is to first recall the relevant document chunks with some retrieval method, add them to the prompt, and finally have the LLM produce the answer. In LangChain we can implement this with RetrievalQA plus a knowledge base we build ourselves.
Step 1: build the knowledge base by splitting the document into chunks, embedding them, and storing the vectors in a vector store.
## Build the knowledge base
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import CSVLoader, TextLoader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from modelscope import snapshot_download

# Replace with your own document
file = '0b46f7a2d67b5b59ad67cafffa0e12a9f0837790.txt'
loader = TextLoader(file_path=file, encoding='utf-8')
doc = loader.load()

# Split the document into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=100)
docs = text_splitter.split_documents(doc)

# Download and initialize a Chinese sentence-embedding model from ModelScope
snapshot_download("iic/nlp_gte_sentence-embedding_chinese-base")
embeddings = HuggingFaceEmbeddings(model_name="./.cache/modelscope/hub/iic/nlp_gte_sentence-embedding_chinese-base")

# Create the vector database
db = Chroma.from_documents(docs, embeddings, persist_directory="db")
# Optional: persist the vector database to disk
db.persist()
Step 2: combine the knowledge base and the LLM into a RetrievalQA chain.
# Knowledge-base question answering
from langchain.chains import RetrievalQA
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="./.cache/modelscope/hub/iic/nlp_gte_sentence-embedding_chinese-base")
db = Chroma(persist_directory="db", embedding_function=embeddings)

# Set how many documents are recalled per query
retriever = db.as_retriever(search_kwargs={"k": 10})

qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=retriever,
                                       return_source_documents=False)

# Ask questions
# query = "文件中的公司名字是什么?"
# result = qa_chain({"query": query})
# print(result['result'])

# With return_source_documents=True you need to call invoke
# result = qa_chain.invoke("这个公司的主要营业内容是什么?")
# print(result['result'])
# print(result['source_documents'])

result = qa_chain.run("公司的主要资产情况如何?")
print(result)
Note that the RetrievalQA chain here has no memory; for knowledge-base QA with memory you can use ConversationalRetrievalChain.
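A minimal sketch of that variant, reusing the llm and retriever defined above and assuming the classic ConversationalRetrievalChain API (the memory_key value and the example questions are assumptions for illustration):

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Memory that stores the chat history between turns
chat_memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Build the conversational retrieval chain on top of the same retriever
conv_qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=chat_memory
)

# Follow-up questions can now refer back to earlier turns
print(conv_qa_chain.run("公司的主要资产情况如何?"))
print(conv_qa_chain.run("那它的主要营业内容呢?"))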
At this point you already have a basic picture of LangChain. What remains is mainly the Tools and Agent part, which is also a core piece of LangChain; since that part of the video is outdated, it is better to learn it from the examples in the official docs (Langchain中文网).
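As a small taste of that part, here is a hedged sketch using the classic initialize_agent API with one custom tool; the tool itself is a made-up example, and newer LangChain releases use a different agent interface, so treat this only as a starting point:

from langchain.agents import AgentType, Tool, initialize_agent

# A made-up tool for illustration: the agent can call it when it needs a character count
def count_characters(text: str) -> str:
    return str(len(text))

tools = [
    Tool(
        name="CharacterCounter",
        func=count_characters,
        description="Useful for counting how many characters a piece of text contains."
    )
]

# ZERO_SHOT_REACT_DESCRIPTION lets the LLM decide, through a ReAct-style prompt,
# whether and how to call the tools
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

print(agent.run("请问 'hello world' 一共有多少个字符?"))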