
Elasticsearch: Chat with multiple PDFs | LangChain Python app tutorial (free LLMs and embeddings)

In this blog you will learn how to build a LangChain application that lets you chat with multiple PDF files, using either the ChatGPT API or Huggingface language models.

As the architecture diagram shows, we ingest the PDF files on the far left, concatenate their text, and split it into chunks. We run the chunks through a Huggingface embedding model to produce embeddings, which we write to an Elasticsearch vector database and persist. At query time, we vectorize the question via LangChain and run a vector search in Elasticsearch. Finally, we hand the retrieved passage and the question to a large language model, which generates the answer. The finished interface looks like this:

As shown above, the app can answer the questions we ask it.

All of the source code can be downloaded from GitHub - liu-xiao-guo/ask-multiple-pdfs: A Langchain app that allows you to chat with multiple PDFs.

Installation

If you do not yet have your own Elasticsearch and Kibana installed, please refer to the official installation guides.

When installing, follow the guide for Elastic Stack 8.x. By default, access to the Elasticsearch cluster is secured with HTTPS.

After installation, the certificate file http_ca.crt can be found in the following Elasticsearch directory:

```
$ pwd
/Users/liuxg/elastic/elasticsearch-8.10.0/config/certs
$ ls
http.p12      http_ca.crt   transport.p12
```
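Before going further, it can be worth checking that the certificate actually authenticates against the cluster. A quick sanity check with curl, assuming the default port 9200 (substitute your own elastic password):

```
# Verify the HTTPS connection using the copied CA certificate.
curl --cacert http_ca.crt -u elastic:<your_password> https://localhost:9200
```

If the certificate and password are correct, Elasticsearch answers with its cluster-info JSON.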

We need to copy this certificate into the root directory of the project:

```
$ tree -L 3
.
├── app.py
├── docs
│   └── PDF-LangChain.jpg
├── htmlTemplates.py
├── http_ca.crt
├── lib_embeddings.py
├── lib_indexer.py
├── lib_llm.py
├── lib_vectordb.py
├── myapp.py
├── pdf_files
│   ├── sample1.pdf
│   └── sample2.pdf
├── readme.md
├── requirements.txt
└── simple.cfg
```

As shown above, http_ca.crt sits in the application's root directory. We have placed two test PDF files under pdf_files; you can use your own PDF files instead. We configure simple.cfg as follows:

```
ES_SERVER: "localhost"
ES_PASSWORD: "vXDWYtL*my3vnKY9zCfL"
ES_FINGERPRINT: "e2c1512f617f432ddf242075d3af5177b28f6497fecaaa0eea11429369bb7b00"
```

Here, ES_SERVER is the address of the Elasticsearch cluster, and ES_PASSWORD is the password of the Elasticsearch superuser elastic. The ES_FINGERPRINT value can be found in the output Elasticsearch prints the first time it starts up:

You can also find the fingerprint in Kibana's configuration file config/kibana.yml:
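If neither source is handy, the fingerprint can also be derived from http_ca.crt itself. A sketch using openssl; note that the colons in its output need to be stripped (and the hex lowercased) to match the format used in simple.cfg:

```
# Print the SHA-256 fingerprint of the CA certificate.
openssl x509 -fingerprint -sha256 -noout -in http_ca.crt
```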

The project directory also contains a file named .env.example. We can rename it to .env with the following command:

```
mv .env.example .env
```

In .env, we enter the token obtained from the huggingface.co website:

```
$ cat .env
OPENAI_API_KEY=your_openai_key
HUGGINGFACEHUB_API_TOKEN=your_huggingface_key
```
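For reference, the load_dotenv() call in the application simply exports these values into the process environment. A minimal sketch of how they become visible to the code:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# LangChain's Huggingface integrations look this variable up by name.
print(os.environ.get("HUGGINGFACEHUB_API_TOKEN"))
```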

In this example, we will test with Huggingface. If you want to use OpenAI instead, you need to configure its key. A Huggingface developer token can be created under your account settings on huggingface.co.

Running the project

Before running the project, you need to perform the following installation steps:

```
python3 -m venv env
source env/bin/activate
python3 -m pip install --upgrade pip
pip install -r requirements.txt
```

Creating the interface

We build the application's interface with streamlit, which is quite straightforward. The code in myapp.py looks like this:

myapp.py

```python
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from htmlTemplates import css, bot_template, user_template


def get_pdf_texts(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text


def main():
    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents")
    if user_question:
        pass

    st.write(user_template.replace("{{MSG}}", "Hello, human").replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello, robot").replace("{{MSG1}}", " "), unsafe_allow_html=True)

    # Add a sidebar for uploading documents
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on Process", accept_multiple_files=True)
        print(pdf_docs)
        if st.button("Process"):
            with st.spinner("Processing"):
                # Extract the raw text from the uploaded PDFs
                raw_text = get_pdf_texts(pdf_docs)
                st.write(raw_text)


if __name__ == "__main__":
    main()
```

In the code above, we create a sidebar for selecting the desired PDF files, and a Process button that displays the extracted PDF text. We can run the app with the following command:

```
(venv) $ streamlit run myapp.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8502
  Network URL: http://198.18.1.13:8502
```

After running the command, we can open the app in a browser:

We click Browse files and select our PDF files:

Clicking Process, we see:

Above, for display convenience, we used st.write to dump the extracted text directly into the browser page. Next, we need to split this long text into chunks, keeping each chunk within the maximum input size the model allows.
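To make that splitting step concrete, here is a minimal sketch in isolation, using the same chunk_size and chunk_overlap values that get_text_chunks uses below (raw_text is assumed to hold the extracted PDF text):

```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,    # upper bound on characters per chunk
    chunk_overlap=200,  # overlap preserves context across chunk boundaries
    length_function=len
)
chunks = text_splitter.split_text(raw_text)
print(f"{len(chunks)} chunks, longest: {max(len(c) for c in chunks)} characters")
```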

That covers the construction of the UI. The complete final myapp.py looks like this:

myapp.py

```python
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from htmlTemplates import css, bot_template, user_template

import lib_indexer
import lib_llm
import lib_embeddings
import lib_vectordb

index_name = "pdf_docs"


def get_pdf_text(pdf):
    text = ""
    pdf_reader = PdfReader(pdf)
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text


def get_pdf_texts(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text


def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    # chunks = text_splitter.split_documents(text)
    return chunks


def get_text_chunks1(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0)
    chunks = text_splitter.split_text(text)
    return chunks


def handle_userinput(db, llm_chain_informed, user_question):
    similar_docs = db.similarity_search(user_question)
    print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

    ## 4. Ask the local LLM with a context-informed prompt
    # print(">> 4. Asking The Book ... and its response is: ")
    informed_context = similar_docs[0].page_content
    response = llm_chain_informed.run(context=informed_context, question=user_question)

    st.write(user_template.replace("{{MSG}}", user_question).replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", response).replace("{{MSG1}}", similar_docs[0].page_content), unsafe_allow_html=True)


def main():
    # Huggingface embedding setup
    hf = lib_embeddings.setup_embeddings()

    # Elasticsearch as a vector db
    db, url = lib_vectordb.setup_vectordb(hf, index_name)

    # Set up the conversational LLM
    llm_chain_informed = lib_llm.make_the_llm()

    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents")
    if user_question:
        handle_userinput(db, llm_chain_informed, user_question)

    st.write(user_template.replace("{{MSG}}", "Hello, human").replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello, robot").replace("{{MSG1}}", " "), unsafe_allow_html=True)

    # Add a sidebar for uploading documents
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on Process", accept_multiple_files=True)
        print(pdf_docs)
        if st.button("Process"):
            with st.spinner("Processing"):
                # Get the raw PDF text
                # raw_text = get_pdf_text(pdf_docs[0])
                raw_text = get_pdf_texts(pdf_docs)
                # st.write(raw_text)
                print(raw_text)

                # Get the text chunks
                text_chunks = get_text_chunks(raw_text)
                # st.write(text_chunks)

                # Create the vector store
                lib_indexer.loadPdfChunks(text_chunks, url, hf, db, index_name)


if __name__ == "__main__":
    main()
```

Creating the embedding model

lib_embeddings.py

```python
## for embeddings
from langchain.embeddings import HuggingFaceEmbeddings


def setup_embeddings():
    # Huggingface embedding setup
    print(">> Prep. Huggingface embedding setup")
    model_name = "sentence-transformers/all-mpnet-base-v2"
    return HuggingFaceEmbeddings(model_name=model_name)
```
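As a quick sanity check (not part of the app itself), the returned object can embed a single query; sentence-transformers/all-mpnet-base-v2 produces 768-dimensional vectors:

```python
import lib_embeddings

hf = lib_embeddings.setup_embeddings()
vector = hf.embed_query("What is in these PDFs?")
print(len(vector))  # 768 for all-mpnet-base-v2
```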

Creating the vector store

lib_vectordb.py

```python
import os
from config import Config

## for vector store
from langchain.vectorstores import ElasticVectorSearch


def setup_vectordb(hf, index_name):
    # Elasticsearch URL setup
    print(">> Prep. Elasticsearch config setup")

    with open('simple.cfg') as f:
        cfg = Config(f)

    endpoint = cfg['ES_SERVER']
    username = "elastic"
    password = cfg['ES_PASSWORD']

    ssl_verify = {
        "verify_certs": True,
        "basic_auth": (username, password),
        "ca_certs": "./http_ca.crt",
    }

    url = f"https://{username}:{password}@{endpoint}:9200"

    return ElasticVectorSearch(embedding=hf,
                               elasticsearch_url=url,
                               index_name=index_name,
                               ssl_verify=ssl_verify), url
```
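A minimal usage sketch, assuming simple.cfg and http_ca.crt are in place and the index already holds documents:

```python
import lib_embeddings
import lib_vectordb

hf = lib_embeddings.setup_embeddings()
db, url = lib_vectordb.setup_vectordb(hf, "pdf_docs")

# k=4 is LangChain's default; the app only uses the top hit.
docs = db.similarity_search("your question here", k=4)
print(docs[0].page_content)
```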

Creating the offline LLM, with a prompt template that takes context and question variables

lib_llm.py

```python
## for conversation LLM
from langchain import PromptTemplate, HuggingFaceHub, LLMChain
from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM


def make_the_llm():
    # Get offline flan-t5-large ready to go, in CPU mode
    print(">> Prep. Get Offline flan-t5-large ready to go, in CPU mode")

    model_id = 'google/flan-t5-large'  # go for a smaller model if you don't have the VRAM
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)  # load_in_8bit=True, device_map='auto'

    pipe = pipeline(
        "text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=100
    )
    local_llm = HuggingFacePipeline(pipeline=pipe)

    # template_informed = """
    # I know the following: {context}
    # Question: {question}
    # Answer: """

    template_informed = """
I know: {context}
when asked: {question}
my response is: """

    prompt_informed = PromptTemplate(template=template_informed,
                                     input_variables=["context", "question"])

    return LLMChain(prompt=prompt_informed, llm=local_llm)
```
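The returned LLMChain is invoked with the two template variables. A hedged example with a made-up context (in the app, the context comes from the Elasticsearch similarity search):

```python
import lib_llm

llm_chain_informed = lib_llm.make_the_llm()

answer = llm_chain_informed.run(
    context="I will send a car to meet you from the half past four arrival at Harrogate Station.",
    question="What will be sent to meet me?"
)
print(answer)
```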

Writing the PDF files into Elasticsearch as vectors

Below is my chunking and vector-storage code. It expects the assembled Elasticsearch URL, the Huggingface embedding model, the vector database object, and the target index name in Elasticsearch.

lib_indexer.py

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

## for vector store
from langchain.vectorstores import ElasticVectorSearch
from elasticsearch import Elasticsearch
from config import Config

with open('simple.cfg') as f:
    cfg = Config(f)

fingerprint = cfg['ES_FINGERPRINT']
endpoint = cfg['ES_SERVER']
username = "elastic"
password = cfg['ES_PASSWORD']

ssl_verify = {
    "verify_certs": True,
    "basic_auth": (username, password),
    "ca_certs": "./http_ca.crt"
}

url = f"https://{username}:{password}@{endpoint}:9200"


def parse_book(filepath):
    loader = TextLoader(filepath)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    return docs


def parse_triplets(filepath):
    docs = parse_book(filepath)
    result = []
    for i in range(len(docs) - 2):
        concat_str = docs[i].page_content + " " + docs[i+1].page_content + " " + docs[i+2].page_content
        result.append(concat_str)
    return result

# db.from_texts(docs, embedding=hf, elasticsearch_url=url, index_name=index_name)


## load book utility
## params
##   filepath: where to get the book txt ... should be utf-8
##   url: the full Elasticsearch url with username, password and port embedded
##   hf: Hugging Face transformer for sentences
##   db: the LangChain VectorStore object, ready to go with the embedding already set up
##   index_name: name of the index to use in ES
##
## will check if index_name already exists in ES url before attempting split and load
def loadBookTriplets(filepath, url, hf, db, index_name):
    with open('simple.cfg') as f:
        cfg = Config(f)

    fingerprint = cfg['ES_FINGERPRINT']
    es = Elasticsearch([url],
                       basic_auth=("elastic", cfg['ES_PASSWORD']),
                       ssl_assert_fingerprint=fingerprint,
                       http_compress=True)

    ## Parse the book if necessary
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 1. Chunk up the Source document")

        results = parse_triplets(filepath)

        print(">> 2. Index the chunks into Elasticsearch")

        # parse_triplets returns plain strings, so index them with from_texts
        elastic_vector_search = ElasticVectorSearch.from_texts(results,
                                                               embedding=hf,
                                                               elasticsearch_url=url,
                                                               index_name=index_name,
                                                               ssl_verify=ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")


def loadBookBig(filepath, url, hf, db, index_name):
    es = Elasticsearch([url],
                       basic_auth=("elastic", cfg['ES_PASSWORD']),
                       ssl_assert_fingerprint=fingerprint,
                       http_compress=True)

    ## Parse the book if necessary
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 1. Chunk up the Source document")

        docs = parse_book(filepath)
        # print(docs)

        print(">> 2. Index the chunks into Elasticsearch")

        elastic_vector_search = ElasticVectorSearch.from_documents(docs,
                                                                   embedding=hf,
                                                                   elasticsearch_url=url,
                                                                   index_name=index_name,
                                                                   ssl_verify=ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")


def loadPdfChunks(chunks, url, hf, db, index_name):
    es = Elasticsearch([url],
                       basic_auth=("elastic", cfg['ES_PASSWORD']),
                       ssl_assert_fingerprint=fingerprint,
                       http_compress=True)

    ## Index the chunks if necessary
    if not es.indices.exists(index=index_name):
        print(f'\tThe index: {index_name} does not exist')
        print(">> 2. Index the chunks into Elasticsearch")

        print("url: ", url)
        print("index_name", index_name)

        elastic_vector_search = db.from_texts(chunks,
                                              embedding=hf,
                                              elasticsearch_url=url,
                                              index_name=index_name,
                                              ssl_verify=ssl_verify)
    else:
        print("\tLooks like the pdfs are already loaded, let's move on")
```
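Tying the pieces together outside of Streamlit, a minimal indexing sketch (the chunks list here is hypothetical sample data; in the app it comes from get_text_chunks(raw_text)):

```python
import lib_embeddings
import lib_vectordb
import lib_indexer

hf = lib_embeddings.setup_embeddings()
db, url = lib_vectordb.setup_vectordb(hf, "pdf_docs")

chunks = ["first chunk of PDF text ...", "second chunk ..."]  # hypothetical sample data
lib_indexer.loadPdfChunks(chunks, url, hf, db, "pdf_docs")
```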

Asking questions

We use a streamlit text input for asking questions:

```python
user_question = st.text_input("Ask a question about your documents")
if user_question:
    handle_userinput(db, llm_chain_informed, user_question)
```

When we press ENTER, the code above calls handle_userinput(db, llm_chain_informed, user_question):

```python
def handle_userinput(db, llm_chain_informed, user_question):
    similar_docs = db.similarity_search(user_question)
    print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

    ## 4. Ask the local LLM with a context-informed prompt
    # print(">> 4. Asking The Book ... and its response is: ")
    informed_context = similar_docs[0].page_content
    response = llm_chain_informed.run(context=informed_context, question=user_question)

    st.write(user_template.replace("{{MSG}}", user_question).replace("{{MSG1}}", " "), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", response).replace("{{MSG1}}", similar_docs[0].page_content), unsafe_allow_html=True)
```

It first runs a similarity search against db, then uses the large language model to produce the answer from the retrieved passage.

Running the app

We run the code with the command:

```
streamlit run myapp.py
```

In the browser, we select the two PDF files in pdf_files:

Above, we type in the question we want to ask:

The question above is:

what do I make all the same and put a cup next to him on the desk?

Asking another question:

The question above is:

when should you come? I will send a car to meet you from the half past four arrival at Harrogate Station.

The question above is:

what will I send to meet you from the half past four arrival at Harrogate Station?

You can try out other questions as well. Happy journey :)

Using ChatGPT works essentially the same way: you only need to swap in the ChatGPT model and its corresponding key. We won't repeat the details here.
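As a sketch of what that swap might look like, make_the_llm could return a chain backed by an OpenAI chat model instead of the local flan-t5 pipeline. This variant is an assumption, not part of the repository; it presumes OPENAI_API_KEY is set in .env, and gpt-3.5-turbo is used as an example model name:

```python
## a hypothetical OpenAI-backed variant of make_the_llm
from langchain import PromptTemplate, LLMChain
from langchain.chat_models import ChatOpenAI


def make_the_llm_openai():
    template_informed = """
I know: {context}
when asked: {question}
my response is: """
    prompt_informed = PromptTemplate(template=template_informed,
                                     input_variables=["context", "question"])
    # Reads OPENAI_API_KEY from the environment (loaded from .env).
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    return LLMChain(prompt=prompt_informed, llm=llm)
```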
