
Easily query PDF files using Chainlit, LangChain, and Elasticsearch

In my previous article "Elasticsearch: Chat with multiple PDF files | LangChain Python application tutorial (free LLMs and embeddings)", I described in detail how to use Streamlit, LangChain, Elasticsearch, and OpenAI to chat with PDFs. In today's article, I will use Chainlit to show how to query PDF files with LangChain and Elasticsearch.

To make it easier to follow along, my code can be downloaded from GitHub - liu-xiao-guo/langchain-openai-chainlit: Chat with your documents (pdf, csv, text) using Openai model, LangChain and Chainlit.
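If you want to run the code locally, a typical way to get it (assuming git is installed) is to clone the repository and change into it:

git clone https://github.com/liu-xiao-guo/langchain-openai-chainlit
cd langchain-openai-chainlit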

Installation

Install Elasticsearch and Kibana

If you do not yet have your own Elasticsearch and Kibana deployment, please refer to my earlier articles on how to install them.

When installing, please choose Elastic Stack 8.x. During installation we can see the installation information, which includes the generated password for the elastic superuser.
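If you installed Elasticsearch from the downloaded archive (the directory layout assumed later in this article), it is started from the installation directory, for example:

cd ~/elastic/elasticsearch-8.12.0
./bin/elasticsearch

On the first start, the terminal prints the password for the elastic user and the enrollment token for Kibana.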

Copy the Elasticsearch certificate

We copy the Elasticsearch certificate into the current directory:

$ pwd
/Users/liuxg/python/elser
$ cp ~/elastic/elasticsearch-8.12.0/config/certs/http_ca.crt .
$ ls http_ca.crt
http_ca.crt
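With the certificate in the project directory, we can quickly verify that the cluster is reachable over HTTPS before writing any Python code. The following curl call (it will prompt for the elastic user's password) should return the cluster information as JSON:

curl --cacert ./http_ca.crt -u elastic https://localhost:9200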

Install the Python dependencies

We type the following commands in the current directory:

python3 -m venv .venv
source .venv/bin/activate

Then we type the following commands:

$ pwd
/Users/liuxg/python/langchain-openai-chainlit
$ source .venv/bin/activate
(.venv) $ pip3 install -r requirements.txt
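The authoritative dependency list is the requirements.txt that ships with the repository. Judging from the imports used in pdf_qa.py below, it needs to cover roughly the following packages (a sketch, not the actual file contents):

chainlit
langchain
langchain-openai
langchain-community
elasticsearch
PyPDF2
python-dotenv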

Run the application

For more about Chainlit, please refer to Overview - Chainlit; I will not go into detail here, but the short sketch just below shows its basic programming model. After it comes the full code of pdf_qa.py.
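A minimal Chainlit app (my own illustrative example, not part of the repository) consists of just two async hooks: one that runs when a chat session starts and one that runs for every user message:

import chainlit as cl

@cl.on_chat_start
async def start():
    # Runs once when a new chat session is opened
    await cl.Message(content="Session started - ask me anything!").send()

@cl.on_message
async def on_message(message: cl.Message):
    # Runs for every user message; this toy app simply echoes it back
    await cl.Message(content=f"You said: {message.content}").send()

Saved as, say, app.py, it can be served with chainlit run app.py -w (the -w flag enables auto-reload) and will be available at http://localhost:8000. pdf_qa.py uses exactly these two hooks, just with more work inside them.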

pdf_qa.py

# Import necessary modules and define env variables

# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
import os
import io
import chainlit as cl
import PyPDF2
from io import BytesIO
from pprint import pprint
import inspect
# from langchain.vectorstores import ElasticsearchStore
from langchain_community.vectorstores import ElasticsearchStore
from elasticsearch import Elasticsearch
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ES_USER = os.getenv("ES_USER")
ES_PASSWORD = os.getenv("ES_PASSWORD")
elastic_index_name = 'pdf_docs'

# text_splitter and system template
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

system_template = """Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.
The "SOURCES" part should be a reference to the source of the document from which you got your answer.

Example of your response should be:

```
The answer is foo
SOURCES: xyz
```

Begin!
----------------
{summaries}"""

messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
prompt = ChatPromptTemplate.from_messages(messages)
chain_type_kwargs = {"prompt": prompt}


@cl.on_chat_start
async def on_chat_start():

    # Sending an image with the local file path
    elements = [
        cl.Image(name="image1", display="inline", path="./robot.jpeg")
    ]
    await cl.Message(content="Hello there, Welcome to AskAnyQuery related to Data!", elements=elements).send()
    files = None

    # Wait for the user to upload a PDF file
    while files is None:
        files = await cl.AskFileMessage(
            content="Please upload a PDF file to begin!",
            accept=["application/pdf"],
            max_size_mb=20,
            timeout=180,
        ).send()

    file = files[0]

    # print("type: ", type(file))
    # print("file: ", file)
    # pprint(vars(file))
    # print(file.content)

    msg = cl.Message(content=f"Processing `{file.name}`...")
    await msg.send()

    # Read the PDF file
    # pdf_stream = BytesIO(file.content)
    with open(file.path, 'rb') as f:
        pdf_content = f.read()
    pdf_stream = BytesIO(pdf_content)
    pdf = PyPDF2.PdfReader(pdf_stream)
    pdf_text = ""
    for page in pdf.pages:
        pdf_text += page.extract_text()

    # Split the text into chunks
    texts = text_splitter.split_text(pdf_text)

    # Create metadata for each chunk
    metadatas = [{"source": f"{i}-pl"} for i in range(len(texts))]

    # Create an Elasticsearch vector store
    embeddings = OpenAIEmbeddings()

    url = f"https://{ES_USER}:{ES_PASSWORD}@localhost:9200"

    connection = Elasticsearch(
        hosts=[url],
        ca_certs="./http_ca.crt",
        verify_certs=True
    )

    docsearch = None

    if not connection.indices.exists(index=elastic_index_name):
        print("The index does not exist, going to generate embeddings")
        docsearch = await cl.make_async(ElasticsearchStore.from_texts)(
            texts,
            embedding=embeddings,
            es_url=url,
            es_connection=connection,
            index_name=elastic_index_name,
            es_user=ES_USER,
            es_password=ES_PASSWORD,
            metadatas=metadatas
        )
    else:
        print("The index already exists")
        docsearch = ElasticsearchStore(
            es_connection=connection,
            embedding=embeddings,
            es_url=url,
            index_name=elastic_index_name,
            es_user=ES_USER,
            es_password=ES_PASSWORD
        )

    # Create a chain that uses the Elasticsearch vector store
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        ChatOpenAI(temperature=0),
        chain_type="stuff",
        retriever=docsearch.as_retriever(search_kwargs={"k": 4}),
    )

    # Save the metadata and texts in the user session
    cl.user_session.set("metadatas", metadatas)
    cl.user_session.set("texts", texts)

    # Let the user know that the system is ready
    msg.content = f"Processing `{file.name}` done. You can now ask questions!"
    await msg.update()

    cl.user_session.set("chain", chain)


@cl.on_message
async def main(message: cl.Message):

    chain = cl.user_session.get("chain")  # type: RetrievalQAWithSourcesChain
    print("chain type: ", type(chain))

    cb = cl.AsyncLangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["FINAL", "ANSWER"]
    )
    cb.answer_reached = True

    print("message: ", message)
    pprint(vars(message))
    print(message.content)

    res = await chain.acall(message.content, callbacks=[cb])
    answer = res["answer"]
    sources = res["sources"].strip()
    source_elements = []

    # Get the metadata and texts from the user session
    metadatas = cl.user_session.get("metadatas")
    all_sources = [m["source"] for m in metadatas]
    texts = cl.user_session.get("texts")
    print("texts: ", texts)

    if sources:
        found_sources = []

        # Add the sources to the message
        for source in sources.split(","):
            source_name = source.strip().replace(".", "")
            # Get the index of the source
            try:
                index = all_sources.index(source_name)
            except ValueError:
                continue
            text = texts[index]
            found_sources.append(source_name)
            # Create the text element referenced in the message
            source_elements.append(cl.Text(content=text, name=source_name))

        if found_sources:
            answer += f"\nSources: {', '.join(found_sources)}"
        else:
            answer += "\nNo sources found"

    if cb.has_streamed_final_answer:
        cb.final_stream.elements = source_elements
        await cb.final_stream.update()
    else:
        await cl.Message(content=answer, elements=source_elements).send()

We can run the application with the following commands:

export ES_USER="elastic"
export ES_PASSWORD="xnLj56lTrH98Lf_6n76y"
export OPENAI_API_KEY="YourOpenAiKey"
chainlit run pdf_qa.py -w

(.venv) $ chainlit run pdf_qa.py -w
2024-02-14 10:58:30 - Loaded .env file
2024-02-14 10:58:33 - Your app is available at http://localhost:8000
2024-02-14 10:58:34 - Translation file for en not found. Using default translation en-US.
2024-02-14 10:58:35 - 2 changes detected
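Because pdf_qa.py calls load_dotenv() (hence the "Loaded .env file" line in the output above), the same three settings can instead be kept in a .env file in the project directory, for example (the values below are the same placeholders as above):

ES_USER="elastic"
ES_PASSWORD="xnLj56lTrH98Lf_6n76y"
OPENAI_API_KEY="YourOpenAiKey"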

We first select the PDF file that comes with the project and ask questions such as:

Is sample PDF download critical to an organization?

Does comprehensive PDF testing have various advantages?
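After a PDF has been uploaded and processed, its text chunks and their embeddings live in the pdf_docs index, so the ingestion can be sanity-checked outside the app. Here is a small sketch of my own, reusing the same connection settings as pdf_qa.py:

import os
from dotenv import load_dotenv
from elasticsearch import Elasticsearch

# Read the same credentials pdf_qa.py uses
load_dotenv()
ES_USER = os.getenv("ES_USER")
ES_PASSWORD = os.getenv("ES_PASSWORD")

# Connect over HTTPS with the copied CA certificate
es = Elasticsearch(
    hosts=[f"https://{ES_USER}:{ES_PASSWORD}@localhost:9200"],
    ca_certs="./http_ca.crt",
    verify_certs=True,
)

# Number of chunks indexed from the uploaded PDF
print(es.count(index="pdf_docs")["count"])

The returned count should match the number of chunks produced by the text splitter for the uploaded file.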
