
Exploring LLM Application Development: Single-Document Chat with GPT and ChatGLM

I. Environment Setup

Use Anaconda to isolate the development environment.

macOS: download and install from the Anaconda website: https://www.anaconda.com/

Ubuntu server:

# Download the installer script
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh

# Run the installer
bash Anaconda3-2021.11-Linux-x86_64.sh

II. Building a PDF Chat System with LangChain and OpenAI

Environment setup

  1. Create and activate a conda environment
    conda create -n langchain python=3.9
    conda activate langchain
  2. Install the main dependencies: langchain, PyTorch, transformers, and openai (note that the PyPI package for PyTorch is torch, not pytorch)
    pip install langchain torch transformers openai

Core code

1. Load the PDF. Use LangChain's UnstructuredFileLoader to load the document and turn it into a single long text.

Note that UnstructuredFileLoader itself needs a few extra dependencies installed first:

# Install the dependencies
!pip install "unstructured[local-inference]"
!pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
!pip install layoutparser[layoutmodels,tesseract]
# Load the document
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("test_pdf.pdf")
docs = loader.load()
print(f'You have {len(docs)} document(s) in your data')
print(f'There are {len(docs[0].page_content)} characters in your document')
docs
You have 1 document(s) in your data
There are 18467 characters in your document
[Document(page_content='高端装备制造行业信用风险回顾与 2023 年展望\n\n行业信用质量:2023 年基本保持稳定 ● 随着新冠疫情趋稳,经济运行持续恢复,政策与需 求双驱动,2023 年高端装备制造业将保持较快增 长态势,其中航空装备、卫星制造、工业机器人等 子行业将实现较快增长,轨交装备子行业增长幅度 较小。 ● 2023 年,高端装备市场竞争主体仍将以大中型央 国企为主,多数民企规模较小、产品较..', metadata={'source': 'test_pdf.pdf'})]
2. Split the document into chunks with RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=0)
data = text_splitter.split_documents(docs)
data
[Document(page_content='高端装备制造行业信用风险回顾与 2023 年展望', metadata={'source': 'test_pdf.pdf'}), Document(page_content='行业信用质量:2023 年基本保持稳定 ● 随着新冠疫情趋稳,经济运行持续恢复,政策与需 求双驱动,2023 年高端装备制造业将保持较快增 长态势,其中航空装备、卫星制造、工业机器人等 子行业将实现较快增长,轨交装备子行业增长幅度 较小。…', metadata={'source': 'test_pdf.pdf'}), …]

(The omitted tail of the second chunk and the later chunks are mostly OCR noise from the charts and axis labels in the source PDF.)
3. Build the vector store. Embed the chunks with OpenAIEmbeddings and index them with FAISS. Other vector stores (Chroma, Deep Lake, Pinecone, etc.) can be used the same way; see the sketch after this code block.
from langchain.embeddings import OpenAIEmbeddings # for creating embeddings
from langchain.vectorstores import Chroma, FAISS

embeddings = OpenAIEmbeddings()
vectordb = FAISS.from_documents(data, embedding=embeddings)
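As a point of comparison, here is a minimal sketch of the same step with Chroma instead of FAISS, reusing the data and embeddings from above (the persist_directory path is an arbitrary example, not from the original code):

# Same embedding step with Chroma as the vector store (illustrative sketch)
from langchain.vectorstores import Chroma

chroma_db = Chroma.from_documents(data, embedding=embeddings,
                                  persist_directory='./chroma_test')  # on-disk index
retriever = chroma_db.as_retriever(search_kwargs={"k": 4})  # return the top-4 chunks per query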
4. Build the retrieval QA chain, which retrieves relevant chunks from the vector store and answers questions with an OpenAI model.

temperature controls the randomness of the output, in the range [0, 2]: 0 gives near-deterministic results, and higher values give more random results.

top_p is the nucleus-sampling parameter: the model samples only from the smallest set of tokens whose cumulative probability reaches top_p.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# Template that forces Chinese answers; without it the model sometimes answers in English
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

            {context}

            Question: {question}
            Answer in Chinese:"""
PROMPT = PromptTemplate(
                template=prompt_template, input_variables=["context", "question"]
            )
chain_type_kwargs = {"prompt": PROMPT}

# Build the retrieval QA chain; the model is gpt-3.5-turbo, and the prompt above is
# passed in through chain_type_kwargs so the 'stuff' chain actually uses it
chain = RetrievalQA.from_chain_type(llm=OpenAI(verbose=True, temperature=0, top_p=1.0, model_name='gpt-3.5-turbo'),
                                    chain_type='stuff', retriever=vectordb.as_retriever(),
                                    chain_type_kwargs=chain_type_kwargs)
5. Ask questions with the chain
query = "请总结这篇文章的内容"
result = chain({"query": query, "chat_history": ""})
print(result)

{'query': '请总结这篇文章的内容', 'chat_history': '', 'result': '本文主要讨论了2023年中国高端装备制造业的发展趋势和竞争格局,其中航空装备、卫星制造、机器人等子行业将实现较快增长,轨交装备子行业增长幅度较小。高端装备市场竞争主体仍将以大中型央国企为主,多数民企规模较小、产品较为单一、技术研发相对较弱,易受行业竞争、市场波动、技术更替等多重风险影响。此外,文章还讨论了卫星制造、工业机器人、铁路等领域的发展情况和企业信用评分预测。'}
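Note that RetrievalQA is single-turn: the chat_history key passed above is simply ignored. For multi-turn conversation over the same index, LangChain's ConversationalRetrievalChain (also used in the Gradio app in Part IV) can be swapped in. A minimal sketch, reusing the vectordb from step 3:

# Multi-turn QA over the same FAISS index (sketch; reuses vectordb from step 3)
from langchain.chains import ConversationalRetrievalChain
from langchain.llms import OpenAI

conv_chain = ConversationalRetrievalChain.from_llm(
    llm=OpenAI(temperature=0, model_name='gpt-3.5-turbo'),
    retriever=vectordb.as_retriever())

chat_history = []  # list of (question, answer) tuples
result = conv_chain({"question": "请总结这篇文章的内容", "chat_history": chat_history})
chat_history.append(("请总结这篇文章的内容", result["answer"]))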

III. Building a PDF Chat System with LangChain and ChatGLM-6B

Environment setup

1. Create and activate a conda environment

$ conda create -n langchain_GLM python=3.9
$ conda activate langchain_GLM

2. Install the dependencies

# Clone the repository
git clone https://github.com/imClumsyPanda/langchain-ChatGLM.git

# Enter the directory
cd langchain-ChatGLM

# The project replaced detectron2-based PDF loading with paddleocr; if detectron2 was
# installed earlier, uninstall it first to avoid a conflict over the tools package
pip uninstall detectron2

# Check the paddleocr dependencies; on Linux, paddleocr depends on libX11 and libXext
sudo apt-get install libxpm-dev libxrandr-dev libxrender-dev libxres-dev libxss-dev libxt-dev libxtst-dev libxv-dev libxvmc-dev
sudo apt-get install libx11-dev libxext-dev

# Install the Python dependencies
pip install -r requirements.txt

# Verify that paddleocr works; the first run downloads about 18 MB of models to ~/.paddleocr
python loader/image_loader.py

Core code

1. Load the model

from transformers import AutoTokenizer, AutoModel, AutoConfig
from langchain.embeddings import HuggingFaceEmbeddings

# Load ChatGLM-6B from a local checkpoint directory
tokenizer = AutoTokenizer.from_pretrained('/home/xs/software/chatglm-6b/', trust_remote_code=True)
config = AutoConfig.from_pretrained('/home/xs/software/chatglm-6b/', trust_remote_code=True)
model = AutoModel.from_pretrained('/home/xs/software/chatglm-6b/', config=config, trust_remote_code=True)
model = model.half().to('cuda:0')  # fp16 on the first GPU

# Chinese sentence-embedding model for the retrieval step
embedding_model = HuggingFaceEmbeddings(model_name='GanymedeNil/text2vec-large-chinese',
                                        model_kwargs={'device': 'cuda'})
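Before wiring in retrieval, it is worth a quick sanity check that the model generates; model.chat(tokenizer, query, history) is ChatGLM-6B's standard chat API:

# Quick smoke test of the raw model (no retrieval yet)
response, history = model.chat(tokenizer, "你好", history=[])
print(response)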

2. Load and split the document

# UnstructuredPaddlePDFLoader is the project's paddleocr-based rewrite of UnstructuredFileLoader
# ChineseTextSplitter is the project's rewrite of LangChain's TextSplitter for Chinese text
from loader import UnstructuredPaddlePDFLoader
from textsplitter import ChineseTextSplitter
from langchain.vectorstores import FAISS

loader = UnstructuredPaddlePDFLoader('../langchain-pdf/test_pdf.pdf')
textsplitter = ChineseTextSplitter(pdf=True, sentence_size=150)
docs = loader.load_and_split(textsplitter)

3. Build the vector store

from langchain.vectorstores import FAISS
vector_store = FAISS.from_documents(docs, embedding=embedding_model)
vector_store.save_local('./tmp/test')

4. Retrieve from the store by semantic similarity

chunk_conent toggles whether neighboring chunks are joined into the returned context (the spelling matches the project's attribute name).

score_threshold is the relevance threshold on the FAISS distance score: lower scores mean closer matches, so smaller thresholds are stricter. A value below 500 is suggested; tune it against your own test results.

k is the number of top matching results to return.

# similarity_search_with_score_by_vector is the project's rewritten version,
# monkey-patched here onto the stock FAISS class
from chains.local_doc_qa import similarity_search_with_score_by_vector
vector_store = FAISS.load_local('./tmp/test', embedding_model)
FAISS.similarity_search_with_score_by_vector = similarity_search_with_score_by_vector
vector_store.chunk_conent = True
vector_store.score_threshold = 0
vector_store.chunk_size = 250
related_docs_with_score = vector_store.similarity_search_with_score("请总结这篇文章的内容", k=5)
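To calibrate score_threshold against your own documents, it helps to print the distance score of each hit. A small sketch, assuming (as in the project's rewritten search, which generate_prompt below also relies on) that each returned Document carries its distance in metadata['score']:

# Inspect each hit's distance score to pick a sensible score_threshold (lower = closer)
for doc in related_docs_with_score:
    print(f"score={doc.metadata.get('score')}  source={doc.metadata.get('source')}")
    print(doc.page_content[:80])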

5. Build the prompt from the question

# Define the prompt template
from typing import List
PROMPT_TEMPLATE = """已知信息:
{context} 

根据上述已知信息,简洁和专业的来回答用户的问题。如果无法从中得到答案,请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”,不允许在答案中添加编造成分,答案请使用中文。 问题是:{question}"""

def generate_prompt(related_docs: List[str], query: str,
                    prompt_template=PROMPT_TEMPLATE) -> str:
    context = "\n".join([doc.page_content for doc in related_docs])
    prompt = prompt_template.replace("{question}", query).replace("{context}", context)
    return prompt

# Fill the template with the question and its matched document chunks
prompt = generate_prompt(related_docs=related_docs_with_score, query='请总结这篇文章的内容')

6. Generate the answer

response, history = model.chat(tokenizer, prompt, history=[])

IV. Publishing and Sharing the App

Streamlit

Streamlit's drawback: whenever any component changes, Streamlit reruns the whole script from top to bottom, which is inconvenient for this app, so it was switched to Gradio. The sketch below illustrates the rerun behavior.
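A minimal sketch of the rerun model (illustrative, not from the original app): every widget interaction re-executes the script, so any state that should survive must live in st.session_state rather than in a plain variable:

# Minimal sketch of Streamlit's rerun model (illustrative, not from the original app)
import streamlit as st

if "history" not in st.session_state:
    st.session_state.history = []      # survives reruns; a plain list would be reset

query = st.text_input("Question")      # editing this widget reruns the whole script
if st.button("Ask"):
    st.session_state.history.append(query)

st.write(st.session_state.history)     # accumulated across reruns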

Gradio

Gradio provides many ready-made web components, which makes it easy to deploy and share an AI demo. The complete page code follows the short sketch below; the page is served at http://localhost:7860, and setting share=True (and optionally show_api=True) in demo.launch() generates a public link through which others can reach the app.
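To show just the sharing part in isolation (a minimal sketch; the echo handler is a stand-in, not part of the app below):

# Minimal share example; share=True creates a temporary public gradio.live URL
import gradio as gr

def echo(message):
    return message  # stand-in handler

with gr.Blocks() as mini_demo:
    box = gr.Textbox(label="input")
    out = gr.Textbox(label="output")
    box.submit(echo, inputs=box, outputs=out)

mini_demo.launch(share=True, show_api=True)

The full application: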

import gradio as gr
import os
from langchain.document_loaders import UnstructuredPDFLoader  # for loading the pdf
from langchain.embeddings import OpenAIEmbeddings             # for creating embeddings
from langchain.vectorstores import Chroma, FAISS, Pinecone    # vector stores
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.prompts import PromptTemplate
import openai
from openai.error import InvalidRequestError
import pinecone


# initialize pinecone
pinecone.init(
    api_key="YOUR_PINECONE_API_KEY",  # find at app.pinecone.io
    environment="us-central1-gcp"     # next to api key in console
)
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

chain = None
his_len = 20
is_trans = "True"  # the Translate radio below passes "True"/"False" as strings


def load_file(file_obj):
    # Parse the uploaded PDF into the global `pages` for later splitting
    print(file_obj.name)
    print('正在处理中,请稍等.....')
    loader = UnstructuredPDFLoader(file_obj.name)
    global pages
    pages = loader.load()
    print('处理完毕')
    print(f'You have {len(pages)} document(s) in your data')
    print(f'There are {len(pages[0].page_content)} characters in your document')

def is_contains_english(strs):
    # Heuristic: returns True when the string contains no Chinese characters
    for _char in strs:
        if '\u4e00' <= _char <= '\u9fa5':
            return False
    return True

def conversational_chat(query, history):
    global is_trans
    global his_len
    global chain
    history = [tuple(his) for his in history[-his_len:]]

    try:
        result = chain({"question": query, "chat_history": history})
        answer = result["answer"]
    except InvalidRequestError as e:
        print(e._message)
        answer = e._message
    except ValueError:
        # RetrievalQA chains take "query" and return "result" instead of "answer"
        result = chain({"query": query, "chat_history": history})
        answer = result["result"]
    print(answer)
    answer_chunk = answer[:30]

    if is_trans == "True":
        # If almost all of the first 30 characters are non-Chinese, translate the answer
        if sum([is_contains_english(c) for c in answer_chunk]) >= 18:
            openai.api_key = os.getenv("OPENAI_API_KEY")
            sentences = "请把接下来的段落翻译成中文。 段落:" + answer
            response = openai.Completion.create(
                model="text-davinci-003",
                prompt=sentences,
                temperature=0.0,
                top_p=1.0,
                max_tokens=1000,
            )
            answer = response['choices'][0]['text'].strip()

    history.append((query, answer))
    return "", history

def clear_session():
    history = []
    return '', None


def confirm_setting(chunk_size, chunk_overlap, embeddings, vectore_store, top_p, history_len,
                    chain_type, chain_choose, large_language_model, temperature, is_translate):
    print('temperature', temperature)
    print('languagemodel', large_language_model)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    data = text_splitter.split_documents(pages)
    global his_len
    his_len = history_len

    global is_trans
    is_trans = is_translate
    print(his_len, is_trans)

    embeddings = OpenAIEmbeddings()
    if vectore_store == 'FAISS':
        vectors = FAISS.from_documents(data, embedding=embeddings)
    elif vectore_store == 'Chroma':
        vectors = Chroma.from_documents(data, embedding=embeddings)
    elif vectore_store == 'PineCone':
        vectors = Pinecone.from_documents(data, embedding=embeddings, index_name="langchain-demo")

    # The dropdown values are exactly the OpenAI model names, so model_name can be passed through
    global chain
    if chain_choose == 'ConversationalRetrievalChain':
        chain = ConversationalRetrievalChain.from_llm(
            llm=OpenAI(temperature=temperature, model_name=large_language_model),
            retriever=vectors.as_retriever(), chain_type=chain_type)
    elif chain_choose == 'RetrievalQA':
        if chain_type == 'stuff':
            # Force Chinese answers when the whole context is stuffed into one prompt
            prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

            {context}

            Question: {question}
            Answer in Chinese:"""
            PROMPT = PromptTemplate(
                template=prompt_template, input_variables=["context", "question"]
            )
            chain_type_kwargs = {"prompt": PROMPT}
        else:
            chain_type_kwargs = None
        print(chain_type_kwargs)
        chain = RetrievalQA.from_chain_type(
            llm=OpenAI(verbose=True, temperature=temperature, top_p=top_p,
                       model_name=large_language_model),
            chain_type=chain_type, retriever=vectors.as_retriever(),
            chain_type_kwargs=chain_type_kwargs)
    print('确定完毕')

with gr.Blocks() as demo:
    global history
    history = []

    gr.Markdown("""<h1><center>BIS-OPENAI</center></h1>
        <center><font size=3>
        基于大语言模型的PDF对话应用 <br>
        </center></font>
        """)
    with gr.Row():
        with gr.Column(scale=1):
            model_choose = gr.Accordion("模型选择")
            with model_choose:
                large_language_model = gr.Dropdown(
                    ['gpt-3.5-turbo','text-davinci-003','text-curie-001',"davinci"],
                    label="large language model",
                    value="gpt-3.5-turbo",interactive=True)
                
                embedding_model = gr.Dropdown(["OpenAIEmbeddings"],
                                                label="Embedding model",
                                                value="OpenAIEmbeddings",interactive=True)
                
            file = gr.File(label='请上传pdf文件',
                                file_types=['.pdf'])

            btn_parser = gr.Button('解析文件',)

            # status = gr.Accordion("状态栏显示")
            btn_parser.click(load_file,file)
            is_translate = gr.Radio(["True", "False"],
                            label="Translate",
                            value="True")
            
        with gr.Column(scale=1):
            model_argument = gr.Accordion("模型参数配置")
            with model_argument:
                top_p = gr.Slider(0,
                                    1.0,
                                    value=1.0,
                                    step=0.1,
                                    label="top_p",
                                    interactive=True)

                history_len = gr.Slider(0,
                                        20,
                                        value=20,
                                        step=1,
                                        label="history len",
                                        interactive=True)

                temperature = gr.Slider(0.0,
                                        1.0,
                                        value=0.0,
                                        step=0.1,
                                        label="temperature",
                                        interactive=True)
                
                chunk_size = gr.Slider(0,
                                        3000,
                                        value=800,
                                        step=100,
                                        label="chunk_size",
                                        interactive=True)
                
                chunk_overlap = gr.Slider(0,
                                        1500,
                                        value=100,
                                        step=50,
                                        label="chunk_overlap",
                                        interactive=True)
                
                vectore_store = gr.Dropdown(
                    ['FAISS','Chroma','Deep Lake','PineCone'],
                    label="vectore store",
                    value="FAISS",interactive=True)
                
                chain_choose = gr.Dropdown(['ConversationalRetrievalChain','RetrievalQA'],                                                       
                    label="qa_chain",
                    value="RetrievalQA",interactive=True)

                chain_type = gr.Dropdown(
                    ['stuff','map_reduce','refine','map_rerank'],
                    label="chain_type",
                    value="stuff",interactive=True)
                
                btn_save = gr.Button("确定模型设置")
                
                btn_save.click(confirm_setting,inputs=[chunk_size,chunk_overlap,embedding_model,vectore_store,top_p,history_len,chain_type,chain_choose,large_language_model,temperature,is_translate])
                

        with gr.Column(scale=4):
                chatbot = gr.Chatbot(show_label=False).style(height=500)
                message = gr.Textbox(label='请输入问题')
                state = gr.State()

                with gr.Row():
                    clear_history = gr.Button("清空历史")
                    # Event wiring reconstructed from the handlers above
                    # (an assumption; the original post is cut off at this point)
                    clear_history.click(clear_session, inputs=[], outputs=[message, chatbot])
                    message.submit(conversational_chat,
                                   inputs=[message, chatbot],
                                   outputs=[message, chatbot])

demo.launch()