Reposted from: [RAG in Practice] Building a QA bot over a local knowledge base with LlamaIndex and Qwen1.5
https://mp.weixin.qq.com/s/RwjywuzfswSCKP_lw135YA
Source code:
https://github.com/modelscope/modelscope/blob/master/examples/pytorch/application/qwen1.5_doc_search_QA_based_on_langchain.ipynb
This walkthrough uses qwen/Qwen1.5-4B-Chat as the chat model and iic/nlp_gte_sentence-embedding_chinese-base as the embedding model. Install the dependencies:
pip install llama-index llama-index-llms-huggingface ipywidgets
pip install transformers -U
The knowledge-base document is an introduction to Xi'an Jiaotong University:
https://modelscope.oss-cn-beijing.aliyuncs.com/resource/rag/xianjiaoda.md
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from IPython.display import Markdown, display
import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts import PromptTemplate
from modelscope import snapshot_download
from llama_index.core.base.embeddings.base import BaseEmbedding, Embedding
from abc import ABC
from typing import Any, List, Optional, Dict, cast
from llama_index.core import (
    VectorStoreIndex,
    ServiceContext,
    set_global_service_context,
    SimpleDirectoryReader,
)
Download the model with modelscope's snapshot_download. Since Qwen1.5 is supported by Transformers, load the model (Qwen1.5-4B-Chat) with HuggingFaceLLM:
qwen2_4B_CHAT = "qwen/Qwen1.5-4B-Chat"
selected_model = snapshot_download(qwen2_4B_CHAT)
SYSTEM_PROMPT = """You are a helpful AI assistant.
"""
query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
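The wrapper template simply interpolates the user question into the Llama-style instruction format before it reaches the model. A plain-Python check of what the wrapped prompt looks like (the variable names just mirror the snippet above; `str.format` stands in for `PromptTemplate.format` on a single variable):

```python
SYSTEM_PROMPT = "You are a helpful AI assistant.\n"
template = "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "

# str.format mimics PromptTemplate.format for a single {query_str} slot
prompt = template.format(query_str="西安交大是由哪几个学校合并的?")
print(prompt)
```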
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="auto",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16},
)
Load the GTE model and wrap it in an Embedding class:
embedding_model = "iic/nlp_gte_sentence-embedding_chinese-base"
class ModelScopeEmbeddings4LlamaIndex(BaseEmbedding, ABC):
    embed: Any = None
    model_id: str = "iic/nlp_gte_sentence-embedding_chinese-base"

    def __init__(
        self,
        model_id: str,
        **kwargs: Any,
    ) -> None:
        # pass model_id through so the pydantic field actually picks up the argument
        super().__init__(model_id=model_id, **kwargs)
        try:
            from modelscope.pipelines import pipeline
            from modelscope.utils.constant import Tasks

            # ModelScope embedding pipeline (downloads the model on first use)
            self.embed = pipeline(Tasks.sentence_embedding, model=self.model_id)
        except ImportError as e:
            raise ValueError(
                "Could not import some python packages. "
                "Please install them with `pip install modelscope`."
            ) from e

    def _get_query_embedding(self, query: str) -> List[float]:
        text = query.replace("\n", " ")
        inputs = {"source_sentence": [text]}
        return self.embed(input=inputs)["text_embedding"][0].tolist()

    def _get_text_embedding(self, text: str) -> List[float]:
        text = text.replace("\n", " ")
        inputs = {"source_sentence": [text]}
        return self.embed(input=inputs)["text_embedding"][0].tolist()

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        texts = [t.replace("\n", " ") for t in texts]
        inputs = {"source_sentence": texts}
        return self.embed(input=inputs)["text_embedding"].tolist()

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)
After loading the data, build an index over the list of Document objects (or nodes) so they can be retrieved conveniently.
embeddings = ModelScopeEmbeddings4LlamaIndex(model_id=embedding_model)
'''
ModelScopeEmbeddings4LlamaIndex(model_name='unknown', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7faa7c073a60>, embed=<modelscope.pipelines.nlp.sentence_embedding_pipeline.SentenceEmbeddingPipeline object at 0x7faaac16ce20>, model_id='iic/nlp_gte_sentence-embedding_chinese-base')
'''
service_context = ServiceContext.from_defaults(embed_model=embeddings, llm=llm)
set_global_service_context(service_context)
documents = SimpleDirectoryReader("xxx/datasets/xianjiaoda/").load_data()
index = VectorStoreIndex.from_documents(documents)
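from_documents splits each document into chunks, embeds every chunk with the GTE model, and stores the vectors in the index. A toy illustration of the splitting step, assuming naive fixed-size chunking (LlamaIndex's default splitter is actually sentence-aware and uses overlap):

```python
def chunk_text(text: str, size: int = 20) -> list[str]:
    # naive fixed-size splitter; real splitters respect sentence boundaries
    return [text[i:i + size] for i in range(0, len(text), size)]

doc_text = "西安交通大学是我国最早兴办、享誉海内外的著名高等学府。" * 4
chunks = chunk_text(doc_text)
print(len(chunks), [len(c) for c in chunks])
```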
The documents structure looks like this:
[Document(
    id_='92d9a2fd-d55d-4d95-bff1-52f0f1262d31',
    embedding=None,
    metadata={
        'file_path': '/home/xx/datasets/xianjiaoda/xianjiaoda.md',
        'file_name': '/home/xx/datasets/xianjiaoda/xianjiaoda.md',
        'file_type': 'text/markdown',
        'file_size': 13228,
        'creation_date': '2024-03-20',
        'last_modified_date': '2024-01-16'
    },
    excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'],
    excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'],
    relationships={},
    text='西安交通大学是我国最早兴办、享誉海内外的著名高等学府,是教育部直属重点大学。...\n\n精髓:与xxx民同呼吸、共命运\n\n...',
    start_char_idx=None,
    end_char_idx=None,
    text_template='{metadata_str}\n\n{content}',
    metadata_template='{key}: {value}',
    metadata_seperator='\n')]
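text_template controls how a node is rendered before it is embedded or shown to the LLM: metadata entries (minus any keys in the excluded_* lists) are formatted with metadata_template, joined with metadata_seperator, and prepended to the content. A plain-Python sketch of that rendering, using a hypothetical shortened file_path value:

```python
# hypothetical metadata after excluded keys are filtered out
metadata = {"file_path": "xianjiaoda.md"}
metadata_template = "{key}: {value}"
metadata_seperator = "\n"  # (sic: LlamaIndex spells the field "seperator")
text_template = "{metadata_str}\n\n{content}"

metadata_str = metadata_seperator.join(
    metadata_template.format(key=k, value=v) for k, v in metadata.items()
)
rendered = text_template.format(metadata_str=metadata_str, content="西安交通大学是...")
print(rendered)
```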
query_engine = index.as_query_engine()
response = query_engine.query("西安交大是由哪几个学校合并的?")
print(response)
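Under the hood, query_engine.query() embeds the question with the GTE model, retrieves the nearest chunks by cosine similarity, and passes them to Qwen for answer synthesis. A toy sketch of the retrieval step, with hand-picked 3-dimensional vectors standing in for real 768-dimensional GTE embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# hypothetical chunk embeddings stored in the vector index
store = {
    "chunk about school mergers": [0.9, 0.1, 0.0],
    "chunk about campus life": [0.1, 0.9, 0.1],
}
query_vec = [0.8, 0.2, 0.1]  # embedding of the user question
best = max(store, key=lambda k: cosine(query_vec, store[k]))
print(best)
```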
伊织 2024-03-20 (Wed)