
RAG: Building a Local Knowledge Base with LlamaIndex + ModelScope + Qwen1.5


Project Overview

Reposted from: [RAG in Practice] Building a Q&A bot over a local knowledge base with LlamaIndex and Qwen1.5
https://mp.weixin.qq.com/s/RwjywuzfswSCKP_lw135YA
Source code:
https://github.com/modelscope/modelscope/blob/master/examples/pytorch/application/qwen1.5_doc_search_QA_based_on_langchain.ipynb


  • Download the qwen/Qwen1.5-4B-Chat model with ModelScope's snapshot_download
  • Build an embedding class by subclassing llama_index's BaseEmbedding
  • Use iic/nlp_gte_sentence-embedding_chinese-base as the embedding model
  • Read the document data with llama_index.core.SimpleDirectoryReader
  • Build an index over the documents with llama_index.core.VectorStoreIndex and create a query_engine from it

Setup

1. Install dependencies

pip install llama-index llama-index-llms-huggingface ipywidgets
pip install transformers -U

2. Prepare the data file

An introduction to Xi'an Jiaotong University:
https://modelscope.oss-cn-beijing.aliyuncs.com/resource/rag/xianjiaoda.md
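One way to fetch the sample document into a local folder (the target directory below is an assumption; point SimpleDirectoryReader at the same path later):

```python
import os
import urllib.request

URL = "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/rag/xianjiaoda.md"
target_dir = "datasets/xianjiaoda"

# create the data directory, then try to download the markdown file into it
os.makedirs(target_dir, exist_ok=True)
target_path = os.path.join(target_dir, "xianjiaoda.md")
try:
    urllib.request.urlretrieve(URL, target_path)
except OSError:
    # offline: the directory still exists; download the file manually later
    pass
```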


Implementation

1. Imports

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


from IPython.display import Markdown, display
import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts import PromptTemplate
from modelscope import snapshot_download
from llama_index.core.base.embeddings.base import BaseEmbedding, Embedding
from abc import ABC
from typing import Any, List, Optional, Dict, cast
from llama_index.core import (
    VectorStoreIndex,
    ServiceContext,
    set_global_service_context,
    SimpleDirectoryReader,
)


2. Download and load the LLM

Download the model with ModelScope's snapshot_download; since Qwen1.5 is supported by Transformers, load the model (Qwen1.5-4B-Chat) with HuggingFaceLLM.

qwen1_5_4B_CHAT = "qwen/Qwen1.5-4B-Chat"

selected_model = snapshot_download(qwen1_5_4B_CHAT)

SYSTEM_PROMPT = """You are a helpful AI assistant.
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=2048,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="auto",
    # change these settings below depending on your GPU
    model_kwargs={"torch_dtype": torch.float16},
)
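As a plain-string illustration (not part of the original code), the wrapper above wraps each query in a Llama-2-style prompt; whether this wrapper is ideal for Qwen1.5's own chat format is a separate question:

```python
# Reproduce the wrapping that query_wrapper_prompt applies, using plain str.format.
SYSTEM_PROMPT = "You are a helpful AI assistant.\n"
template = "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "

prompt = template.format(query_str="西安交大是由哪几个学校合并的?")
print(prompt)
```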

3. Build the embedding class

Load the GTE model and use it to build an embedding class.

embedding_model = "iic/nlp_gte_sentence-embedding_chinese-base"

class ModelScopeEmbeddings4LlamaIndex(BaseEmbedding, ABC):
    embed: Any = None
    model_id: str = "iic/nlp_gte_sentence-embedding_chinese-base"

    def __init__(
            self,
            model_id: str,
            **kwargs: Any,
    ) -> None:
        # pass model_id through to the pydantic base class so self.model_id is set
        super().__init__(model_id=model_id, **kwargs)
        try:
            from modelscope.pipelines import pipeline
            from modelscope.utils.constant import Tasks
            # load ModelScope's embedding pipeline (downloads the model on first use)
            self.embed = pipeline(Tasks.sentence_embedding, model=self.model_id)
        except ImportError as e:
            raise ValueError(
                "Could not import modelscope. Please install it with `pip install modelscope`."
            ) from e

    def _get_query_embedding(self, query: str) -> List[float]:
        text = query.replace("\n", " ")
        inputs = {"source_sentence": [text]}
        return self.embed(input=inputs)['text_embedding'][0].tolist()

    def _get_text_embedding(self, text: str) -> List[float]:
        text = text.replace("\n", " ")
        inputs = {"source_sentence": [text]}
        return self.embed(input=inputs)['text_embedding'][0].tolist()

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        texts = list(map(lambda x: x.replace("\n", " "), texts))
        inputs = {"source_sentence": texts}
        return self.embed(input=inputs)['text_embedding'].tolist()

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)
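To make the interface contract concrete, here is a toy stand-in (character-code vectors, not the GTE model) showing the shapes the methods above must return: a single vector for a query or text, and a batch of vectors for the batched case:

```python
from typing import List

DIM = 8  # toy vector size; the real GTE model outputs 768 dimensions

def dummy_embed(texts: List[str]) -> List[List[float]]:
    """Map each sentence to a unit-length DIM-dimensional vector (illustrative only)."""
    vecs = []
    for t in texts:
        v = [0.0] * DIM
        for i, ch in enumerate(t):
            v[i % DIM] += ord(ch)
        norm = sum(x * x for x in v) ** 0.5 or 1.0
        vecs.append([x / norm for x in v])
    return vecs

# _get_text_embeddings embeds many sentences in one pipeline call;
# _get_query_embedding / _get_text_embedding are the single-sentence cases.
batch = dummy_embed(["西安交通大学", "知识库问答"])
single = dummy_embed(["西安交通大学"])[0]
```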

4. Build the index

After loading the data, build an index over the list of document objects (or nodes) so that they can be retrieved conveniently.

embeddings = ModelScopeEmbeddings4LlamaIndex(model_id=embedding_model)

'''
ModelScopeEmbeddings4LlamaIndex(model_name='unknown', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7faa7c073a60>, embed=<modelscope.pipelines.nlp.sentence_embedding_pipeline.SentenceEmbeddingPipeline object at 0x7faaac16ce20>, model_id='iic/nlp_gte_sentence-embedding_chinese-base')
'''

# Note: ServiceContext is deprecated in newer llama-index releases in favor of llama_index.core.Settings
service_context = ServiceContext.from_defaults(embed_model=embeddings, llm=llm)

set_global_service_context(service_context)

documents = SimpleDirectoryReader("xxx/datasets/xianjiaoda/").load_data()
index = VectorStoreIndex.from_documents(documents)
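A back-of-the-envelope sketch of what retrieval over the index does (assuming cosine similarity over node embeddings and a top-k cutoff; this is illustrative, not LlamaIndex's actual implementation):

```python
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    # cosine similarity between two vectors; 0.0 if either is all-zero
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: List[float],
             nodes: List[Tuple[str, List[float]]],
             k: int = 2) -> List[str]:
    # score every node chunk against the query, keep the k best
    scored = sorted(nodes, key=lambda n: cosine(query_vec, n[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# toy 2-dimensional "embeddings" for three document chunks
nodes = [
    ("chunk about mergers", [1.0, 0.0]),
    ("chunk about campus",  [0.0, 1.0]),
    ("chunk about history", [0.7, 0.7]),
]
hits = retrieve([1.0, 0.1], nodes, k=2)
```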


The resulting documents structure looks like this:

[Document(
id_ = '92d9a2fd-d55d-4d95-bff1-52f0f1262d31', 
embedding = None, 
metadata = {
	'file_path': '/home/xx/datasets/xianjiaoda/xianjiaoda.md',
	'file_name': '/home/xx/datasets/xianjiaoda/xianjiaoda.md',
	'file_type': 'text/markdown',
	'file_size': 13228,
	'creation_date': '2024-03-20',
	'last_modified_date': '2024-01-16'
}, 
excluded_embed_metadata_keys = ['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 
excluded_llm_metadata_keys = ['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], 
relationships = {}, 
text = '西安交通大学是我国最早兴办、享誉海内外的著名高等学府,是教育部直属重点大学。...\n\n精髓:与xxx民同呼吸、共命运\n\n...', 
start_char_idx = None, 
end_char_idx = None, 
text_template = '{metadata_str}\n\n{content}', 
metadata_template = '{key}: {value}', 
metadata_seperator = '\n')]

5. Query and Q&A

query_engine = index.as_query_engine()
response = query_engine.query("西安交大是由哪几个学校合并的?")
print(response)
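Under the hood, the query engine stuffs the retrieved chunks and the question into a single prompt for the LLM. A rough sketch (the template wording here is an assumption, not LlamaIndex's exact default):

```python
from typing import List

def build_qa_prompt(context_chunks: List[str], question: str) -> str:
    # join the retrieved chunks, then ask the LLM to answer from that context only
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        f"Given the context, answer the question: {question}\n"
    )

prompt = build_qa_prompt(
    ["西安交通大学是我国最早兴办、享誉海内外的著名高等学府"],
    "西安交大是由哪几个学校合并的?",
)
```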


伊织, 2024-03-20 (Wed)
