
RAG Code Practice with Qwen2

Qwen2 was released a few days ago, and judging from the comparisons with other models, it looks quite strong. Let's build something with the new model and see how it does - here, a RAG test.
RAG complements a large language model and can be grouped under prompt engineering: it supplies knowledge the model does not have. It is interpretable and relatively easy to implement.

The logic of RAG is simple and needs no fine-tuning: it is essentially an external knowledge base bolted onto the model. Getting good results, however, takes careful work.
Key difficulties: building the vector database, choosing the embedding model, and the retrieval method.
RAG lets the model retrieve relevant information from an external knowledge base while generating text, improving the accuracy, relevance, and timeliness of the output.

Data download link: we use an automotive knowledge Q&A dataset, which is relatively simple to process.


RAG steps:

  1. Prepare the document data
  2. Build the vector store
  3. Query the vector store with the question vector
  4. Combine the question and the retrieved content into a new prompt
  5. Feed the new prompt to the LLM and return the answer

Each step is analyzed below.

1. Prepare the data

Simply put, this step just reads the data. Data comes in many formats; here the questions are JSON and the knowledge document is a PDF.

import jieba, json, pdfplumber

# Split long text into fixed-size chunks, with a small overlap between neighbouring chunks
def split_text_fixed_size(text, chunk_size, overlap_size):
    new_text = []
    for i in range(0, len(text), chunk_size):
        if i == 0:
            new_text.append(text[0:chunk_size])
        else:
            new_text.append(text[i - overlap_size:i + chunk_size])
            # new_text.append(text[i:i + chunk_size])
    return new_text

def read_data(query_data_path, knowledge_data_path):
    with open(query_data_path, 'r', encoding='utf-8') as f:
        questions = json.load(f)

    pdf = pdfplumber.open(knowledge_data_path)
    # Record each page together with its text chunks
    pdf_content = []
    for page_idx in range(len(pdf.pages)):
        text = pdf.pages[page_idx].extract_text()
        new_text = split_text_fixed_size(text, chunk_size=100, overlap_size=5)
        for chunk_text in new_text:
            pdf_content.append({
                'page'   : 'page_' + str(page_idx + 1),
                'content': chunk_text
            })
    return questions, pdf_content
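
As a quick sanity check, a minimal usage sketch might look like the following (the file names are placeholders for wherever the dataset was downloaded):

questions, pdf_content = read_data(
    query_data_path='questions.json',          # placeholder path to the question file
    knowledge_data_path='初赛训练数据集.pdf'     # placeholder path to the knowledge PDF
)
print(len(questions), len(pdf_content))        # number of questions / number of text chunks
print(pdf_content[0]['page'], pdf_content[0]['content'][:50])  # first chunk and its page label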

2. Build the vector store

Workflow: use a model to encode the knowledge into vectors, and store them on disk or in memory.
The workflow looks simple, but many details affect the final quality in practice, for example which model to use for the embeddings.
This post only demonstrates the workflow and leaves those details aside.

Many models can map sentences to vectors; the commonly used ones are mostly BERT derivatives, see the MTEB leaderboard:
https://huggingface.co/spaces/mteb/leaderboard
Here I use stella_base_zh_v3_1792d; for better embeddings, the model should be fine-tuned on the data at hand.
Several vector stores can be built with different methods, and their results re-ranked at retrieval time to improve recall.
Two retrieval methods are used here.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

# Lexical (BM25) retrieval index over jieba-tokenized chunks
pdf_content_words = [jieba.lcut(x['content']) for x in pdf_content]
bm25 = BM25Okapi(pdf_content_words)

# Semantic retrieval index: sentence embeddings
model = SentenceTransformer(
        # 'E:\PyCharm\PreTrainModel\stella_base_zh_v3_1792d'
        '/mnt/e/PyCharm/PreTrainModel/stella_base_zh_v3_1792d',
        # '/mnt/e/PyCharm/PreTrainModel/moka_aim3e_small',
)
question_sentences = [x['question'] for x in questions]
pdf_content_sentences = [x['content'] for x in pdf_content]

question_embeddings = model.encode(question_sentences, normalize_embeddings=True)
pdf_embeddings = model.encode(pdf_content_sentences, normalize_embeddings=True)
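
The workflow above mentions that the vectors can be persisted to disk; a minimal sketch with numpy (the file name is an arbitrary placeholder) could look like this:

import numpy as np

# Persist the chunk embeddings so they need not be recomputed on every run
np.save('pdf_embeddings.npy', pdf_embeddings)

# Later, or in another process: load them back and score a query as before
pdf_embeddings = np.load('pdf_embeddings.npy')
score = question_embeddings[0] @ pdf_embeddings.T  # dot product equals cosine similarity since the vectors are normalized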


3. Vector retrieval

The query vector is scored for similarity against every vector in the store, and the highest-scoring entries are returned. To get higher recall, though, several methods need to be combined. As mentioned in the previous section, multiple vector stores can be built: the query is scored against every entry in each store, the top-k results returned by each store are then re-ranked together, and the final top-k is taken from the merged list. This multi-path recall with result re-ranking can significantly improve RAG quality.
Here we combine lexical retrieval and semantic retrieval, each returning its top-k results, and then re-rank them with the bge-reranker-base model.
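
The snippet below assumes the re-ranker and its tokenizer are already loaded, roughly as in the full script further down (the local path is only an example):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example local path; point it at your copy of bge-reranker-base
rerank_path = r'E:\PyCharm\PreTrainModel\bge-reranker-base'
tokenizer = AutoTokenizer.from_pretrained(rerank_path)
rerank_model = AutoModelForSequenceClassification.from_pretrained(rerank_path)
rerank_model.cuda()
rerank_model.eval()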

# Use the re-ranking model to get the best score and the index of the best-scoring chunk
def get_rank_index(max_score_page_idxs_, questions_, pdf_content_):
    pairs = []
    for idx in max_score_page_idxs_:
        pairs.append([questions_[query_idx]["question"], pdf_content_[idx]['content']])

    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        inputs = {key: inputs[key].cuda() for key in inputs.keys()}
        scores = rerank_model(**inputs, return_dict=True).logits.view(-1, ).float()

    scores = scores.cpu().numpy()
    max_score = scores.max()
    index = max_score_page_idxs_[scores.argmax()]

    return max_score, index


for query_idx in range(len(questions)):
    # First: BM25 lexical retrieval, keep the 10 highest-scoring chunks
    doc_scores = bm25.get_scores(jieba.lcut(questions[query_idx]["question"]))
    bm25_score_page_idxs = doc_scores.argsort()[-10:]

    # Then: semantic retrieval, keep the 10 highest-scoring chunks
    score = question_embeddings[query_idx] @ pdf_embeddings.T
    ste_score_page_idxs = score.argsort()[-10:]

    # Re-rank both candidate lists and keep the better result as the reference
    bm25_score, bm25_index = get_rank_index(bm25_score_page_idxs, questions, pdf_content)
    ste_score, ste_index = get_rank_index(ste_score_page_idxs, questions, pdf_content)

    if ste_score >= bm25_score:
        questions[query_idx]['reference'] = 'page_' + str(ste_index + 1)
    else:
        questions[query_idx]['reference'] = 'page_' + str(bm25_index + 1)

4 & 5. Build the new prompt and run RAG inference with the LLM

Putting the modules above together, with Qwen as the inference LLM:

# -*- coding: utf-8 -*-
# @Time    : 2024/6/13 23:41
# @Author  : yblir
# @File    : qwen2_rag_test.py
# explain  : 
# =======================================================
# from openai import OpenAI
import jieba, json, pdfplumber
# import numpy as np
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.preprocessing import normalize
from rank_bm25 import BM25Okapi
# import requests
# Load the re-ranking model (bge-reranker-base)
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

# client = OpenAI(api_key="sk-13c3a38819f84babb5cd298e001a10cb", base_url="https://api.deepseek.com")
device = "cuda"

rerank_tokenizer = AutoTokenizer.from_pretrained(r'E:\PyCharm\PreTrainModel\bge-reranker-base')
rerank_model = AutoModelForSequenceClassification.from_pretrained(r'E:\PyCharm\PreTrainModel\bge-reranker-base')
rerank_model.cuda()

model_path = r'E:\PyCharm\PreTrainModel\qwen2-1_5b'
# model_path = r'E:\PyCharm\PreTrainModel\qwen_7b_chat'
# model_path = r'E:\PyCharm\PreTrainModel\qwen2_7b_instruct_awq_int4'
tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        # trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_path,
        torch_dtype="auto",
        device_map="auto",
        # trust_remote_code=True
        # attn_implementation="flash_attention_2"
)


# Split long text into fixed-size chunks, with a small overlap between neighbouring chunks
def split_text_fixed_size(text, chunk_size, overlap_size):
    new_text = []
    for i in range(0, len(text), chunk_size):
        if i == 0:
            new_text.append(text[0:chunk_size])
        else:
            new_text.append(text[i - overlap_size:i + chunk_size])
            # new_text.append(text[i:i + chunk_size])
    return new_text


def get_rank_index(max_score_page_idxs_, questions_, pdf_content_):
    pairs = []
    for idx in max_score_page_idxs_:
        pairs.append([questions_[query_idx]["question"], pdf_content_[idx]['content']])

    inputs = rerank_tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        inputs = {key: inputs[key].cuda() for key in inputs.keys()}
        scores = rerank_model(**inputs, return_dict=True).logits.view(-1, ).float()

    scores = scores.cpu().numpy()
    max_score = scores.max()
    index = max_score_page_idxs_[scores.argmax()]

    return max_score, index


def read_data(query_data_path, knowledge_data_path):
    with open(query_data_path, 'r', encoding='utf-8') as f:
        questions = json.load(f)

    pdf = pdfplumber.open(knowledge_data_path)
    # Record each page together with its text chunks
    pdf_content = []
    for page_idx in range(len(pdf.pages)):
        text = pdf.pages[page_idx].extract_text()
        new_text = split_text_fixed_size(text, chunk_size=100, overlap_size=5)
        for chunk_text in new_text:
            pdf_content.append({
                'page'   : 'page_' + str(page_idx + 1),
                'content': chunk_text
            })
    return questions, pdf_content


def qwen_preprocess(tokenizer_, ziliao, question):
    """
    After processing, the messages have the following format (replace the system prompt with your own):
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language model named Qwen..."}
    ]
    """
    # tokenizer.apply_chat_template() is meant to be used together with model.generate()
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"帮我结合给定的资料,回答问题。如果问题答案无法从资料中获得,"
                                    f"输出结合给定的资料,无法回答问题. 如果找到答案, 就输出找到的答案, 资料:{ziliao}, 问题:{question}"},
    ]
    # add_generation_prompt=True appends the generation prompt, i.e. <|im_start|>assistant\n
    text = tokenizer_.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs_ = tokenizer_([text], return_tensors="pt").to(device)

    # The tokenizer output already contains an attention mask for this single, unpadded sequence
    attention_mask_ = model_inputs_.attention_mask
    return model_inputs_, attention_mask_


if __name__ == '__main__':

    questions, pdf_content = read_data(query_data_path=r"E:\localDatasets\汽车问答系统\questions.json",
                                       knowledge_data_path=r'E:\localDatasets\汽车问答系统\初赛训练数据集.pdf')

    # Lexical (BM25) retrieval index over jieba-tokenized chunks
    pdf_content_words = [jieba.lcut(x['content']) for x in pdf_content]
    bm25 = BM25Okapi(pdf_content_words)

    # Semantic retrieval index: sentence embeddings
    sent_model = SentenceTransformer(
            r'E:\PyCharm\PreTrainModel\stella_base_zh_v3_1792d'
            # '/mnt/e/PyCharm/PreTrainModel/stella_base_zh_v3_1792d',
            # '/mnt/e/PyCharm/PreTrainModel/moka_aim3e_small',
    )
    question_sentences = [x['question'] for x in questions]
    pdf_content_sentences = [x['content'] for x in pdf_content]

    question_embeddings = sent_model.encode(question_sentences, normalize_embeddings=True)
    pdf_embeddings = sent_model.encode(pdf_content_sentences, normalize_embeddings=True)

    for query_idx in range(len(questions)):
        # First: BM25 lexical retrieval, keep the 10 highest-scoring chunks
        doc_scores = bm25.get_scores(jieba.lcut(questions[query_idx]["question"]))
        bm25_score_page_idxs = doc_scores.argsort()[-10:]

        # Then: semantic retrieval, keep the 10 highest-scoring chunks
        score = question_embeddings[query_idx] @ pdf_embeddings.T
        ste_score_page_idxs = score.argsort()[-10:]
        # questions[query_idx]['reference'] = 'page_' + str(max_score_page_idx)
        # questions[query_idx]['reference'] = pdf_content[max_score_page_idxs]['page']

        bm25_score, bm25_index = get_rank_index(bm25_score_page_idxs, questions, pdf_content)
        ste_score, ste_index = get_rank_index(ste_score_page_idxs, questions, pdf_content)

        max_score_page_idx = 0
        if ste_score >= bm25_score:
            questions[query_idx]['reference'] = 'page_' + str(ste_index + 1)
            max_score_page_idx = ste_index
        else:
            questions[query_idx]['reference'] = 'page_' + str(bm25_index + 1)
            max_score_page_idx = bm25_index

        model_inputs, attention_mask = qwen_preprocess(
                tokenizer, pdf_content[max_score_page_idx]['content'], questions[query_idx]["question"]
        )

        generated_ids = model.generate(
                model_inputs.input_ids,
                max_new_tokens=128,  # maximum number of new tokens to generate
                attention_mask=attention_mask,
                pad_token_id=tokenizer.eos_token_id
        )
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

        # print(response)
        # answer = ask_glm(pdf_content[max_score_page_idx]['content'], questions[query_idx]["question"])
        print(f'question: {questions[query_idx]["question"]}, answer: {response}')



I tested three models: qwen2-1.5b, qwen-7b-chat, and qwen2-7b-instruct-awq-int4. qwen2-1.5b still shows the problem of generation not stopping, while qwen2-7b-instruct-awq-int4 gives noticeably better RAG results than qwen-7b-chat, which indirectly confirms the clear capability improvement in Qwen2.

qwen2-1.5b: [screenshot of the model's RAG output]

qwen-7b-chat: [screenshot of the model's RAG output]

qwen2-7b-awq-int4: [screenshot of the model's RAG output]
