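ChatGLM2-6B is the second-generation version of the open-source Chinese-English bilingual dialogue model ChatGLM-6B. It keeps the strengths of the first-generation model, such as fluent dialogue and a low deployment threshold, and introduces the following new features: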
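Stronger performance = hybrid objective function + 1.4T Chinese and English tokens: Building on the experience of developing the first-generation ChatGLM model, we fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of GLM and has gone through pre-training on 1.4T Chinese and English tokens followed by human preference alignment training. Evaluation results show that, compared with the first-generation model, ChatGLM2-6B improves substantially on MMLU (+23%), CEval (+33%), GSM8K (+571%), and BBH (+60%), making it competitive among open-source models of the same size.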
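Longer context = Flash Attention + context length extended to 32K + 8K training + multi-turn dialogue: Based on Flash Attention, we extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during the dialogue stage, which allows more rounds of dialogue. However, the current version of ChatGLM2-6B has limited ability to understand a single very long document in one turn, and we will focus on optimizing this in later iterations.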
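More efficient inference = Multi-Query Attention + INT4 quantization: Based on Multi-Query Attention, ChatGLM2-6B has faster inference and lower GPU memory usage. With the official model implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6G of GPU memory increases from 1K to 8K.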
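More open license: The ChatGLM2-6B weights are fully open for academic research, and commercial use is also allowed after obtaining official written permission. If you find our open-source model useful for your business, we welcome donations toward the development of the next-generation model, ChatGLM3.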
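Recently I tried using the ChatGLM2-6b large model to solve a binary text classification task. I ran into several points worth noting while fine-tuning and using it, and this post gives a more detailed summary of that experience.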
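My dataset contains fields such as title, author, and abstract. I first read the data in CSV format and then convert it into a format the model can process: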
import json

import pandas as pd

train_df = pd.read_csv('./csv_data/train.csv')
test_df = pd.read_csv('./csv_data/test.csv')

# Build the dataset
res = []
for i in range(len(train_df)):
    paper_item = train_df.loc[i]
    tmp = {
        "instruction": "Please judge...",
        # Positional access: column 1 is the title, column 3 the abstract, column 5 the label
        "input": f"title:{paper_item.iloc[1]},abstract:{paper_item.iloc[3]}",
        "output": str(paper_item.iloc[5])
    }
    res.append(tmp)

with open('paper_label.json', mode='w', encoding='utf-8') as f:
    json.dump(res, f, ensure_ascii=False, indent=4)
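Also, when storing Chinese text as JSON, set ensure_ascii=False. By default, json.dump escapes every non-ASCII character as a \uXXXX sequence, which is still valid JSON but leaves the Chinese text in the file unreadable.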
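To make the effect of ensure_ascii concrete, here is a minimal sketch using only the standard library. The sample record is hypothetical and is not taken from my dataset.

```python
import json

# Hypothetical sample record, used only to show the two output styles.
record = {"instruction": "判断论文类别", "output": "1"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as written

print(escaped)
print(readable)

# Both forms decode back to the same Python object.
assert json.loads(escaped) == json.loads(readable) == record
```

Either file loads correctly, so the setting mainly matters when you want to inspect the generated JSON by eye.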
Fine-tuning the large model
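Using ChatGLM's fine-tuning script, I fine-tuned the ChatGLM2-6b model on text containing the titles and abstracts. The points to watch out for here are:
Fine-tuning consumes a lot of GPU compute; prepare a high-end GPU with at least 24G of memory.
Fine-tuning requires the correct model path to be specified, otherwise it fails with an error.
If you run out of memory, reduce the batch size.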
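As a rough sanity check on the memory requirement above, the space needed for the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch: the figure of 6 billion parameters is an assumption read off the "6B" in the model name, and the estimate ignores activations, gradients, and optimizer state, which all add to the total during fine-tuning.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

NUM_PARAMS = 6e9  # assumption: roughly 6 billion parameters

fp16_gib = weight_memory_gib(NUM_PARAMS, 2)    # 16-bit floats, as with .half()
int4_gib = weight_memory_gib(NUM_PARAMS, 0.5)  # 4-bit quantized weights

print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"INT4 weights: {int4_gib:.1f} GiB")
```

Under this assumption the half-precision weights already take up a large share of a 24G card before any training state is allocated, which is why reducing the batch size is the first thing to try when memory runs out.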
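Fine-tuning a pretrained language model is a typical application of transfer learning: we want the model to learn a specific downstream task rather than train it from scratch. For fine-tuning I chose a technique called LoRA (Low-Rank Adaptation). Its basic idea is to freeze the pretrained weights and add a pair of small trainable low-rank matrices alongside selected weight matrices, so that only these added parameters are updated on the downstream dataset instead of the whole model. This adapts the pretrained model to the downstream task at a much lower training cost, and it can achieve good results in many NLP competitions.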
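The LoRA idea can be sketched in a few lines of plain Python. This is a toy illustration and not the Peft implementation: the sizes, the helper names matvec and lora_forward, and the values of r and alpha are all chosen for illustration. The layer output is h = W x + (alpha / r) * B (A x), where W stays frozen and only A and B would be trained.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, A and B are the trainable part."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4  # toy sizes, chosen for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero init

x = [1.0] * d_in
h = lora_forward(W, A, B, x, alpha, r)

# Trainable parameters: full matrix versus the low-rank pair.
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(full_params, lora_params)

# The same count for a hypothetical 4096 x 4096 matrix with r = 8.
big_full = 4096 * 4096
big_lora = 8 * (4096 + 4096)
print(big_full, big_lora)
```

Because B starts at zero, the adapted layer initially produces exactly the output of the frozen layer, and training moves it away from that starting point only through A and B.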
Load the fine-tuned LoRA weights with Peft and build a prediction function. The code is as follows:
from peft import PeftModel
from transformers import AutoTokenizer, AutoModel
model_path = "chatglm2-6b"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load the LoRA weights
model = PeftModel.from_pretrained(model, 'huanhuan-chat/output/label_xfg').half()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
response
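When loading the fine-tuned language model for prediction, remember to switch the model to eval mode. This is a PyTorch detail: eval mode disables Dropout and makes BatchNorm layers use their stored running statistics, which makes predictions more stable. In addition, to get near-deterministic output you can set temperature=0.01, which pushes sampling very close to taking the argmax at each step.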
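Why a small temperature behaves almost like argmax can be seen with a small standalone sketch. The logits below are hypothetical scores for three candidate tokens, and softmax_with_temperature is an illustrative helper, not a function of the model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

soft = softmax_with_temperature(logits, 1.0)    # ordinary softmax
sharp = softmax_with_temperature(logits, 0.01)  # the setting used in this post

print([round(p, 4) for p in soft])
print([round(p, 4) for p in sharp])
```

At temperature 1.0 the probability mass is spread over all three candidates, while at 0.01 nearly all of it sits on the highest-scoring one, so sampling picks that token almost every time.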
# Prediction function
def predict(text):
    response, history = model.chat(
        tokenizer,
        f"Please judge whether it is a medical field paper according to the given paper title and abstract, output 1 or 0, the following is the paper title, author and abstract -->{text}",
        history=[],
        temperature=0.01,
    )
    return response

# Try the function on a single example
predict(
    'title:Seizure Detection and Prediction by Parallel Memristive Convolutional Neural Networks,'
    'author:Li, Chenqi; Lammie, Corey; Dong, Xuening; Amirsoleimani, Amirali; Azghadi, Mostafa Rahimi; Genov, Roman,'
    'abstract:During the past two decades, epileptic seizure detection and prediction algorithms have evolved rapidly. '
    'However, despite significant performance improvements, their hardware implementation using conventional technologies, such as Complementary Metal-Oxide-Semiconductor (CMOS), in power and areaconstrained settings remains a challenging task; especially when many recording channels are used. '
    'In this paper, we propose a novel low-latency parallel Convolutional Neural Network (CNN) architecture that has between 2-2,800x fewer network parameters compared to State-Of-The-Art (SOTA) CNN architectures and achieves 5-fold cross validation accuracy of 99.84% for epileptic seizure detection, and 99.01% and 97.54% for epileptic seizure prediction, when evaluated using the University of Bonn Electroencephalogram (EEG), CHB-MIT and SWEC-ETHZ seizure datasets, respectively. '
    'We subsequently implement our network onto analog crossbar arrays comprising Resistive Random-Access Memory (RRAM) devices, and provide a comprehensive benchmark by simulating, laying out, and determining hardware requirements of theCNNcomponent of our system. '
    'We parallelize the execution of convolution layer kernels on separate analog crossbars to enable 2 orders of magnitude reduction in latency compared to SOTA hybrid Memristive-CMOS Deep Learning (DL) accelerators. '
    'Furthermore, we investigate the effects of non-idealities on our system and investigate Quantization Aware Training (QAT) to mitigate the performance degradation due to lowAnalog-to-Digital Converter (ADC)/Digital-to-Analog Converter (DAC) resolution. '
    'Finally, we propose a stuck weight offsetting methodology to mitigate performance degradation due to stuck RON/ROFF memristor weights, recovering up to 32% accuracy, without requiring retraining. '
    'The CNN component of our platform is estimated to consume approximately 2.791Wof power while occupying an area of 31.255 mm(2) in a 22 nm FDSOI CMOS process.'
)

# Predict on the test set
from tqdm import tqdm

label = []
for i in tqdm(range(len(test_df))):
    test_item = test_df.loc[i]
    # Positional access: columns 1, 2, 3 hold the title, author, and abstract
    test_input = f"title:{test_item.iloc[1]},author:{test_item.iloc[2]},abstract:{test_item.iloc[3]}"
    label.append(int(predict(test_input)))

test_df['label'] = label
submit = test_df[['uuid', 'Keywords', 'label']]
submit.to_csv('submit.csv', index=False)
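One fragile spot in the loop above is int(predict(test_input)), which raises a ValueError as soon as the model answers with anything other than a bare number. A more defensive variant could look like the sketch below. The helper parse_label is hypothetical and was not part of my original run, and what to do when it returns None (retry, default label, manual review) is left to the caller.

```python
import re
from typing import Optional

def parse_label(response: str) -> Optional[int]:
    """Return the first standalone 0 or 1 found in the response, or None if there is none."""
    # The lookarounds reject digits that are part of a longer number such as "10" or "2023".
    match = re.search(r"(?<!\d)[01](?!\d)", response)
    return int(match.group()) if match else None

print(parse_label("1"))
print(parse_label("The answer is 0."))
print(parse_label("I am not sure"))
```

In the prediction loop this would replace the direct int(...) call, with an explicit branch for the None case.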