Pretraining is the process of using self-supervised learning on large-scale data to obtain a model that is not tied to any specific downstream task; its final output is a pretrained model (Pretrained Model).
Pretrained models mainly fall into three categories:

| Model type | Common pretrained models | Typical tasks |
|---|---|---|
| Encoder (autoencoding) models | ALBERT, BERT, DistilBERT, RoBERTa | Text classification, named entity recognition, reading comprehension |
| Decoder (autoregressive) models | GPT, GPT-2, Bloom, LLaMA | Text generation |
| Encoder-decoder (sequence-to-sequence) models | BART, T5, Marian, mBART | Text summarization, machine translation |
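Each family maps onto a different `AutoModelFor*` class in Transformers. The sketch below is only an orientation (the checkpoint names are illustrative, not taken from this tutorial); the two full walkthroughs that follow then cover masked language modeling and causal language modeling end to end.

```python
from transformers import (AutoModelForMaskedLM,   # encoder / autoencoding family
                          AutoModelForCausalLM,   # decoder / autoregressive family
                          AutoModelForSeq2SeqLM)  # encoder-decoder / seq2seq family

# Illustrative checkpoints -- substitute any model from the table above.
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
clm_model = AutoModelForCausalLM.from_pretrained("gpt2")
s2s_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```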
```python
# Step 1: imports
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          TrainingArguments, Trainer)
from torch.utils.data import DataLoader

# Step 2: load the dataset (data_path / model_path are placeholders for your own paths)
ds = Dataset.load_from_disk(data_path)

# Step 3: preprocess the dataset
tokenizer = AutoTokenizer.from_pretrained(model_path)

def process_func(examples):
    return tokenizer(examples["completion"], max_length=384, truncation=True)

tokenized_ds = ds.map(process_func, batched=True, remove_columns=ds.column_names)

# Inspect one dynamically masked batch
dl = DataLoader(tokenized_ds, batch_size=2,
                collate_fn=DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15))
print(next(enumerate(dl)))

# Step 4: create the model
model = AutoModelForMaskedLM.from_pretrained(model_path)

# Step 5: configure training arguments
args = TrainingArguments(
    output_dir="./masked_lm",
    per_device_train_batch_size=32,
    logging_steps=10,
    num_train_epochs=1
)

# Step 6: create the trainer
trainer = Trainer(
    args=args,
    model=model,
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
)

# Step 7: train
trainer.train()

# Step 8: inference
from transformers import pipeline

pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=0)
print(pipe("西安交通[MASK][MASK]博物馆(Xi'an Jiaotong University Museum)是一座位于西安交通大学的博物馆"))
print(pipe("下面是一则[MASK][MASK]新闻。小编报道,近日,游戏产业发展的非常好!"))
```
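To see what `DataCollatorForLanguageModeling(..., mlm=True)` actually produces, here is a small standalone sketch (the `bert-base-chinese` checkpoint is only an illustration; any BERT-style tokenizer works). Roughly 15% of the tokens are selected and replaced (mostly with `[MASK]`); the labels keep the original ids at those positions and are -100 everywhere else, so the loss is computed only on the masked tokens.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("bert-base-chinese")  # illustrative checkpoint
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)

batch = collator([tok("西安交通大学博物馆是一座位于西安交通大学的博物馆")])
print(tok.decode(batch["input_ids"][0]))  # ~15% of tokens replaced, mostly with [MASK]
print(batch["labels"][0])                 # original ids at masked positions, -100 elsewhere
```

The second walkthrough below repeats the same eight steps for causal language model pretraining with `AutoModelForCausalLM`.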
```python
# Step 1: imports
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling,
                          TrainingArguments, Trainer)
from torch.utils.data import DataLoader

# Step 2: load the dataset (data_path / model_path are placeholders for your own paths)
ds = Dataset.load_from_disk(data_path)

# Step 3: preprocess the dataset -- append an EOS token to every document
tokenizer = AutoTokenizer.from_pretrained(model_path)

def process_func(examples):
    contents = [e + tokenizer.eos_token for e in examples["completion"]]
    return tokenizer(contents, max_length=384, truncation=True)

tokenized_ds = ds.map(process_func, batched=True, remove_columns=ds.column_names)

# Inspect one batch; with mlm=False the collator prepares labels for causal LM training
dl = DataLoader(tokenized_ds, batch_size=2,
                collate_fn=DataCollatorForLanguageModeling(tokenizer, mlm=False))
print(next(enumerate(dl)))
print(tokenizer.pad_token, tokenizer.pad_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)

# Step 4: create the model
model = AutoModelForCausalLM.from_pretrained(model_path)

# Step 5: configure training arguments
args = TrainingArguments(
    output_dir="./causal_lm",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=1
)

# Step 6: create the trainer
trainer = Trainer(
    args=args,
    model=model,
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# Step 7: train
trainer.train()

# Step 8: inference
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
print(pipe("西安交通大学博物馆(Xi'an Jiaotong University Museum)是一座位于西安", max_length=128, do_sample=True))
print(pipe("下面是一则游戏新闻。小编报道,近日,游戏产业发展的非常", max_length=128, do_sample=True))
```
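A quick sanity check on the collator with `mlm=False`: it neither masks nor shifts anything, it simply copies `input_ids` into `labels` (padding positions, if any, are set to -100). The checkpoint name below is only illustrative; where the one-position shift actually happens is addressed in the Q&A that follows.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # illustrative checkpoint
collator = DataCollatorForLanguageModeling(tok, mlm=False)

batch = collator([tok("小编报道,近日,游戏产业发展的非常好!" + tok.eos_token)])
# labels is an unshifted copy of input_ids (padding would become -100);
# the next-token shift happens later, inside the model's forward pass.
print((batch["labels"] == batch["input_ids"]).all())
```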
ps: additional note
Q: Where is the shifted (offset-by-one) loss of a causal language model actually computed?
A: It is implemented in the model's forward function.
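Concretely, `*ForCausalLM.forward` (e.g. in GPT-2 or Bloom) drops the last logit, drops the first label, and computes cross-entropy between the two, so the logits at position i predict the token at position i+1. A simplified sketch of that logic:

```python
import torch
from torch.nn import CrossEntropyLoss

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Simplified version of the shifted loss computed inside *ForCausalLM.forward."""
    shift_logits = logits[..., :-1, :].contiguous()  # predictions for positions 0..n-2
    shift_labels = labels[..., 1:].contiguous()      # targets are the next tokens 1..n-1
    loss_fct = CrossEntropyLoss()                    # ignore_index defaults to -100
    return loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```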
Reference video: https://www.bilibili.com/video/BV1B44y1c7x2/?spm_id_from=333.788&vd_source=ff7eedd3479d8f6f369e631ec961cc05