
Efficiently Fine-Tuning a Large AI Model with PEFT and RLHF on Amazon Web Services (AWS) to Reduce Harmful Responses

Introduction:

小李哥 continues to introduce one cutting-edge AI solution built on the Amazon Web Services (AWS) cloud platform every day, helping you quickly learn the AI best practices of AWS, the world's most popular cloud platform, and apply them to your own daily work.

In this post I will show how to use Amazon SageMaker, AWS's AI model training service, together with the PEFT and RLHF frameworks to efficiently fine-tune the FLAN-T5-BASE large language model and reduce potentially harmful content in its replies. I will walk you through the fine-tuning code line by line, so you can pick up these core AI skills from zero. The solution also includes a front end and back end for user interaction, built entirely on a cloud-native serverless stack, providing a scalable and secure AI application. The architecture diagram is shown below.

Background Knowledge

Parameter-efficient fine-tuning (PEFT) and reinforcement learning from human feedback (RLHF) are both common methods for fine-tuning large AI models. PEFT improves the efficiency and resource utilization of fine-tuning by selectively updating only a subset of the model's parameters, while RLHF improves the model's behavior by incorporating human feedback, reducing bias and harmful output. The two methods have complementary strengths and can be combined to fine-tune models more efficiently and reliably for different application scenarios.

Parameter-Efficient Fine-Tuning (PEFT)

Parameter-Efficient Fine-Tuning (PEFT) is an approach that updates only a subset of a pretrained model's parameters during fine-tuning. Compared with full-parameter fine-tuning, PEFT focuses on a small fraction of the model's weights, which reduces compute consumption and makes fine-tuning more efficient. By selectively updating parameters, PEFT preserves the model's overall structure and most of its pretrained knowledge, so it maintains strong performance while significantly cutting training time and resource usage. This makes it especially suitable for resource-constrained environments, while still delivering excellent results on large models.
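To make the efficiency gain concrete, here is a minimal back-of-the-envelope sketch (not from the lab itself; the hidden size is an illustrative assumption) comparing the trainable parameters of a LoRA adapter, the PEFT technique used later in this post, against fully fine-tuning a single weight matrix:

# Back-of-the-envelope comparison: LoRA adapter vs. full fine-tuning of one weight matrix.
# d_model is an assumed hidden size; rank matches the r=32 LoRA config used later in this post.
d_model = 768
rank = 32

full_finetune_params = d_model * d_model            # update every entry of the d x d matrix W
lora_params = (d_model * rank) + (rank * d_model)   # low-rank factors B (d x r) and A (r x d)

print(f"full fine-tuning: {full_finetune_params:,} trainable parameters")
print(f"LoRA (r={rank}):    {lora_params:,} trainable parameters")
print(f"LoRA / full:      {100 * lora_params / full_finetune_params:.2f}%")

With these assumed dimensions, the adapter trains under 10% of the parameters of that matrix, which is why PEFT fits comfortably in a single-notebook environment.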

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) optimizes a machine learning model by incorporating human feedback. RLHF introduces human ratings and feedback during training to guide the model's adjustment and optimization. Concretely, RLHF uses human-labeled reward signals to update the model's policy so that its behavior and output better match expectations. This can significantly improve model quality; in particular, for problems involving ethics and bias, RLHF can use human feedback to reduce harmful suggestions and undesirable behavior, improving the model's safety and reliability. RLHF has proved effective in applications such as dialogue systems and recommender systems, where it improves both user experience and practical value.
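A useful way to think about the PPO objective used later in this post is "reward-model score minus a KL penalty that keeps the tuned model close to the original reference model". The toy numbers below are assumptions purely for illustration, not output from any model:

# Toy illustration of the KL-penalized reward that PPO-based RLHF optimizes.
# All numbers are made up; the real values come from the reward model and the
# PPO trainer built in the steps below.
reward_from_reward_model = 2.7   # e.g. the "not hate" logit of a generated response (assumed)
kl_divergence = 0.8              # how far the tuned policy has drifted from the reference model (assumed)
kl_coefficient = 0.2             # strength of the KL penalty (assumed)

ppo_objective = reward_from_reward_model - kl_coefficient * kl_divergence
print(f"KL-penalized reward: {ppo_objective:.2f}")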

What this solution covers:

  • Fine-tune a foundation model with Amazon SageMaker Studio.

  • Use parameter-efficient fine-tuning (PEFT) to train only a subset of parameters, improving efficiency and lowering cost.

  • Use reinforcement learning from human feedback (RLHF) to optimize the large language model.

  • Analyze how the fine-tuning reduces harmful replies.

  • Upload the fine-tuning results to an Amazon DynamoDB table.

Step-by-Step Walkthrough:

1. First, open the AWS console, go to the Amazon SageMaker service page, and enter SageMaker Studio.

2. Next, create a new Jupyter notebook to start the fine-tuning. First, install the required dependencies:

%%capture
%pip install torch==2.0.1 torchdata
%pip install transformers==4.28.1
%pip install datasets==2.17.0
%pip install accelerate==0.16.0
%pip install evaluate==0.4.0
%pip install trl==0.7.1
%pip install rouge_score==0.1.2
%pip install loralib==0.1.1
%pip install peft==0.3.0
%pip install -q awswrangler

3. Import the required dependencies in the notebook.

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd
import peft

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

4. Next, load the dataset we will use, "knkarthick/dialogsum", and the large language model "google/flan-t5-base".

from datasets import load_dataset

model_name="google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)
dataset_original

5. Next, preprocess the dataset. We define a build_dataset function that keeps dialogues of a suitable length (200-1000 characters), initializes the tokenizer, wraps each dialogue in an instruction prompt and decodes it back into the standard format expected by the PPO library, and splits the data into training and test sets at an 80/20 ratio.

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig

def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):

    # Load dataset (the "train" part only is enough for this lab).
    dataset = load_dataset(dataset_name, split="train")

    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare the tokenizer. Setting device_map="auto" allows to switch between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto", force_download=True)

    def tokenize(sample):
        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.
{sample["dialogue"]}
Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)

6. Define a helper function to inspect the model's parameter counts.

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

7. Next, attach a LoRA adapter to the large language model. Combined with the PEFT method, this preserves model performance while dramatically reducing the number of parameters to fine-tune and the compute required.

lora_config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM  # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model,
                                       './peft-dialogue-summary-checkpoint-from-s3/',
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16,
                                       device_map="auto",
                                       is_trainable=True)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')

8. Load and create the PPO model on top of the PEFT model, which will be used for the reinforcement learning that follows.

ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

ref_model = create_reference_model(ppo_model)

9. Next, create the reward model, Meta's RoBERTa-based hate-speech classifier, which will be used to score harmful content in the dialogue data.

toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

10. Feed in two example sentences to check whether they count as hate speech and print their hate scores. The higher the "hate" logit, the more likely the text is hate speech.

non_toxic_text = "You are a great person and i like you."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (value of "not hate" logit): {nothate_reward}')

toxic_text = "You are a terrible person and i hate you."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (value of "not hate" logit): {nothate_reward}')

11. Create a pipeline around the reward model so we can score the large model's generated content directly.

device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis",
                          model=toxicity_model_name,
                          device=device)

reward_logits_kwargs = {
    "top_k": None,                # Return all scores.
    "function_to_apply": "none",  # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None,                   # Return all scores.
    "function_to_apply": "softmax",  # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output for non-toxic text:")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("\nReward model output for toxic text:")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

12. Now define an evaluator for generated content to measure how harmful it is. The evaluate_toxicity function reports toxicity as a score between 0 and 1, where 1 is the most toxic, and we use it to get the model's toxicity score before fine-tuning.

toxicity_evaluator = evaluate.load("toxicity",
                                   toxicity_model_name,
                                   module_type="measurement",
                                   toxic_label="hate")

toxicity_score = toxicity_evaluator.compute(predictions=[
    non_toxic_text
])

print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[
    toxic_text
])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):

    max_new_tokens = 100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        if i > num_samples:
            break

        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids

        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)

        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])
        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean and std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)

    return mean, std

tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model,
                                                                          toxicity_evaluator=toxicity_evaluator,
                                                                          tokenizer=tokenizer,
                                                                          dataset=dataset["test"],
                                                                          num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

13. Next, use PPO to apply reinforcement learning to the large language model and reduce harmful replies. First, configure the PPO model and its hyperparameters and create the PPO trainer that will perform the reinforcement fine-tuning; a small standalone example of what the data collator does follows the code block.

learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)
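To see what the collator does on its own, here is a tiny standalone example with made-up samples; it simply turns a list of per-sample dicts into one dict of lists, which is the batch layout the PPO trainer works with:

# Standalone illustration of the collator above, using made-up samples.
samples = [
    {"input_ids": [1, 2, 3], "query": "first prompt"},
    {"input_ids": [4, 5],    "query": "second prompt"},
]

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

print(collator(samples))
# {'input_ids': [[1, 2, 3], [4, 5]], 'query': ['first prompt', 'second prompt']}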

14. Next, configure the training parameters, generation parameters, and reward parameters, then start the PPO reinforcement training. The trainer maximizes the reward for the positive "nothate" class, pushing the fine-tuned model's responses toward a "nothate" probability close to 1.

output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None,                # Return all scores.
    "function_to_apply": "none",  # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # You use the "nothate" item because this is the score for the positive "nothate" class.
    reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]

    # Run the PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

15. Next, evaluate the fine-tuned model with the evaluate_toxicity function.

mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model,
                                                                        toxicity_evaluator=toxicity_evaluator,
                                                                        tokenizer=tokenizer,
                                                                        dataset=dataset["test"],
                                                                        num_samples=10)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

16. Then compare the model's toxicity score with the score from before training.

mean_improvement = (mean_after_detoxification - mean_before_detoxification) / mean_before_detoxification
std_improvement = (std_after_detoxification - std_before_detoxification) / std_before_detoxification

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

17. Collect the responses before and after fine-tuning, their positive ("nothate") reward scores, the reward difference, and the user queries, and store them as columns in a DataFrame.

batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from the PPO and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len

    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[not_hate_index]["score"] for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[not_hate_index]["score"] for reward in rewards_after]

pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

18. Finally, use the awswrangler library to process the data: add an index column to the before/after response results and write them to DynamoDB.

import awswrangler as wr

# Add an index column to the data frame to act as the partition key
df_compare_results['index'] = range(1, len(df_compare_results) + 1)

# Create a results dataframe, reorganized with DynamoDB table attributes
result = pd.DataFrame({
    "conversation_id": df_compare_results['index'],
    "query": df_compare_results['query'],
    "response_before": df_compare_results['response_before'],
    "response_after": df_compare_results['response_after']
})

# Upload result to DDB
wr.dynamodb.put_df(df=result, table_name='llm_with_rlhf')
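Note that wr.dynamodb.put_df writes into an existing table, so llm_with_rlhf must already exist with a key attribute that matches one of the DataFrame columns. If you need to create the table yourself, a minimal boto3 sketch might look like the following (the key schema here is an assumption based on the conversation_id column above, not something defined in the lab):

# Hypothetical one-off table creation for this demo (assumed key schema).
import boto3

dynamodb_client = boto3.client("dynamodb")
dynamodb_client.create_table(
    TableName="llm_with_rlhf",
    AttributeDefinitions=[{"AttributeName": "conversation_id", "AttributeType": "N"}],
    KeySchema=[{"AttributeName": "conversation_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity, no provisioning needed for a demo
)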

19. Open the static web page hosted in S3 through CloudFront; the HTML page calls the backend through API Gateway, which reads from the database and displays the data on the page.
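As a rough sketch of what that backend could look like (hypothetical code, not part of the lab; only the table name llm_with_rlhf comes from the step above), a Lambda function behind API Gateway might scan the DynamoDB table and return the rows as JSON for the static page to render:

# Hypothetical Lambda handler behind API Gateway; only the table name comes from the lab.
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("llm_with_rlhf")

def lambda_handler(event, context):
    # A full scan is fine for this small demo table; use query/pagination for real workloads.
    response = table.scan()
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": "*",  # let the CloudFront-hosted page call the API
        },
        "body": json.dumps(response.get("Items", []), default=str),
    }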

Those are all the steps for fine-tuning a large model with SageMaker on AWS to reduce harmful content in its replies. Follow 小李哥 for more cutting-edge generative AI solutions from around the world!
