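Amazon SageMaker is a fully managed machine learning platform service from Amazon Web Services (AWS). Algorithm engineers and data scientists can use it to quickly build, train, and deploy machine learning (ML) models without managing or operating the underlying resources. As a toolset, it provides end-to-end components for machine learning, including data labeling, data processing, algorithm design, model training, training debugging, hyperparameter tuning, model deployment, and model monitoring, which makes machine learning simpler. It also draws on Amazon's underlying infrastructure to offer a range of compute resources such as high-performance CPUs, GPUs, and Elastic Inference accelerators, which makes model development and deployment more efficient. This article also relies on Hugging Face, a well-known open-source NLP community whose tooling integrates closely with Amazon SageMaker, so NLP models can be trained and deployed on Amazon SageMaker with a few lines of code.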



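In this example, we will use Amazon SageMaker to do the following:

  • Prepare the environment
  • Download the dataset and preprocess it
  • Train on the local machine
  • Train the model with Amazon SageMaker BYOS (Bring Your Own Script)
  • Deploy to a hosted endpoint and test inference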


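First, create an Amazon SageMaker Notebook instance. The ml.p3.2xlarge instance type is recommended, because this example includes a local training step that we use to test the code. Set the volume size to 10 GB or more, because the project needs to download some additional data.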



    cd ~/SageMaker
    git clone https://github.com/HaoranLv/nlp_transformer.git



1. Public datasets (English)

  • XSUM, 227k BBC articles
  • CNN/Dailymail, 93k articles from CNN and 220k articles from the Daily Mail
  • NEWSROOM, 3M article-summary pairs written by authors and editors in the newsrooms of 38 major publications
  • Multi-News, 56k pairs of news articles and their human-written summaries from the http://com
  • Gigaword, 4M examples extracted from news articles; the task is to generate the headline from the first sentence
  • arXiv, PubMed, two long-document datasets of scientific publications from http://org (113k) and PubMed (215k); the task is to generate the abstract from the paper body
  • BIGPATENT, 3 million U.S. patents along with human summaries under nine patent classification categories

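2. Public datasets (Chinese)

This article uses Multi-News as the example. The data has two columns: headlines holds the summary and text holds the full article. Because the text dataset is small, we simply download the original csv file from the official site and upload it to the SageMaker Notebook. Some sample rows from the dataset are shown below.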
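
To make the two-column layout concrete, here is a minimal sketch that parses one row in that shape using only the standard library. The row reuses the test example that appears later in this article; the in-memory CSV and the `latin-1` round trip are illustrative assumptions, not the actual file.

```python
import csv
import io

# One illustrative row in the "headlines,text" layout. The strings are copied
# from the test example later in this article; the file itself is not read here.
raw = (
    "headlines,text\n"
    '"Germany accuses Vietnam of kidnapping asylum seeker",'
    '"Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese '
    "oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, "
    'and taking him home to face accusations of corruption."\n'
).encode("latin-1")  # the preprocessing code reads the csv as latin-1

# Decode with the same encoding, then parse each line into a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(raw.decode("latin-1"))))

print(list(rows[0].keys()))
print(rows[0]["headlines"])
```

The quoted `text` field keeps its internal commas, which is why a CSV parser is used rather than a plain `split(",")`.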


Open hp_data.ipynb and run the code.




    import re
    import string

    import pandas as pd


    class Settings:
        TRAIN_DATA = "./data/hp/summary/news_summary_total.csv"
        Columns = ['headlines', 'text']
        encoding = 'latin-1'
        columns_dict = {"headlines": "headlines", "text": "text"}
        df_column_list = ['text', 'headlines']
        SOURCE_TEXT_KEY = 'text'
        TEST_SIZE = 0.2
        BATCH_SIZE = 16
        source_max_token_len = 128
        target_max_token_len = 50
        train_df_len = 82332
        test_df_len = 20583


    class Preprocess:
        def __init__(self):
            self.settings = Settings

        def clean_text(self, text):
            # Lowercase, then strip bracketed spans, URLs, HTML-like tags,
            # punctuation, newlines, and any token that contains a digit.
            text = text.lower()
            text = re.sub(r'\[.*?\]', '', text)
            text = re.sub(r'https?://\S+|www\.\S+', '', text)
            text = re.sub(r'<.*?>+', '', text)
            text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
            text = re.sub(r'\n', '', text)
            text = re.sub(r'\w*\d\w*', '', text)
            return text

        def preprocess_data(self, data_path):
            df = pd.read_csv(data_path, encoding=self.settings.encoding, usecols=self.settings.Columns)
            # columns_dict is an identity mapping here, so the column names stay "headlines" and "text"
            df = df.rename(columns=self.settings.columns_dict)
            # reorder the columns to ['text', 'headlines']
            df = df[self.settings.df_column_list]
            # No task prefix is added here (this assignment is a no-op); the
            # "summarize: " prefix is passed at training time via --source_prefix.
            df[self.settings.SOURCE_TEXT_KEY] = df[self.settings.SOURCE_TEXT_KEY]
            return df


    settings = Settings
    preprocess = Preprocess()
    df = preprocess.preprocess_data(settings.TRAIN_DATA)
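
The `clean_text` method is defined above but is not called by `preprocess_data`. As a minimal sketch of what it would do if applied, the standalone copy below runs the same regular expressions on a made-up sentence (the sample input is an assumption for illustration, not data from the dataset).

```python
import re
import string


def clean_text(text):
    """Standalone copy of the cleaning steps in Preprocess.clean_text."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)                # drop [bracketed] spans
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"<.*?>+", "", text)                 # drop HTML-like tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # drop punctuation
    text = re.sub(r"\n", "", text)                     # drop newlines
    text = re.sub(r"\w*\d\w*", "", text)               # drop tokens containing digits
    return text


# Made-up input that exercises every rule once.
sample = "Visit https://example.com [Ad] NOW! <b>Breaking</b> news 24x7.\n"
cleaned = clean_text(sample)

print(cleaned.split())  # ['visit', 'now', 'breaking', 'news']
```

Note that newlines are replaced with an empty string rather than a space, so words on adjacent lines would be joined; whether that is desirable depends on the data.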


    from sklearn.model_selection import train_test_split

    # save the preprocessed data, then reload it and fix the column order
    df.to_csv('./data/hp/summary/news_summary_cleaned.csv', index=False)
    df2 = pd.read_csv('./data/hp/summary/news_summary_cleaned.csv')
    order = ['text', 'headlines']
    df3 = df2[order]
    # 80/20 train/test split with a fixed seed for reproducibility
    train_df, test_df = train_test_split(df3, test_size=0.2, random_state=100)
    train_df.to_csv('./data/hp/summary/news_summary_cleaned_train.csv', index=False)
    test_df.to_csv('./data/hp/summary/news_summary_cleaned_test.csv', index=False)
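
The `train_df_len` and `test_df_len` values in `Settings` can be checked against the 0.2 test fraction with a little arithmetic. The sketch below assumes the test size is rounded up and the remainder goes to training, which is our understanding of how scikit-learn treats a fractional `test_size`; the total of 102,915 rows is derived by summing the two lengths and is not stated elsewhere in this article.

```python
import math


def split_sizes(n_rows, test_percent):
    """Return (n_train, n_test) for a percentage split, rounding the test size up."""
    # Integer percent avoids floating-point error in the multiplication.
    n_test = math.ceil(n_rows * test_percent / 100)
    return n_rows - n_test, n_test


# Derived total: train_df_len + test_df_len from Settings.
total_rows = 82332 + 20583

n_train, n_test = split_sizes(total_rows, 20)
print(n_train, n_test)  # 82332 20583
```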



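Once the data processing above is complete, model training can begin. Running the command below starts training: the code automatically loads google/pegasus-large from the Hugging Face Hub as the pretrained model and then fine-tunes it on our processed dataset.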

    !python -u examples/pytorch/summarization/run_summarization.py \
        --model_name_or_path google/pegasus-large \
        --do_train \
        --do_eval \
        --per_device_train_batch_size=2 \
        --per_device_eval_batch_size=1 \
        --save_strategy epoch \
        --evaluation_strategy epoch \
        --overwrite_output_dir \
        --predict_with_generate \
        --train_file './data/hp/summary/news_summary_cleaned_train.csv' \
        --validation_file './data/hp/summary/news_summary_cleaned_test.csv' \
        --text_column 'text' \
        --summary_column 'headlines' \
        --output_dir='./models/local_train/pegasus-hp' \
        --num_train_epochs=1.0 \
        --eval_steps=500 \
        --save_total_limit=3 \
        --source_prefix "summarize: " > train_pegasus.log



The script also evaluates the model on the validation set with an objective metric; ROUGE is used here.





    import pandas as pd
    from transformers import pipeline

    df = pd.read_csv('./data/hp/summary/news_summary_cleaned_small_test.csv')
    print('Source text:', df.loc[0, 'text'])
    print('Reference:', df.loc[0, 'headlines'])
    # load the checkpoint written under the --output_dir used for local training
    summarizer = pipeline("summarization", model="./models/local_train/pegasus-hp/checkpoint-500")
    print('Model prediction:', summarizer(df.loc[0, 'text'], max_length=50)[0]['summary_text'])


    Source text: Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, and taking him home to face accusations of corruption. Germany expelled a Vietnamese intelligence officer over the suspected kidnapping and demanded that Vietnam allow Thanh to return to Germany. However, Vietnam said Thanh had returned home by himself.
    Reference: Germany accuses Vietnam of kidnapping asylum seeker
    Model prediction: Germany accuses Vietnam of kidnapping ex-oil exec, taking him home
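
ROUGE, the metric used for evaluation, measures n-gram overlap between a generated summary and its reference. The sketch below is a deliberately simplified ROUGE-1 (unigram overlap, lowercase alphanumeric tokens, no stemming) applied to the reference and prediction from the example above. It is an illustration of the idea under those assumptions, not the implementation the training script uses, so its numbers will not necessarily match the script's reported scores.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(reference, prediction):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref_counts = Counter(tokenize(reference))
    pred_counts = Counter(tokenize(prediction))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "Germany accuses Vietnam of kidnapping asylum seeker"
prediction = "Germany accuses Vietnam of kidnapping ex-oil exec, taking him home"

precision, recall, f1 = rouge1(reference, prediction)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.455 0.714 0.556
```

Five of the seven reference tokens appear among the eleven predicted tokens, which gives a recall of 5/7 and a precision of 5/11.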



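Model training with Amazon SageMaker BYOS

In the example above, we trained a fairly small model step by step in the local environment to validate our code. Now we need to organize that code and run it on Amazon SageMaker as a managed training job that can scale out to distributed training.

First, we consolidate the training code above into a single Python script and then use the preconfigured Hugging Face container on SageMaker. The container can be used in many flexible ways; for details, see the Hugging Face Estimator documentation.

Because the prebuilt Hugging Face container on SageMaker already includes the inference logic, we only need to bring the training script from the previous step into the container. The process is as follows.

Start a Jupyter Notebook, select python3 as the kernel, and complete the following steps: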


    import sagemaker
    import os

    sess = sagemaker.Session()
    role = sagemaker.get_execution_role()
    print(f"sagemaker role arn: {role}")
    print(f"sagemaker bucket: {sess.default_bucket()}")
    print(f"sagemaker session region: {sess.boto_region_name}")

Upload the data to S3

    # dataset used
    dataset_name = 'news_summary'
    # s3 key prefix for the data
    s3_prefix = 'datasets/news_summary'
    # upload the folder that holds the train/test csv files, so that
    # data_location + '/<file name>' resolves to the uploaded objects
    WORK_DIRECTORY = './data/hp/summary/'
    data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=s3_prefix)
    data_location


Define the hyperparameters and initialize the estimator.

    from sagemaker.huggingface import HuggingFace

    # hyperparameters which are passed to the training job
    hyperparameters = {
        'text_column': 'text',
        'summary_column': 'headlines',
        'train_file': '/opt/ml/input/data/train/news_summary_cleaned_train.csv',
        'validation_file': '/opt/ml/input/data/test/news_summary_cleaned_test.csv',
        'output_dir': '/opt/ml/model',
        'do_train': True,
        'do_eval': True,
        'max_source_length': 128,
        'max_target_length': 128,
        'model_name_or_path': 't5-large',
        'learning_rate': 3e-4,
        'num_train_epochs': 1,
        'per_device_train_batch_size': 2,  # 16
        'gradient_accumulation_steps': 2,
        'save_strategy': 'epoch',
        'evaluation_strategy': 'epoch',
        'save_total_limit': 1,
    }

    # SageMaker distributed data parallel config (left disabled below)
    distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

    # create the Estimator
    huggingface_estimator = HuggingFace(
        entry_point='run_paraphrase.py',
        source_dir='./scripts',
        instance_type='ml.p3.2xlarge',  # 'ml.p3dn.24xlarge'
        instance_count=1,
        role=role,
        max_run=24 * 60 * 60,
        transformers_version='4.6',
        pytorch_version='1.7',
        py_version='py36',
        volume_size=128,
        hyperparameters=hyperparameters,
        # distribution=distribution
    )
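
In SageMaker script mode, the `hyperparameters` dictionary reaches the entry-point script as command-line arguments, roughly in the form `--key value`. The sketch below illustrates that hand-off with `argparse` on a subset of the keys above. The `to_cli_args` and `str2bool` helpers are hypothetical names introduced for illustration, the exact string formatting SageMaker uses may differ, and the actual `run_paraphrase.py` may parse its arguments differently (for example with the Hugging Face argument parser).

```python
import argparse


def to_cli_args(hyperparameters):
    """Hypothetical helper: flatten a dict into ['--key', 'value', ...]."""
    args = []
    for key, value in hyperparameters.items():
        args += [f"--{key}", str(value)]
    return args


def str2bool(value):
    """Accept 'True'/'true'/'1' style strings for boolean flags."""
    return str(value).lower() in ("true", "1", "yes")


parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path", type=str)
parser.add_argument("--learning_rate", type=float)
parser.add_argument("--per_device_train_batch_size", type=int)
parser.add_argument("--gradient_accumulation_steps", type=int)
parser.add_argument("--do_train", type=str2bool, default=False)

# A subset of the hyperparameters defined above.
subset = {
    "model_name_or_path": "t5-large",
    "learning_rate": 3e-4,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "do_train": True,
}

# parse_known_args ignores any extra arguments the script does not declare.
args, _unknown = parser.parse_known_args(to_cli_args(subset))

# ml.p3.2xlarge has a single GPU and instance_count is 1.
n_gpus = 1
effective_batch = (
    args.per_device_train_batch_size * args.gradient_accumulation_steps * n_gpus
)
print(args.model_name_or_path, args.learning_rate, effective_batch)  # t5-large 0.0003 4
```

With a per-device batch size of 2 and 2 gradient accumulation steps on one GPU, each optimizer step sees 4 examples.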


    # each key is a channel name; the job reads the data from /opt/ml/input/data/<channel>
    huggingface_estimator.fit({
        'train': data_location + '/news_summary_cleaned_train.csv',
        'test': data_location + '/news_summary_cleaned_test.csv',
    })
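
The keys passed to `fit` are channel names, and SageMaker makes each channel's data available inside the training container under `/opt/ml/input/data/<channel>`. That is why the `train_file` and `validation_file` hyperparameters point at those directories. The sketch below shows the mapping, assuming each channel points at a single S3 object that lands under its own file name; `container_path` is a hypothetical helper and the bucket name is a placeholder.

```python
import posixpath


def container_path(channel, s3_uri):
    """Hypothetical helper: where a single-object channel would appear in the container."""
    return posixpath.join("/opt/ml/input/data", channel, posixpath.basename(s3_uri))


# Placeholder bucket; in the notebook this prefix is the value of data_location.
data_location = "s3://example-bucket/datasets/news_summary"

inputs = {
    "train": data_location + "/news_summary_cleaned_train.csv",
    "test": data_location + "/news_summary_cleaned_test.csv",
}

for channel, uri in inputs.items():
    print(channel, "->", container_path(channel, uri))
```

If the printed paths do not match the `train_file` and `validation_file` hyperparameters exactly (including stray spaces), the training script will not find its data.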


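After training starts, the training job appears in the Amazon SageMaker console. Its detail page shows the training log output and lets us monitor GPU, CPU, and memory utilization to confirm that the program is working correctly. After training completes, the training logs can also be viewed in CloudWatch.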




    from sagemaker.huggingface.model import HuggingFaceModel

    # create Hugging Face Model Class
    huggingface_model = HuggingFaceModel(
        # env={'HF_TASK': 'text-generation'},
        model_data="s3://sagemaker-us-west-2-847380964353/huggingface-pytorch-training-2022-04-19-05-56-07-474/output/model.tar.gz",  # path to your trained SageMaker model
        role=role,                   # IAM role with permissions to create an endpoint
        transformers_version="4.6",  # Transformers version used
        pytorch_version="1.7",       # PyTorch version used
        py_version='py36',           # Python version used
    )

    # deploy the model to a real-time endpoint
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge"
    )



    import time

    import pandas as pd
    from sagemaker.huggingface.model import HuggingFacePredictor

    # attach to the endpoint created by the deployment step
    predictor = HuggingFacePredictor(endpoint_name='huggingface-pytorch-inference-2022-04-19-06-41-55-309')

    s = time.time()
    df = pd.read_csv('./data/hp/summary/news_summary_cleaned_small_test.csv')
    print('Source text:', df.loc[0, 'text'])
    print('Reference:', df.loc[0, 'headlines'])
    out = predictor.predict({
        'inputs': df.loc[0, 'text'],
        "parameters": {"max_length": 256},
    })
    e = time.time()  # e - s is the elapsed wall-clock time in seconds
    print('Model prediction:', out)



    Source text: Germany on Wednesday accused Vietnam of kidnapping a former Vietnamese oil executive Trinh Xuan Thanh, who allegedly sought asylum in Berlin, and taking him home to face accusations of corruption. Germany expelled a Vietnamese intelligence officer over the suspected kidnapping and demanded that Vietnam allow Thanh to return to Germany. However, Vietnam said Thanh had returned home by himself.
    Reference: Germany accuses Vietnam of kidnapping asylum seeker
    Model prediction: Germany accuses Vietnam of kidnapping ex-oil exec, taking him home
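
The request sent to the endpoint is a JSON document with an `inputs` field and an optional `parameters` field. The sketch below builds that payload and times a call using only the standard library. `fake_predict` is a stand-in for `predictor.predict`, and the response shape it returns is an assumption for illustration; `build_payload` and `timed` are hypothetical helper names.

```python
import json
import time


def build_payload(text, max_length=256):
    """Build the request body in the inputs/parameters shape used above."""
    return {"inputs": text, "parameters": {"max_length": max_length}}


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_predict(payload):
    """Stand-in for predictor.predict; the response shape is an assumption."""
    return [{"summary_text": payload["inputs"][:40]}]


payload = build_payload("Germany on Wednesday accused Vietnam of kidnapping ...")
body = json.dumps(payload)  # what a JSON serializer would put on the wire

out, elapsed = timed(fake_predict, payload)
print(body)
print(out, f"{elapsed:.6f}s")
```

Timing only the `predict` call, rather than the csv read as well, gives a cleaner estimate of endpoint latency.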


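That is the complete process of building a text summarization application with Amazon SageMaker. As shown, Amazon SageMaker combines conveniently with Hugging Face across the full workflow of building, training, and deploying NLP models. Only a training script and data need to be prepared; training and deployment can then be launched with a handful of commands. We will also publish approaches for more NLP tasks on Amazon SageMaker in the future, so stay tuned.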








