# Import fetch_20newsgroups to load the 20 Newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
# Import pandas for data handling
import pandas as pd
# Import the openai library, which provides the fine-tuning tools used later
import openai
# The two newsgroup categories to load
categories = ['rec.sport.baseball', 'rec.sport.hockey']
# Load the training subset for these categories, shuffled with a fixed random_state of 42 for reproducibility
sports_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)
The newsgroup dataset can be loaded with sklearn. First, let's look at the data itself:
# Print the first example in the sports dataset
print(sports_dataset['data'][0])
From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th. There are ALWAYS tickets available! (Though they are playing Toronto, and many Toronto fans make the trip to Cleveland as it is easier to get tickets in Cleveland than in Toronto. Either way, I seriously doubt they will sell out until the end of the season.)

--
Doug Bank                    Private Systems Division
dougb@ecs.comm.mot.com       Motorola Communications Sector
dougb@nwu.edu                Schaumburg, Illinois
dougb@casbah.acns.nwu.edu    708-576-8207
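# Look up the category name of the first post
sports_dataset.target_names[sports_dataset['target'][0]] # returns a string: the category the first post belongs to
'rec.sport.baseball'
# Count the examples in the dataset: total, baseball (target 0), and hockey (target 1)
len_all, len_baseball, len_hockey = len(sports_dataset.data), len([e for e in sports_dataset.target if e == 0]), len([e for e in sports_dataset.target if e == 1])
# Print the total, baseball, and hockey example counts
print(f"Total examples: {len_all}, Baseball examples: {len_baseball}, Hockey examples: {len_hockey}")
Total examples: 1197, Baseball examples: 597, Hockey examples: 600
One sample from the baseball category is shown above. It is an email sent to a mailing list. In total we have 1197 examples, split almost evenly between the two sports (597 baseball, 600 hockey).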
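The balance check above can also be written with a counter keyed by class name. This is a minimal sketch: the `targets` and `target_names` values are small hypothetical stand-ins for `sports_dataset.target` and `sports_dataset.target_names`, and `class_counts` is a helper name introduced here for illustration.

```python
from collections import Counter

# Hypothetical stand-ins for sports_dataset.target_names and sports_dataset.target
target_names = ["rec.sport.baseball", "rec.sport.hockey"]
targets = [0, 1, 0, 1, 0, 1, 1]


def class_counts(targets, target_names):
    """Return a mapping from class name to the number of examples with that label."""
    counts = Counter(targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}


counts = class_counts(targets, target_names)
total = sum(counts.values())
for name, n in counts.items():
    # Print each class with its share of the dataset
    print(f"{name}: {n} ({n / total:.1%})")
```

A roughly even split matters here because accuracy on a heavily skewed dataset can look good even for a model that always predicts the majority class.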
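We transform the dataset into a pandas DataFrame with one column for the prompt and one for the completion. The prompt contains the email from the mailing list, and the completion is the name of the sport, either hockey or baseball. For demonstration purposes and faster fine-tuning, the data can be limited to 300 examples; note that the [:300] slice in the code below is commented out, so all 1197 examples are kept here. In a real use case, more examples generally give better performance.
# Derive each label from the last dot-separated part of its category name, e.g. 'rec.sport.hockey' -> 'hockey'
labels = [sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
# Strip leading and trailing whitespace from every text
texts = [text.strip() for text in sports_dataset['data']]
# Pair the texts with their labels and build a DataFrame with 'prompt' and 'completion' columns
df = pd.DataFrame(zip(texts, labels), columns = ['prompt','completion']) #[:300]
# Show the first few rows of the DataFrame
df.head()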
| | prompt | completion |
|---|---|---|
| 0 | From: dougb@comm.mot.com (Doug Bank)\nSubject:... | baseball |
| 1 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | hockey |
| 2 | From: rudy@netcom.com (Rudy Wade)\nSubject: Re... | baseball |
| 3 | From: monack@helium.gas.uug.arizona.edu (david... | hockey |
| 4 | Subject: Let it be Known\nFrom: <ISSBTL@BYUVM.... | baseball |
Both baseball and hockey are single tokens. We save the dataset as a JSONL file.
# Save the DataFrame in JSON Lines format: one JSON record per line
df.to_json("sport2.jsonl", orient='records', lines=True)
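To make the file format concrete, here is a small sketch of what a JSONL file holds: one JSON object per line, each with a prompt and a completion. It uses only the standard json module rather than pandas, the two records are made up for illustration, and `to_jsonl` and `from_jsonl` are hypothetical helper names.

```python
import json

# Hypothetical records shaped like the rows of df (not taken from the dataset)
records = [
    {"prompt": "Subject: Who pitches tonight?", "completion": "baseball"},
    {"prompt": "Subject: Power play stats", "completion": "hockey"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
print(jsonl_text)

# Reading the text back should give the original records
roundtrip = from_jsonl(jsonl_text)
```

Newlines inside a prompt are escaped by the JSON encoder, so every record still occupies exactly one line of the file.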
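We can now use a data preparation tool, which suggests a few improvements to our dataset before fine-tuning. Before launching the tool, we update the openai library to make sure we are using the latest data preparation tool. We also pass the -q option, which auto-accepts all suggestions.
# Install the latest version of the openai library
!pip install --upgrade openai
# Run the fine_tunes.prepare_data command-line tool on the sport2.jsonl data file
# -f names the data file to analyze; -q auto-accepts all suggested changes
!openai tools fine_tunes.prepare_data -f sport2.jsonl -q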
Analyzing...

- Your file contains 1197 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 11 long examples [Y/n]: Y
- [Recommended] Add a suffix separator `\n\n###\n\n` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y

Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `sport2_prepared_train.jsonl` and `sport2_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt.
Once your model starts training, it'll approximately take 30.8 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
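The tool helpfully suggests a few improvements to the dataset and splits it into a training set and a validation set.
A suffix between the prompt and the completion is needed to tell the model that the input text has ended and that it should now predict the class. Because every example uses the same separator, the model learns that it should predict either baseball or hockey right after it.
A whitespace prefix in the completion is useful because most word tokens are tokenized with a leading space.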
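The two formatting conventions can be sketched in a few lines. This is an illustration of the idea only, not the data preparation tool's actual implementation: the separator string is the one reported in the tool output, while `format_example` and `format_inference_prompt` are hypothetical helper names and the sample prompt is made up.

```python
# Separator string reported by the data preparation tool
SEPARATOR = "\n\n###\n\n"


def format_example(prompt, completion):
    """Append the separator to the prompt and prefix the completion with one space."""
    return {
        "prompt": prompt + SEPARATOR,
        "completion": " " + completion.strip(),
    }


def format_inference_prompt(text):
    """At inference time the prompt has to end with the same separator used in training."""
    return text + SEPARATOR


example = format_example("Subject: Who pitches tonight?", "baseball")
print(repr(example["prompt"]))
print(repr(example["completion"]))
```

Using the same helper for training data and for inference prompts keeps the two in sync, which is the point of the tool's reminder that prompts must end with the indicator string.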
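The tool also recognizes that this is most likely a classification task, so it suggests splitting the dataset into a training set and a validation set. This lets us easily measure the expected performance on new data.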
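A held-out split of this kind can be sketched as follows. How the tool itself chooses the validation examples is not shown in this article, so the 20% fraction, the fixed seed, and the helper name `train_valid_split` are all assumptions made for illustration.

```python
import random


def train_valid_split(examples, valid_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical examples in the prepared prompt/completion format
examples = [
    {"prompt": f"post {i}", "completion": " baseball" if i % 2 else " hockey"}
    for i in range(10)
]

train, valid = train_valid_split(examples)
print(len(train), len(valid))
```

The validation examples are never used for training, so metrics computed on them estimate how the model behaves on posts it has not seen.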
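The tool suggests running the following command to train on the dataset. Since this is a classification task, we want to know how well the model generalizes on the provided validation set for our classification use case. For this, the tool suggests adding --compute_classification_metrics --classification_positive_class " baseball" so that classification metrics are computed.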
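To show what a positive class means for such metrics, here is a minimal sketch using the standard definitions of accuracy, precision, recall, and F1. Exactly which metrics the CLI reports is not listed in this article, so this is an illustration under that assumption; `binary_metrics` is a hypothetical helper and the labels are made up.

```python
def binary_metrics(y_true, y_pred, positive=" baseball"):
    """Compute accuracy, precision, recall, and F1 with respect to the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical validation labels and model predictions (note the leading space)
y_true = [" baseball", " baseball", " hockey", " hockey", " baseball"]
y_pred = [" baseball", " hockey", " hockey", " baseball", " baseball"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The positive class is written with its leading space because that is how the completions appear in the prepared files.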
We can copy the suggested command directly from the CLI tool. We add -m ada to fine-tune the cheaper and faster ada model, which on classification use cases usually performs comparably to slower and more expensive models.
# Fine-tune a model for sport classification with the OpenAI CLI
# The command takes two input files, "sport2_prepared_train.jsonl" and "sport2_prepared_valid.jsonl", which contain the training and validation data respectively.
# The "--compute_classification_metrics" flag computes classification metrics during the fine-tuning process.
# The "--classification_positive_class" flag is set to " baseball" (with the leading space added by the data preparation tool) to mark it as the positive class.
# The "-m ada" flag selects the base model to fine-tune, in this case ada.
!openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball" -m ada
Upload progress: 100%|████████████████████| 1.52M/1.52M [00:00<00:00, 1.81Mit/s]
Uploaded file from sport2_prepared_train.jsonl: file-Dxx2xJqyjcwlhfDHpZdmCXlF
Upload progress: 100%|███████████████████████| 388k/388k [00:00<00:00, 507kit/s]
Uploaded file from sport2_prepared_valid.jsonl: file-Mvb8YAeLnGdneSAFcfiVcgcN
Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
Streaming events until fine-tuning is complete...
(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2021-07-30 13:15:50] Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
[2021-07-30 13:15:52] Fine-tune enqueued. Queue number: 0
[2021-07-30 13:15:56] Fine-tune started
[2021-07-30 13:18:55] Completed epoch 1/4
[2021-07-30 13:20:47] Completed epoch 2/4
[2021-07-30 13:22:40] Completed epoch 3/4
[2021-07-30 13:24:31] Completed epoch 4/4
[2021-07-30 13:26:22] Uploaded model: ada:ft-openai-2021-07-30-12-26-20
[2021-07-30 13:26:27] Uploaded result file: file-6Ki9RqLQwkChGsr9CHcr1ncg
[2021-07-30 13:26:28] Fine-tune succeeded
Job complete! Status: succeeded