Formatting self-cognition data for ChatGLM fine-tuning
https://github.com/hiyouga/ChatGLM-Efficient-Tuning
cd data
self_cognition.json
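#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Read self_cognition.json, convert each record, and write the result to a new file
import json
import os

# Load the JSON list from self_cognition.json
with open('self_cognition.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Build the content and summary fields
def process_data(item):
    # Map instruction to content and output to summary, stripping spaces
    item['content'] = item['instruction'].replace(' ', '')
    item['summary'] = item['output'].replace('<NAME>', 'AI小木').replace('<AUTHOR>', '小吕').replace(' ', '')
    return item

# Make sure the output directory exists before writing into it
os.makedirs('self_cognition', exist_ok=True)

# Write the processed data to the output file, one JSON object per line
with open('self_cognition/train.json', 'w', encoding='utf-8') as f:
    for item in data:
        process_item = process_data(item)
        # json.dumps escapes quotes and newlines, which manual string concatenation would not
        record = {'content': process_item['content'], 'summary': process_item['summary']}
        f.write(json.dumps(record, ensure_ascii=False))
        f.write('\n')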
Name: AI小木
Author: 小吕
Replace both with your own values.
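To make the name and author easier to swap, the substitution can be pulled into a helper that takes them as arguments. This is only a minimal sketch: the function name convert_record and its parameters are hypothetical, the sample record is made up, and the placeholders <NAME> and <AUTHOR> are the ones used in the script above.

```python
import json

def convert_record(item, name, author):
    """Map one self_cognition record to the content/summary layout."""
    # Fill in the placeholders first, then strip spaces as the script above does
    summary = item["output"].replace("<NAME>", name).replace("<AUTHOR>", author)
    return {
        "content": item["instruction"].replace(" ", ""),
        "summary": summary.replace(" ", ""),
    }

# Made-up sample record, for illustration only
sample = {"instruction": "你是谁?", "input": "", "output": "我是 <NAME>,由 <AUTHOR> 开发。"}
line = json.dumps(convert_record(sample, "AI小木", "小吕"), ensure_ascii=False)
print(line)
```

Note that spaces are stripped after substitution, so a name or author containing spaces would lose them too.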
python self_process.py
My train.json and dev.json are identical for now; I will sort that out later.
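With identical files, scores on dev.json are measured on data the model has already seen during training. One possible way to separate the two later is a seeded random split. The sketch below is not part of the original workflow; split_records, dev_ratio and seed are hypothetical names, and the records are placeholder data.

```python
import random

def split_records(records, dev_ratio=0.1, seed=42):
    """Shuffle a copy of records and split it into (train, dev) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one dev record whenever there is more than one record overall
    dev_size = max(1, int(len(shuffled) * dev_ratio)) if len(shuffled) > 1 else 0
    return shuffled[dev_size:], shuffled[:dev_size]

# Placeholder data; in practice the records would be read from train.json
records = [{"content": f"q{i}", "summary": f"a{i}"} for i in range(20)]
train, dev = split_records(records)
print(len(train), len(dev))
```

Each list could then be written out one JSON object per line, in the same way as the conversion script does.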
data/
├── dataset_info.json
└── self_cognition/
    ├── dev.json
    └── train.json
Next, edit dataset_info.json and add the following two entries so that the training framework can recognize the custom datasets.
,
"self_cognition_train": {
  "file_name": "self_cognition/train.json",
  "columns": {
    "prompt": "content",
    "query": "",
    "response": "summary",
    "history": ""
  }
},
"self_cognition_dev": {
  "file_name": "self_cognition/dev.json",
  "columns": {
    "prompt": "content",
    "query": "",
    "response": "summary",
    "history": ""
  }
}
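Before starting a training run, it can help to confirm that each new entry points at a real file and that the mapped column names actually appear in the data. The following is a rough sketch under two assumptions: the data files hold one JSON object per line, as written by the conversion script, and the helper name check_entry is hypothetical rather than anything provided by the training framework.

```python
import json
import os

def check_entry(dataset_info, name, data_dir="."):
    """Return a list of problems found for one dataset_info.json entry."""
    entry = dataset_info.get(name)
    if entry is None:
        return [f"{name}: entry not found"]
    path = os.path.join(data_dir, entry["file_name"])
    if not os.path.isfile(path):
        return [f"{name}: missing file {path}"]
    with open(path, "r", encoding="utf-8") as f:
        first_line = f.readline()
    if not first_line.strip():
        return [f"{name}: {path} is empty"]
    first = json.loads(first_line)
    problems = []
    # Only non-empty column mappings need a matching key in the data
    for role, column in entry["columns"].items():
        if column and column not in first:
            problems.append(f"{name}: column '{column}' ({role}) not in data")
    return problems
```

Run from the data directory with the parsed contents of dataset_info.json, check_entry(info, "self_cognition_train") returns an empty list when the entry, the file, and the column names all line up.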