The T0 benchmark (also known as P3, the Public Pool of Prompts) is a large-scale, human-annotated instruction-tuning dataset introduced in the T0 paper (ICLR 2022). It collects multi-task data from the Hugging Face Hub and equips every task with human-written instructions from PromptSource.
The P3 dataset is hosted on the Hugging Face Hub: https://huggingface.co/datasets/bigscience/P3
However, after downloading you will find that every data file ends in .tfrecord, and the contents look very strange when opened.
What you actually get at this point are Git LFS (Large File Storage) files. Each one can be thought of as a pointer that records where the real data is stored; the data itself has not been downloaded and still lives on the remote server the pointer refers to.
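For reference, an LFS pointer file is just a small piece of text; it looks roughly like this (the hash and size are placeholders here):

version https://git-lfs.github.com/spec/v1
oid sha256:<64-character hash of the real file>
size <file size in bytes>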
To download P3 in full, follow these steps:
First, install git-lfs (official site: https://git-lfs.com/).
On macOS, brew install git-lfs installs it directly.
On Linux you typically need sudo privileges; I have not found a reliable way to build git-lfs from source. For details see: https://askubuntu.com/questions/799341/how-to-install-git-lfs-on-ubuntu-16-04
Next, clone P3 directly from the Hugging Face Hub:
git clone https://huggingface.co/datasets/bigscience/P3
After cloning, you will notice that the whole repository is actually quite small. As mentioned above, the real data has not been downloaded at this point; only the LFS pointer files have.
Next, change into the P3 root directory and use the following commands to make git-lfs download all the remote files that the LFS pointers refer to:
git lfs install   # initialize git-lfs; add --force if there are any errors
git lfs pull      # download all files pointed to by the LFS pointers
Then comes a long wait... the complete P3 data is on the order of several hundred GB.
Because the full dataset is so large, you can choose to download only the parts you need.
For example, I only wanted T0's held-out evaluation set (the test tasks), so there was no need to download the entire dataset. The following Python script deletes all the unneeded task folders before pulling (adapt it to your own needs):
# keep only the ANLI R1-R3, CB, COPA and RTE tasks
import os
import shutil

def get_directories(path):
    directories = []
    for entry in os.scandir(path):
        if entry.is_dir():
            directories.append(entry.name)
    return directories

def target_task_dir(directory):
    '''
    Return True only for a target task directory.
    '''
    directory = directory.lower()
    if "anli" in directory and ("r1" in directory or "r2" in directory or "r3" in directory):
        return True
    elif "cb" in directory:    # super_glue CB
        return True
    elif "copa" in directory:  # super_glue COPA
        return True
    elif "rte" in directory:   # super_glue RTE
        return True
    else:
        return False

path = "./data"
directories = get_directories(path)
for directory in directories:
    if not target_task_dir(directory):
        # delete this directory (including all files in it)
        shutil.rmtree(os.path.join(path, directory))
Put the script above in the P3 root directory and run it with python.
After deleting, remember to commit the changes in git:
git add -A
git commit -m "del unused tasks"
Then run git lfs pull again.
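As an aside (not the route taken here), git-lfs can also limit what it fetches by path pattern, so something like the following should pull only selected task folders without deleting anything; the glob patterns are purely illustrative:

git lfs pull --include="data/*copa*,data/*cb*"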
tfrecord is a binary file format specific to TensorFlow. Although all the data has now been downloaded, these files still cannot be opened directly in an editor, so there is no way to inspect their contents directly.
If your code uses TensorFlow, the downloaded tfrecord files can be used as-is to build a dataset class and train models. Moreover, compared with other common data formats, tfrecord not only saves storage but also makes training and inference more efficient (see this blog post: https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564).
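As a minimal sketch of that direct route (the task folder and the inputs/targets field names below are assumptions; read the matching info.*.json file, described later, for the real schema):

import tensorflow as tf

# hypothetical task folder; substitute any folder from your checkout
tfrecord_file = "data/super_glue_copa_best_option/test.tfrecord-00000-of-00001"

# assumed feature spec; the real field names and dtypes come from the info.*.json file
features = {
    "inputs": tf.io.VarLenFeature(tf.int64),
    "targets": tf.io.VarLenFeature(tf.int64),
}

dataset = tf.data.TFRecordDataset(tfrecord_file)
dataset = dataset.map(lambda record: tf.io.parse_single_example(record, features))

for example in dataset.take(1):
    # VarLenFeature parsing yields sparse tensors; .values holds the flat data
    print(example["inputs"].values.numpy())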
But if you are using another framework such as PyTorch, or want the data for some other purpose, the tfrecord files have to be converted into another format, such as JSON.
The following steps convert all tfrecord files of the T0 dataset to JSON:
The processing is essentially reverse engineering, so TensorFlow is needed.
I installed the GPU build of TensorFlow; the CPU build should work just as well. Look up the installation steps yourself.
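A quick sanity check before running the conversion (just an assumption about the setup; any reasonably recent TensorFlow 2.x should do):

import tensorflow as tf
print(tf.__version__)                          # any recent 2.x release should be fine
print(tf.config.list_physical_devices("GPU"))  # an empty list is fine; the conversion barely uses the GPU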
First, look at the file structure of the T0 dataset. Every task folder contains several files, two types of which are especially important:
xxx.tfrecord-00000-of-00001: the data files, containing the actual T0 examples.
info.xxx.json: the schema files, defining the data layout of the corresponding tfrecord file, e.g. field names and data types.
So for each task folder, we need to convert every xxx.tfrecord-00000-of-00001 file under that task to JSON, and the conversion relies on the corresponding info.xxx.json file.
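To make the schema concrete, the "features" section of an info.train.json looks roughly like this (the field names shown are typical of P3 tasks but vary from task to task, and the files usually carry additional keys that the conversion below simply ignores):

{
  "features": {
    "inputs": {"dtype": "int32"},
    "inputs_pretokenized": {"dtype": "string"},
    "targets": {"dtype": "int32"},
    "targets_pretokenized": {"dtype": "string"}
  }
}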
Here is the code (with fairly detailed comments). The whole conversion again takes quite a long time:
import os

# set the cuda devices; I did not find the GPUs actually being used (utilization stays at 0)
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4,5,6,7"

import json
import argparse
import tensorflow as tf
import numpy as np
from tqdm import tqdm


def process_tfrecord(tfrecord_file, feature_schema):
    dataset = tf.data.TFRecordDataset(tfrecord_file)
    # for bool fields, simply parse them as int64
    features = {key: tf.io.VarLenFeature(getattr(tf, feature_schema[key]["dtype"]))
                if feature_schema[key]["dtype"] != "bool" else tf.io.VarLenFeature(tf.int64)
                for key in feature_schema}
    data = []
    for record in dataset:
        try:
            example = tf.io.parse_single_example(record, features)
        except Exception:
            print("error in parsing the record: {}".format(record))
            print("the features are: {}".format(features))
            exit()
        entry = {}
        for key, value in example.items():
            if key in feature_schema:
                if feature_schema[key]["dtype"] in ["int64", "bool", "float", "float32"]:
                    entry[key] = value.values.numpy().tolist()
                else:
                    # string fields: decode the raw bytes to utf-8
                    entry[key] = np.array(value.values).tolist()
                    entry[key] = [v.decode('utf-8') for v in entry[key]]
        data.append(entry)
    return data


def process_one_file(data_file, info_file, save_file):
    # read the info (schema) file
    with open(info_file, "r") as f:
        info_data = json.load(f)
    feature_schema = info_data["features"]
    # only supports values in `float`, `int64`, `string`, and `bool` (parsed as int64)
    for key, value in feature_schema.items():
        # convert int32 to int64
        if value["dtype"] == "int32":
            feature_schema[key]["dtype"] = "int64"
    data = process_tfrecord(data_file, feature_schema)
    with open(save_file, "w") as f:
        json.dump(data, f, indent=2)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--meta_path", type=str, default="./data", help="path to the meta folder.")
    parser.add_argument("--overwrite", action="store_true", help="whether to overwrite the existing files.")
    args, unparsed = parser.parse_known_args()
    if unparsed:
        raise ValueError(unparsed)

    meta_path = args.meta_path
    all_subfolders = [f.path for f in os.scandir(meta_path) if f.is_dir()]  # all task folders under "./data"
    data_file_names = ["train.tfrecord-00000-of-00001", "validation.tfrecord-00000-of-00001", "test.tfrecord-00000-of-00001"]
    info_file_names = ["info.train.json", "info.validation.json", "info.test.json"]
    task_cnt = 0
    for idx, subfolder in tqdm(enumerate(all_subfolders), total=len(all_subfolders)):
        # process the subfolder only if it contains at least one of the three data files
        if any([os.path.exists(os.path.join(subfolder, data_file_name)) for data_file_name in data_file_names]):
            print("==> processing task {}...".format(subfolder))
            task_cnt += 1
            for data_file_name, info_file_name in zip(data_file_names, info_file_names):
                data_file = os.path.join(subfolder, data_file_name)
                info_file = os.path.join(subfolder, info_file_name)
                save_file = os.path.join(subfolder, data_file_name.replace(".tfrecord-00000-of-00001", ".json"))
                # skip splits that this task does not have
                if not (os.path.exists(data_file) and os.path.exists(info_file)):
                    continue
                # if the processed file already exists, skip it unless `--overwrite` is set
                if os.path.exists(save_file):
                    if args.overwrite:
                        print("~ overwriting the existing file: {}".format(save_file))
                        process_one_file(data_file, info_file, save_file)
                    else:
                        print("~ skipping the existing file: {}".format(save_file))
                else:
                    process_one_file(data_file, info_file, save_file)
        else:
            print("skipping the subfolder: {}, since it does not contain any data files".format(subfolder))
    print("\n *** processed {} non-empty task folders out of {} folders in total ***".format(task_cnt, len(all_subfolders)))


if __name__ == "__main__":
    main()
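Assuming the script above is saved as, say, tfrecord_to_json.py in the P3 root (the file name is arbitrary), run it as:

python tfrecord_to_json.py --meta_path ./data

Add --overwrite if you want to regenerate JSON files that already exist.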
In short, the code above uses the field information defined in info.xxx.json and tf.io.parse_single_example to restore each record to its original data types.
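Once a task has been converted, the output can be spot-checked with plain Python (the path and field names are illustrative; use whatever your converted folders actually contain):

import json

# hypothetical output produced by the conversion script above
with open("data/super_glue_copa_best_option/test.json", "r") as f:
    examples = json.load(f)

print(len(examples))
print(examples[0].keys())
# string fields come back as lists of decoded strings, token-id fields as lists of ints
print(examples[0].get("inputs_pretokenized"))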
The resulting JSON files then follow a format similar to what is shown on the dataset's official page.