Repository: https://github.com/microsoft/unilm
Step 1: download the datasets
Dataset 1: COCO 2014 train images and 2014 val images
Dataset 2: the Karpathy-split caption annotations (https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip)
Arrange the files as follows:
```
/path/to/your_data/
    train2014/
        COCO_train2014_000000000009.jpg
        ...
    val2014/
        COCO_val2014_000000000042.jpg
        ...
    dataset_coco.json
```
Note: dataset_coco.json is the Karpathy caption annotation file for COCO. To train on your own data, you must first build your dataset into a caption file of the same format.
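As a hedged sketch of what "the same format" means: the field names below mirror what the indexing code further down actually reads (`images[].split`, `filepath`, `filename`, `cocoid`, `sentences[].raw`). The image names and captions here are hypothetical placeholders; substitute your own.

```python
import json

# Hypothetical records for your own data: (folder, filename, captions, split).
my_data = [
    ("train2014", "my_image_0001.jpg", ["a dog running on grass"], "train"),
    ("val2014", "my_image_0002.jpg", ["a cat sitting on a sofa"], "val"),
]

images = []
for idx, (folder, fname, captions, split) in enumerate(my_data):
    images.append({
        "filepath": folder,          # subdirectory under /path/to/your_data
        "filename": fname,
        "split": split,              # "train", "restval", "val", or "test"
        "cocoid": idx,               # any unique integer id
        "sentences": [{"raw": c} for c in captions],
    })

# Write a minimal Karpathy-style annotation file.
with open("dataset_coco.json", "w", encoding="utf-8") as f:
    json.dump({"images": images}, f)
```
The indexing code ignores any extra keys in the real dataset_coco.json (e.g. pre-tokenized sentences), so this minimal subset is enough for it to run.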
Download the sentencepiece model used for tokenization: (https://conversationhub.blob.core.windows.net/beit-share-public/beit3/sentencepiece/beit3.spm)
Generate the index files:
```python
from datasets import CaptioningDataset
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")

CaptioningDataset.make_coco_captioning_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
)
```
Processing flow: read dataset_coco.json -> tokenize each caption -> write a coco_captioning.<split>.jsonl index file.
The underlying code:
```python
import json
import os


def _make_captioning_coco_karpathy_dataset_index(
    data_path,
    tokenizer,
    split=("train", "restval"),
    split_name="train",
):
    coco_karpathy_split_json_file = os.path.join(data_path, "dataset_coco.json")
    items = []
    image_counter = set()
    print("read %s" % coco_karpathy_split_json_file)
    with open(coco_karpathy_split_json_file, mode="r", encoding="utf-8") as reader:
        data = json.loads(reader.read())
        for item in data["images"]:
            if item["split"] in split:
                image_path = os.path.join(item["filepath"], item["filename"])
                if item["split"] in ["train", "restval"]:
                    # Each image has several caption sentences, e.g. "a woman
                    # wearing a net on her head cutting a cake"; every caption
                    # is appended to items as its own image-text pair.
                    for sent in item["sentences"]:
                        tokens = tokenizer.tokenize(sent["raw"])
                        token_ids = tokenizer.convert_tokens_to_ids(tokens)
                        items.append({
                            "image_path": image_path,
                            "text_segment": token_ids,
                            "image_id": item["cocoid"],
                        })
                else:
                    # val/test images are indexed without a caption.
                    items.append({
                        "image_path": image_path,
                        "text_segment": None,
                        "image_id": item["cocoid"],
                    })
                if image_path not in image_counter:
                    image_counter.add(image_path)
    print("Find %d images and %d image-text pairs for karpathy dataset %s split !" %
          (len(image_counter), len(items), split_name))
    index_file = os.path.join(data_path, "coco_captioning.%s.jsonl" % split_name)
    _write_data_into_jsonl(items, index_file)
```
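The function above ends by calling `_write_data_into_jsonl`, a helper that is not shown in this snippet. A minimal sketch consistent with its use here (one JSON object per line, the .jsonl index format); the example items below are hypothetical:

```python
import json

def _write_data_into_jsonl(items, jsonl_file):
    # Write one JSON object per line -- the .jsonl index the dataloader reads.
    with open(jsonl_file, mode="w", encoding="utf-8") as writer:
        for item in items:
            writer.write(json.dumps(item) + "\n")
    print("Write %s with %d items !" % (jsonl_file, len(items)))

# Hypothetical usage: two caption entries for the same image, as the
# indexing function would produce for a train-split image.
items = [
    {"image_path": "train2014/COCO_train2014_000000000009.jpg",
     "text_segment": [10, 20, 30], "image_id": 9},
    {"image_path": "train2014/COCO_train2014_000000000009.jpg",
     "text_segment": [11, 21, 31], "image_id": 9},
]
_write_data_into_jsonl(items, "coco_captioning.train.jsonl")
```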