import torch
from datasets import load_dataset


class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        self.dataset = load_dataset(path="seamew/ChnSentiCorp", split=split)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]["text"]
        label = self.dataset[i]["label"]
        return text, label


dataset = Dataset("train")
print(len(dataset), dataset[0])
Console output:
Using the latest cached version of the module from C:\Users\admin\.cache\huggingface\modules\datasets_modules\datasets\seamew--ChnSentiCorp\1f242195a37831906957a11a2985a4329167e60657c07dc95ebe266c03fdfb85 (last modified on Fri Jul 7 21:11:26 2023) since it couldn't be found locally at seamew/ChnSentiCorp., or remotely on the Hugging Face Hub.
Using custom data configuration default
D:\Softwares\Anaconda3\Anaconda3\lib\site-packages\scipy\__init__.py:155: UserWarning: A NumPy version >=1.18.5 and <1.25.0 is required for this version of SciPy (detected version 1.25.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Downloading and preparing dataset chn_senti_corp/default to C:\Users\admin\.cache\huggingface\datasets\seamew___chn_senti_corp\default\0.0.0\1f242195a37831906957a11a2985a4329167e60657c07dc95ebe266c03fdfb85...
The error is raised when running the line train_path = dl_manager.download_and_extract(_TRAIN_DOWNLOAD_URL)
Error: ConnectionError
Cause: Google Drive cannot be reached
FileNotFoundError
Download these files locally with git or a similar command, then load the dataset with load_from_disk():

import torch
from datasets import load_from_disk


class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        # `split` is not used here; the local directory decides which data is loaded
        self.dataset = load_from_disk(
            "E:/Repo/NLP/23.7.7 Huggingface_Learn/ChnSentiCorp"
        )

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]["text"]
        label = self.dataset[i]["label"]
        return text, label


dataset = Dataset("train")
print(len(dataset), dataset[0])
Error again: File Not Found
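Before editing anything by hand, it may help to check which marker files are actually present in the local directory. The sketch below uses only the standard library. It assumes (from my understanding of the `datasets` library, not from anything confirmed in this post) that a single-split directory is recognised by `state.json` plus `dataset_info.json`, and a multi-split directory by `dataset_dict.json`. The helper name `describe_dataset_dir` is hypothetical.

```python
from pathlib import Path

# Assumption: these file names are what `datasets` looks for in a saved directory.
DATASET_FILES = ("state.json", "dataset_info.json")
DATASET_DICT_FILE = "dataset_dict.json"


def describe_dataset_dir(path):
    """Return a short diagnosis of what a loader would likely find in `path`."""
    root = Path(path)
    if not root.is_dir():
        return "missing directory"
    # A single saved split needs both marker files.
    if all((root / name).is_file() for name in DATASET_FILES):
        return "single dataset"
    # A directory holding several splits is marked by dataset_dict.json.
    if (root / DATASET_DICT_FILE).is_file():
        return "dataset dict"
    missing = [name for name in DATASET_FILES if not (root / name).is_file()]
    return "incomplete: missing " + ", ".join(missing)
```

Calling `describe_dataset_dir("E:/Repo/NLP/23.7.7 Huggingface_Learn/ChnSentiCorp")` would then report which of the expected files is absent, if any.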
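The state.json file: filename is the file in the directory that you want to load (here the train data), and _split is changed to train to match. No idea what the value after _fingerprint is…
{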
"_data_files": [
{
"filename": "chn_senti_corp-train.arrow"
}
],
"_fingerprint": "24c4fd9824d8b978",
"_format_columns": null,
"_format_kwargs": {},
"_format_type": null,
"_indexes": {},
"_output_all_columns": false,
"_split": "train"
}
The json file is as follows:
{
  "_data_files": [
    {
      # matches the file name in the figure above
      "filename": "chn_senti_corp-train.arrow"
    }
  ],
  "_fingerprint": "24c4fd9824d8b978",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_output_all_columns": false,
  # loads the training set; use 'validation' for the validation set, 'test' for the test set
  "_split": "train"
}
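The same fields could also be written from a script rather than typed by hand. The sketch below uses only the standard `json` module. `write_state_json` is a hypothetical helper, and the random 16-character hex value for `_fingerprint` is an assumption: as far as I know the library uses the fingerprint as an identifier for caching, but this post does not confirm that any unique value is accepted. Note that JSON has no comment syntax, so `#` annotations like the ones shown for explanation must not end up in the real file.

```python
import json
import secrets
from pathlib import Path


def write_state_json(dataset_dir, arrow_filename, split):
    """Write a minimal state.json mirroring the fields shown in the post."""
    state = {
        "_data_files": [{"filename": arrow_filename}],
        # Assumption: a random 16-hex-character identifier is acceptable here.
        "_fingerprint": secrets.token_hex(8),
        "_format_columns": None,
        "_format_kwargs": {},
        "_format_type": None,
        "_indexes": {},
        "_output_all_columns": False,
        "_split": split,
    }
    target = Path(dataset_dir) / "state.json"
    # Write UTF-8 explicitly so the result does not depend on the OS default encoding.
    target.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return target
```

For the case in this post the call would be `write_state_json(folder, "chn_senti_corp-train.arrow", "train")`, with `"validation"` or `"test"` for the other splits.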
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 43: inv
gbk does not work either
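To narrow down an error like this one, it can help to find out exactly which byte in which file breaks UTF-8 decoding. The sketch below is a generic standard-library diagnostic, and `first_invalid_utf8` is a hypothetical helper. One plausible but unconfirmed cause: if a Chinese comment such as `# 对应…` were added to a JSON file and the editor saved it as GBK, the first byte of 对 would be 0xb6, which matches the byte in the error message.

```python
def first_invalid_utf8(data):
    """Return (offset, byte) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Demo: a short fragment with a Chinese comment, encoded as GBK instead of UTF-8.
sample = "{ # 对应".encode("gbk")
result = first_invalid_utf8(sample)
if result is not None:
    offset, byte = result
    print(f"invalid byte {byte:#x} at offset {offset}")
```

On a real file the input would be something like `Path("state.json").read_bytes()`. If that turns out to be the culprit, re-saving the file as UTF-8 and removing the comments (JSON does not allow them) would be the thing to try.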