
HuggingFace dataset loading errors: how to work around huggingface.datasets failing to load datasets and metrics

datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

1. Error description

When using the Hugging Face datasets package, datasets and metrics fail to load. The first case is loading a previously downloaded and saved dataset, as described in the HuggingFace course:

from datasets import load_dataset

issues_dataset = load_dataset("json", data_files="issues/datasets-issues.jsonl", split="train")

The error:

datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Using custom data configuration default-950028611d2860c8 Downloading
and preparing dataset json/default to
[…]/.cache/huggingface/datasets/json/default-950028611d2860c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51…
Downloading data files: 100%|██████████| 1/1 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 500.63it/s]
Generating train split: 2619 examples [00:00, 7155.72
examples/s]Traceback (most recent call last): File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py”,
line 1831, in _prepare_split_single
writer.write_table(table) File “[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\arrow_writer.py”,
line 567, in write_table
pa_table = table_cast(pa_table, self._schema) File “[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 2282, in table_cast
return cast_table_to_schema(table, schema) File “[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 2241, in cast_table_to_schema
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()] File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 2241, in
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()] File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 1807, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks]) File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 1807, in
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks]) File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 2035, in cast_array_to_feature
arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()] File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 2035, in
arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()] File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 1809, in wrapper
return func(array, *args, **kwargs) File “[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 2101, in cast_array_to_feature
return array_cast(array, feature(), allow_number_to_str=allow_number_to_str) File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 1809, in wrapper
return func(array, *args, **kwargs) File “[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py”,
line 1990, in array_cast
raise TypeError(f"Couldn’t cast array of type {array.type} to {pa_type}“) TypeError: Couldn’t cast array of type timestamp[s] to
null The above exception was the direct cause of the following
exception: Traceback (most recent call last): File “C:\Program
Files\JetBrains\PyCharm
2022.1.3\plugins\python\helpers\pydev\pydevconsole.py”, line 364, in runcode
coro = func() File “”, line 1, in File “C:\Program Files\JetBrains\PyCharm
2022.1.3\plugins\python\helpers\pydev_pydev_bundle\pydev_umd.py”, line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script File “C:\Program Files\JetBrains\PyCharm
2022.1.3\plugins\python\helpers\pydev_pydev_imps_pydev_execfile.py”, line 18, in execfile
exec(compile(contents+”\n", file, ‘exec’), glob, loc) File “[…]\PycharmProjects\TransformersTesting\dataset_issues.py”, line
20, in
issues_dataset = load_dataset(“json”, data_files=“issues/datasets-issues.jsonl”, split=“train”) File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\load.py”,
line 1757, in load_dataset
builder_instance.download_and_prepare( File “[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py”,
line 860, in download_and_prepare
self._download_and_prepare( File “[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py”,
line 953, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs) File
“[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py”,
line 1706, in _prepare_split
for job_id, done, content in self._prepare_split_single( File “[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py”,
line 1849, in _prepare_split_single
raise DatasetGenerationError(“An error occurred while generating the dataset”) from e datasets.builder.DatasetGenerationError: An error
occurred while generating the dataset Generating train split: 2619
examples [00:19, 7155.72 examples/s]

The second case:

import datasets

dataset = datasets.load_dataset("yelp_review_full")

which then fails with:

ConnectionError Traceback (most recent call
last) /tmp/ipykernel_21708/3707219471.py in
----> 1 dataset=datasets.load_dataset(“yelp_review_full”)

myenv/lib/python3.8/site-packages/datasets/load.py in
load_dataset(path, name, data_dir, data_files, split, cache_dir,
features, download_config, download_mode, ignore_verifications,
keep_in_memory, save_infos, revision, use_auth_token, task, streaming,
**config_kwargs) 1658 1659 # Create a dataset builder
-> 1660 builder_instance = load_dataset_builder( 1661 path=path, 1662 name=name,

myenv/lib/python3.8/site-packages/datasets/load.py in
load_dataset_builder(path, name, data_dir, data_files, cache_dir,
features, download_config, download_mode, revision, use_auth_token,
**config_kwargs) 1484 download_config = download_config.copy() if download_config else DownloadConfig()
1485 download_config.use_auth_token = use_auth_token
-> 1486 dataset_module = dataset_module_factory( 1487 path, 1488 revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in
dataset_module_factory(path, revision, download_config, download_mode,
force_local_path, dynamic_modules_path, data_dir, data_files,
**download_kwargs) 1236 f"Couldn’t find ‘{path}’ on the Hugging Face Hub either: {type(e1).name}: {e1}"
1237 ) from None
-> 1238 raise e1 from None 1239 else: 1240 raise FileNotFoundError(

myenv/lib/python3.8/site-packages/datasets/load.py in
dataset_module_factory(path, revision, download_config, download_mode,
force_local_path, dynamic_modules_path, data_dir, data_files,
**download_kwargs) 1173 if path.count(“/”) == 0: # even though the dataset is on the Hub, we get it from GitHub for now
1174 # TODO(QL): use a Hub dataset module factory
instead of GitHub
-> 1175 return GithubDatasetModuleFactory( 1176 path, 1177 revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
531 revision = self.revision
532 try:
–> 533 local_path = self.download_loading_script(revision)
534 except FileNotFoundError:
535 if revision is not None or os.getenv(“HF_SCRIPTS_VERSION”, None) is not None:

myenv/lib/python3.8/site-packages/datasets/load.py in
download_loading_script(self, revision)
511 if download_config.download_desc is None:
512 download_config.download_desc = “Downloading builder script”
–> 513 return cached_path(file_path, download_config=download_config)
514
515 def download_dataset_infos_file(self, revision: Optional[str]) -> str:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in
cached_path(url_or_filename, download_config, **download_kwargs)
232 if is_remote_url(url_or_filename):
233 # URL, so get it from the cache (downloading if necessary)
–> 234 output_path = get_from_cache(
235 url_or_filename,
236 cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in
get_from_cache(url, cache_dir, force_download, proxies, etag_timeout,
resume_download, user_agent, local_files_only, use_etag, max_retries,
use_auth_token, ignore_url_params, download_desc)
580 _raise_if_offline_mode_is_enabled(f"Tried to reach {url}“)
581 if head_error is not None:
–> 582 raise ConnectionError(f"Couldn’t reach {url} ({repr(head_error)})”)
583 elif response is not None:
584 raise ConnectionError(f"Couldn’t reach {url} (error {response.status_code})")

ConnectionError: Couldn’t reach
https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/yelp_review_full/yelp_review_full.py
(ReadTimeout(ReadTimeoutError(“HTTPSConnectionPool(host=‘raw.githubusercontent.com’,
port=443): Read timed out. (read timeout=100)”)))

2. Solution

Reference: https://github.com/huggingface/datasets/issues/5422

Some of the solutions suggested in Google search results are to avoid load_dataset() and use a different API function instead; another option is to use streaming via the same function.
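For the DatasetGenerationError above (the timestamp[s] to null cast), one such alternative API path is to let pandas read the JSON Lines file and build the dataset from the resulting DataFrame, bypassing the json builder's schema inference. This is only a sketch of that idea and has not been verified against this particular file:

import pandas as pd
from datasets import Dataset

# Read the JSON Lines file with pandas, which tolerates columns that are
# entirely null in the first records better than the json builder's casting.
df = pd.read_json("issues/datasets-issues.jsonl", lines=True)

# Build a datasets.Dataset directly from the DataFrame, skipping the schema
# cast that raised "Couldn't cast array of type timestamp[s] to null".
issues_dataset = Dataset.from_pandas(df)
print(issues_dataset)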

The underlying problem is that the files cannot be downloaded directly because of network restrictions; for researchers in some countries or regions, the complicated network environment effectively breaks load_dataset's ability to download. The workaround is therefore to use load_from_disk instead of load_dataset.

See the official documentation:
https://huggingface.co/docs/datasets/v1.0.2/_modules/datasets/load.html

Analysis

This is clearly a matter of not being able to reach raw.githubusercontent.com.
If you can use a proxy, the best solution is simply to run the whole thing through the proxy.
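A minimal sketch of that, assuming a proxy is available (the address 127.0.0.1:7890 below is just a placeholder for whatever your proxy actually listens on):

import os

# Hypothetical proxy address; replace with your own proxy's host and port.
os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"

import datasets

# With the proxy variables set, load_dataset should be able to reach
# raw.githubusercontent.com and the data hosts as usual.
dataset = datasets.load_dataset("yelp_review_full")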

If using a proxy on the target machine is not convenient, here is the solution I used: download the data on a local machine that does have proxy access, then upload the files to the runtime environment. (Note that the local machine and the server can run different operating systems.)

I also tried downloading the dataset's Python loading script and uploading it to the server, but could not get that to work: the download links inside the script point to Google Cloud, and neither downloading that data and uploading it manually nor rewriting the links to the S3 files worked for me. If you know a workable approach along those lines, please tell me.
In short, what worked for me was to load the dataset locally first, save it to disk, upload the folder to the server, and then load the dataset from disk there.

Load the dataset locally and save it to the local disk (note that these are Windows-style paths):

import datasets

# Download on the local machine, where the network (or a proxy) works;
# cache_dir controls where the raw downloads are cached.
dataset = datasets.load_dataset("yelp_review_full", cache_dir="mypath\\data\\huggingfacedatasetscache")

# Save the whole DatasetDict to a folder that can be copied to the server.
dataset.save_to_disk("mypath\\data\\yelp_review_full_disk")


Upload the folder to the server:
This can be done with bypy and Baidu Netdisk (bypy is a command-line tool for uploading and downloading Baidu Netdisk files on Linux). First upload the folder to the 我的应用数据/bypy directory, then download it on the server (note that downdir downloads all of the files inside the remote folder into the local folder, rather than downloading the folder itself):

bypy downdir yelp_full_review_disk mypath/datasets/yelp_full_review_disk
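For completeness, the corresponding upload step on the local machine might look roughly like this; the paths are hypothetical and assume bypy is already installed and authorized, with the remote folder name matching the one passed to downdir above:

bypy upload mypath\data\yelp_review_full_disk yelp_full_review_disk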

Then, on the server, load the dataset from disk:

dataset = datasets.load_from_disk("mypath/datasets/yelp_full_review_disk")

After that the dataset can be used normally.
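As a quick sanity check (just a sketch; the split names are what yelp_review_full normally ships with), the object returned by load_from_disk is the same DatasetDict that was saved locally:

import datasets

dataset = datasets.load_from_disk("mypath/datasets/yelp_full_review_disk")
print(dataset)               # DatasetDict with 'train' and 'test' splits
print(dataset["train"][0])   # one example, e.g. {'label': ..., 'text': '...'}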

Note that, according to the datasets documentation, a dataset can also be saved directly to an S3FileSystem (https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.filesystems.S3FileSystem). I assume this is roughly a publicly accessible storage API similar to Google Cloud or Baidu Netdisk, and it is probably more convenient than saving locally and then transferring to the server.
I have not looked into this feature, so I did not use it.
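For reference, a rough sketch based on the v2.0.0 documentation linked above; it is untested here, the bucket name and credentials are placeholders, and dataset is the DatasetDict loaded earlier:

from datasets import load_from_disk
from datasets.filesystems import S3FileSystem

# Placeholder AWS credentials and bucket name.
s3 = S3FileSystem(key="aws_access_key_id", secret="aws_secret_access_key")

# Save the DatasetDict straight to S3 instead of the local disk...
dataset.save_to_disk("s3://my-bucket/yelp_review_full", fs=s3)

# ...and load it back on the server through the same filesystem object.
dataset = load_from_disk("s3://my-bucket/yelp_review_full", fs=s3)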

For metrics:
Code:

metric = datasets.load_metric("accuracy")

Error:

ConnectionError Traceback (most recent call last) /tmp/ipykernel_24141/2186493793.py in
----> 1 metric=datasets.load_metric(‘accuracy’)

myenv/lib/python3.8/site-packages/datasets/load.py in
load_metric(path, config_name, process_id, num_process, cache_dir,
experiment_id, keep_in_memory, download_config, download_mode,
revision, **metric_init_kwargs) 1390 “”" 1391
download_mode = DownloadMode(download_mode or
DownloadMode.REUSE_DATASET_IF_EXISTS)
-> 1392 metric_module = metric_module_factory( 1393 path, revision=revision, download_config=download_config,
download_mode=download_mode 1394 ).module_path

myenv/lib/python3.8/site-packages/datasets/load.py in
metric_module_factory(path, revision, download_config, download_mode,
force_local_path, dynamic_modules_path, **download_kwargs) 1322
except Exception as e2: # noqa: if it’s not in the cache, then it
doesn’t exist. 1323 if not isinstance(e1,
FileNotFoundError):
-> 1324 raise e1 from None 1325 raise FileNotFoundError( 1326 f"Couldn’t find a
metric script at {relative_to_absolute_path(combined_path)}. "

myenv/lib/python3.8/site-packages/datasets/load.py in
metric_module_factory(path, revision, download_config, download_mode,
force_local_path, dynamic_modules_path, **download_kwargs) 1310
elif is_relative_path(path) and path.count(“/”) == 0 and not
force_local_path: 1311 try:
-> 1312 return GithubMetricModuleFactory( 1313 path, 1314 revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
598 revision = self.revision
599 try:
–> 600 local_path = self.download_loading_script(revision)
601 revision = self.revision
602 except FileNotFoundError:

myenv/lib/python3.8/site-packages/datasets/load.py in
download_loading_script(self, revision)
592 if download_config.download_desc is None:
593 download_config.download_desc = “Downloading builder script”
–> 594 return cached_path(file_path, download_config=download_config)
595
596 def get_module(self) -> MetricModule:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in
cached_path(url_or_filename, download_config, **download_kwargs)
232 if is_remote_url(url_or_filename):
233 # URL, so get it from the cache (downloading if necessary)
–> 234 output_path = get_from_cache(
235 url_or_filename,
236 cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in
get_from_cache(url, cache_dir, force_download, proxies, etag_timeout,
resume_download, user_agent, local_files_only, use_etag, max_retries,
use_auth_token, ignore_url_params, download_desc)
580 _raise_if_offline_mode_is_enabled(f"Tried to reach {url}“)
581 if head_error is not None:
–> 582 raise ConnectionError(f"Couldn’t reach {url} ({repr(head_error)})”)
583 elif response is not None:
584 raise ConnectionError(f"Couldn’t reach {url} (error {response.status_code})")

ConnectionError: Couldn’t reach
https://raw.githubusercontent.com/huggingface/datasets/2.0.0/metrics/accuracy/accuracy.py
(ReadTimeout(ReadTimeoutError(“HTTPSConnectionPool(host=‘raw.githubusercontent.com’,
port=443): Read timed out. (read timeout=100)”)))

Metrics are easier to handle: just download the Python script to the local machine (no proxy is needed for this; I have not written a dedicated post about downloading GitHub files without a proxy, but see my post PyG的Planetoid无法直接下载Cora等数据集的3个解决方式 for similar workarounds), then point load_metric at the local file instead:

metric = datasets.load_metric("mypath/accuracy.py")
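Once loaded from the local script, the metric behaves the same as one loaded from the Hub; a minimal usage sketch (note that the accuracy script itself requires scikit-learn to be installed):

import datasets

# Load the metric from the locally saved accuracy.py script.
metric = datasets.load_metric("mypath/accuracy.py")

# Compute accuracy over toy predictions and references.
result = metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}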

References

  1. Documentation for the datasets loading methods: https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/loading_methods
  2. Documentation for datasets.Dataset.save_to_disk(): https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.Dataset.save_to_disk
  3. HuggingFace使用datasets加载数据时 出现ConnectionError 无法获得数据 可以将数据保存到本地 (zero requiem, CSDN): essentially the same method as mine, but using Google Colab to load and store the dataset: https://blog.csdn.net/weixin_43201090/article/details/123308940
  4. ConnectionError: Couldn‘t reach https://raw.githubuserc//huggingface/datasets/1.15.1/datasets/squad/ (随便写写诶, CSDN): this one seems to be based on an older datasets version; the dataset scripts are no longer stored at that location, so the method probably no longer works: https://blog.csdn.net/qq_38178543/article/details/122385135
  5. HuggingFace代码本地运行报错ConnectionError: Couldn‘t reach https://raw.githubuserc (愚昧之山绝望之谷开悟之坡, CSDN): I tried this method; after placing the Python script in the cache folder it still needed to download the data from Google Cloud, and even with that data placed in the cache folder it raised other errors I could not resolve, so I gave up on this approach: https://blog.csdn.net/qq_15821487/article/details/121069536
  6. HuggingFace 加载数据集报错 ConnectionError 无需GoogleColab (zero requiem, CSDN): similar situation to item 4: https://blog.csdn.net/weixin_43201090/article/details/123299618
  7. 使用datasets库加载glue数据集时load_dataset发生Connection Error问题解决方法 (j_thame_myhome, CSDN): upgrading datasets did not help in my case, since 2.0.0 is already the latest version: https://blog.csdn.net/j_thame_myhome/article/details/118756653