datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
When using Hugging Face's datasets package, I ran into failures loading both datasets and metrics. Loading a previously downloaded and saved dataset, as described in the Hugging Face course:
issues_dataset = load_dataset("json", data_files="issues/datasets-issues.jsonl", split="train")
raises:
Using custom data configuration default-950028611d2860c8
Downloading and preparing dataset json/default to […]/.cache/huggingface/datasets/json/default-950028611d2860c8/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51…
Downloading data files: 100%|██████████| 1/1 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 500.63it/s]
Generating train split: 2619 examples [00:00, 7155.72 examples/s]
Traceback (most recent call last):
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py", line 1831, in _prepare_split_single
    writer.write_table(table)
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\arrow_writer.py", line 567, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 2282, in table_cast
    return cast_table_to_schema(table, schema)
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 2241, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 2241, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 1807, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 1807, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 2035, in cast_array_to_feature
    arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()]
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 2035, in <listcomp>
    arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()]
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 1809, in wrapper
    return func(array, *args, **kwargs)
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 2101, in cast_array_to_feature
    return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 1809, in wrapper
    return func(array, *args, **kwargs)
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\table.py", line 1990, in array_cast
    raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}")
TypeError: Couldn't cast array of type timestamp[s] to null

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2022.1.3\plugins\python\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2022.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2022.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "[…]\PycharmProjects\TransformersTesting\dataset_issues.py", line 20, in <module>
    issues_dataset = load_dataset("json", data_files="issues/datasets-issues.jsonl", split="train")
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\load.py", line 1757, in load_dataset
    builder_instance.download_and_prepare(
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py", line 860, in download_and_prepare
    self._download_and_prepare(
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py", line 953, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py", line 1706, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "[…]\miniconda3\envs\HuggingFace\lib\site-packages\datasets\builder.py", line 1849, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
Generating train split: 2619 examples [00:19, 7155.72 examples/s]
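The TypeError at the root of this trace ("Couldn't cast array of type timestamp[s] to null") typically means some field is null in every row of the first chunk, so Arrow infers its type as null; when a later chunk contains real values (here timestamps), the cast fails. A small stdlib sketch to spot such fields in a JSONL file (a hypothetical helper, not part of datasets; the chunk size is an assumption):

```python
import json

def fields_null_in_first_chunk(path, chunk_rows=1000):
    """Return fields that are null throughout the first `chunk_rows` rows
    of a JSONL file but hold real values later.  Since the Arrow schema is
    inferred from an initial chunk, such fields get type `null` and later
    non-null rows fail to cast."""
    null_in_head, nonnull_in_head, valued_later = set(), set(), set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            for key, value in json.loads(line).items():
                if i < chunk_rows:
                    # Record whether the head ever shows a real value.
                    (null_in_head if value is None else nonnull_in_head).add(key)
                elif value is not None:
                    valued_later.add(key)
    # Null for the whole head, but non-null somewhere afterwards.
    return (null_in_head - nonnull_in_head) & valued_later
```

Any field this returns is a candidate for the cast failure; dropping it or declaring explicit features before loading are the usual workarounds.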
And also:
import datasets
dataset=datasets.load_dataset("yelp_review_full")
then raises:
ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_21708/3707219471.py in <module>
----> 1 dataset=datasets.load_dataset("yelp_review_full")

myenv/lib/python3.8/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1658
   1659     # Create a dataset builder
-> 1660     builder_instance = load_dataset_builder(
   1661         path=path,
   1662         name=name,

myenv/lib/python3.8/site-packages/datasets/load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
   1484     download_config = download_config.copy() if download_config else DownloadConfig()
   1485     download_config.use_auth_token = use_auth_token
-> 1486     dataset_module = dataset_module_factory(
   1487         path,
   1488         revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1236                         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1237                     ) from None
-> 1238                 raise e1 from None
   1239         else:
   1240             raise FileNotFoundError(

myenv/lib/python3.8/site-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1173             if path.count("/") == 0:  # even though the dataset is on the Hub, we get it from GitHub for now
   1174                 # TODO(QL): use a Hub dataset module factory instead of GitHub
-> 1175                 return GithubDatasetModuleFactory(
   1176                     path,
   1177                     revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
    531         revision = self.revision
    532         try:
--> 533             local_path = self.download_loading_script(revision)
    534         except FileNotFoundError:
    535             if revision is not None or os.getenv("HF_SCRIPTS_VERSION", None) is not None:

myenv/lib/python3.8/site-packages/datasets/load.py in download_loading_script(self, revision)
    511         if download_config.download_desc is None:
    512             download_config.download_desc = "Downloading builder script"
--> 513         return cached_path(file_path, download_config=download_config)
    514
    515     def download_dataset_infos_file(self, revision: Optional[str]) -> str:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/yelp_review_full/yelp_review_full.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))
Reference: https://github.com/huggingface/datasets/issues/5422
Some suggestions found via Google are to avoid load_dataset() and use a different API function instead, or to achieve the same result with streaming.
The problem here is that the download cannot proceed due to the well-known network restrictions: for researchers in certain countries or regions, the complex network environment effectively disables load_dataset's ability to download. So the solution is to use load_from_disk instead of load_dataset.
See the official documentation:
https://huggingface.co/docs/datasets/v1.0.2/_modules/datasets/load.html
Clearly the root cause is being unable to reach raw.githubusercontent.com.
If you can use a proxy, the simplest fix is to run the whole workflow behind the proxy.
When a proxy is not directly available on the server, here is the solution I used: run the download through a proxy on a local machine, then upload the files to the runtime environment. (Note that the local machine and the server may run different operating systems.)
I tried downloading the loading script itself and uploading it to the server, but after much fiddling it still failed: the data download links inside that Python file point to Google Cloud, and directly downloading that data and uploading it did not work either, nor did changing the download link to the S3 file. In short, it did not work; if you know a workable way, please tell me.
Roughly speaking, my successful approach was: load the dataset locally, save it to disk, upload the folder to the server, and load the dataset from disk there.
Load the dataset locally and save it to local disk (note that this is a Windows-style path):
import datasets
dataset=datasets.load_dataset("yelp_review_full",cache_dir='mypath\\data\\huggingfacedatasetscache')
dataset.save_to_disk('mypath\\data\\yelp_review_full_disk')
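A side note on the paths above: in an ordinary Python string a single backslash can start an escape sequence, so Windows paths written with single backslashes may be silently corrupted, which is why the doubled backslashes matter. A quick illustration (the path is made up):

```python
# '\t' and '\n' below are parsed as escape sequences (tab, newline),
# so this is NOT the path that was typed:
bad = 'C:\temp\new_data'
# A raw string (or doubled backslashes, or forward slashes) keeps them literal:
good = r'C:\temp\new_data'
assert '\t' in bad and '\n' in bad        # tab and newline snuck into the path
assert '\t' not in good and '\n' not in good
```

Using raw strings throughout avoids this class of bug entirely.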
Upload the folder to the server:
You can use bypy with Baidu Netdisk; bypy is a command-line tool for uploading and downloading Baidu Netdisk files from Linux.
First upload the folder to the "My App Data/bypy" directory on Baidu Netdisk, then download it on the server (note that downloading a folder copies all files inside the remote folder into the local folder, rather than downloading the folder itself):
bypy downdir yelp_review_full_disk mypath/datasets/yelp_review_full_disk
Then load the dataset from disk on the server:
dataset=datasets.load_from_disk("mypath/datasets/yelp_review_full_disk")
After that the dataset works normally.
Note that, according to the datasets documentation, a dataset can also be saved directly to an S3FileSystem (https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.filesystems.S3FileSystem). I assume this is a publicly downloadable file API, similar to Google Cloud or Baidu Netdisk? It would probably be more convenient than saving locally and then copying to the server.
I have not looked into this feature, so I did not use it.
For the metric:
Code:
metric=datasets.load_metric('accuracy')
Error:
ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_24141/2186493793.py in <module>
----> 1 metric=datasets.load_metric('accuracy')

myenv/lib/python3.8/site-packages/datasets/load.py in load_metric(path, config_name, process_id, num_process, cache_dir, experiment_id, keep_in_memory, download_config, download_mode, revision, **metric_init_kwargs)
   1390     """
   1391     download_mode = DownloadMode(download_mode or DownloadMode.REUSE_DATASET_IF_EXISTS)
-> 1392     metric_module = metric_module_factory(
   1393         path, revision=revision, download_config=download_config, download_mode=download_mode
   1394     ).module_path

myenv/lib/python3.8/site-packages/datasets/load.py in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
   1322         except Exception as e2:  # noqa: if it's not in the cache, then it doesn't exist.
   1323             if not isinstance(e1, FileNotFoundError):
-> 1324                 raise e1 from None
   1325             raise FileNotFoundError(
   1326                 f"Couldn't find a metric script at {relative_to_absolute_path(combined_path)}. "

myenv/lib/python3.8/site-packages/datasets/load.py in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
   1310     elif is_relative_path(path) and path.count("/") == 0 and not force_local_path:
   1311         try:
-> 1312             return GithubMetricModuleFactory(
   1313                 path,
   1314                 revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
    598         revision = self.revision
    599         try:
--> 600             local_path = self.download_loading_script(revision)
    601             revision = self.revision
    602         except FileNotFoundError:

myenv/lib/python3.8/site-packages/datasets/load.py in download_loading_script(self, revision)
    592         if download_config.download_desc is None:
    593             download_config.download_desc = "Downloading builder script"
--> 594         return cached_path(file_path, download_config=download_config)
    595
    596     def get_module(self) -> MetricModule:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/metrics/accuracy/accuracy.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))
The metric case is simpler: just download that Python file locally (no proxy is needed for this; I have not written a dedicated post on proxy-free ways to download GitHub files, but see "Three ways to fix PyG's Planetoid failing to download Cora and other datasets"), then point load_metric at the local file:
metric=datasets.load_metric('mypath/accuracy.py')
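If even downloading the script is inconvenient, the accuracy number itself is trivial to reproduce by hand. A minimal stand-in (plain Python, not the datasets Metric API, offered only as a sketch):

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the references --
    the same quantity the 'accuracy' metric script reports."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

This obviously does not cover other metrics, but for accuracy it removes the network dependency entirely.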