I am running [Bert-Thor from model_zoo](https://gitee.com/mindspore/mindspore/tree/r1.0/model_zoo/official/nlp/bert_thor) on a single Ascend 910 card in an Atlas 800,
and ran into the following problem:
```
epoch: 30, step: 29800, outputs are [18.077267], total_time_span is 20.93736958503723, step_time_span is 0.21148858166704274
epoch: 30, step: 29801, outputs are [11.092405], total_time_span is 0.5619204044342041, step_time_span is 0.5619204044342041
epoch: 30, step: 29900, outputs are [10.516598], total_time_span is 20.941352367401123, step_time_span is 0.21152881179193053
epoch: 30, step: 29901, outputs are [12.539895], total_time_span is 0.5634746551513672, step_time_span is 0.5634746551513672
epoch: 30, step: 30000, outputs are [3.313356e+13], total_time_span is 20.768070459365845, step_time_span is 0.20977848948854388
Epoch time: 225452.528, per step time: 225.453
epoch: 31, step: 30001, outputs are [2.1173654e+13], total_time_span is 0.5580630302429199, step_time_span is 0.5580630302429199
epoch: 31, step: 30100, outputs are [5.174134e+13], total_time_span is 20.744428873062134, step_time_span is 0.2095396855864862
epoch: 31, step: 30101, outputs are [4.44978e+13], total_time_span is 0.559708833694458, step_time_span is 0.559708833694458
epoch: 31, step: 30200, outputs are [8.108476e+13], total_time_span is 20.74749517440796, step_time_span is 0.2095706583273531
epoch: 31, step: 30201, outputs are [9.724299e+13], total_time_span is 0.5609047412872314, step_time_span is 0.5609047412872314
epoch: 31, step: 30300, outputs are [1.0650323e+14], total_time_span is 20.749968767166138, step_time_span is 0.20959564411278928
epoch: 31, step: 30301, outputs are [7.854555e+13], total_time_span is 0.56308913230896, step_time_span is 0.56308913230896
epoch: 31, step: 30400, outputs are [8.488058e+13], total_time_span is 20.76317572593689, step_time_span is 0.20972904773673626
epoch: 31, step: 30401, outputs are [1.2049275e+14], total_time_span is 0.5600569248199463, step_time_span is 0.5600569248199463
[ERROR] RUNTIME(63050)model execute error, error code=0x91, [the model stream execute failed].
[ERROR] RUNTIME(63050)model execute task failed, device_id=0, model stream_id=553, model task_id=611, model_id=513, first_task_id=65535
[ERROR] RUNTIME(63050)aicore kernel execute failed, device_id=0, stream_id=575, task_id=31, fault kernel_name=unsorted_segment_sum_d_13348636282719136312_0__kernel0, func_name=unsorted_segment_sum_d_13348636282719136312_0__kernel0
Traceback (most recent call last):
File "run_pretrain.py", line 207, in <module>
run_pretrain()
File "run_pretrain.py", line 202, in run_pretrain
sink_size=args_opt.data_sink_steps)
File "/home/xupx/wuzw/workspace/bert_thor/scripts/LOG/src/model_thor.py", line 618, in train
sink_size=sink_size)
File "/home/xupx/wuzw/workspace/bert_thor/scripts/LOG/src/model_thor.py", line 409, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
File "/home/xupx/wuzw/workspace/bert_thor/scripts/LOG/src/model_thor.py", line 490, in _train_dataset_sink_process
outputs = self._train_network(*inputs)
File "/home/xupx/wuzw/software/archiconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 280, in __call__
out = self.compile_and_run(*inputs)
File "/home/xupx/wuzw/software/archiconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 544, in compile_and_run
return _executor(self, *inputs, phase=self.phase)
File "/home/xupx/wuzw/software/archiconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/common/api.py", line 475, in __call__
return self.run(obj, *args, phase=phase)
File "/home/xupx/wuzw/software/archiconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/common/api.py", line 503, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/xupx/wuzw/software/archiconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/common/api.py", line 69, in wrapper
results = fn(*arg, **kwargs)
File "/home/xupx/wuzw/software/archiconda3/envs/mindspore/lib/python3.7/site-packages/mindspore/common/api.py", line 486, in _exec_pip
return self._executor(args_list, phase)
RuntimeError: mindspore/ccsrc/backend/session/ascend_session.cc:572 Execute] run task error!
```
Starting at epoch: 30, step: 30000, the outputs values become unreasonable, and then the run fails with an error. What could be the cause?
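To pin down exactly where the values blow up, the log can be scanned automatically. The following is a minimal sketch that assumes the log line format shown above; the function name `first_divergent_step` and the threshold of 1e4 are hypothetical choices for illustration, not part of Bert-Thor or MindSpore.

```python
import math
import re

# Matches lines such as:
# "epoch: 30, step: 30000, outputs are [3.313356e+13], total_time_span is ..."
LINE_RE = re.compile(r"epoch: (\d+), step: (\d+), outputs are \[([^\]]+)\]")


def first_divergent_step(log_lines, threshold=1e4):
    """Return (epoch, step, value) for the first suspicious output, or None.

    A value counts as suspicious if it is NaN/inf or its magnitude exceeds
    `threshold` (an arbitrary cutoff chosen for this sketch).
    """
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip error messages, tracebacks, "Epoch time" lines
        epoch = int(match.group(1))
        step = int(match.group(2))
        value = float(match.group(3))
        if not math.isfinite(value) or abs(value) > threshold:
            return epoch, step, value
    return None


# A few lines copied from the log above, truncated after the outputs field.
sample = [
    "epoch: 30, step: 29901, outputs are [12.539895], total_time_span is 0.56",
    "epoch: 30, step: 30000, outputs are [3.313356e+13], total_time_span is 20.7",
    "epoch: 31, step: 30001, outputs are [2.1173654e+13], total_time_span is 0.55",
]
print(first_divergent_step(sample))
```

On a real run, the lines would come from the training log file instead of the `sample` list.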
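Additional questions:
1. How can I download the wikimedia 20200101 data? The oldest dump currently available on dump-wiki is 20201001.
2. When running pretrain_eval.py, how should the parameters DATA_FILE and FINETUNE_CKPT be set? With export FINETUNE_CKPT=***?
3. How is the json file referenced by SCHEMA_DIR generated?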
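Answer:
Starting at epoch: 30, step: 30000, the outputs values become unreasonable, and then the run fails with an error. What could be the cause?
Hello, the cause is most likely that the learning_rate schedule defaults to 30000 steps. You can adjust the two hyperparameters learning_rate and damping in lr_generator.py.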
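As an illustration of why a fixed step count matters, here is a minimal sketch of a schedule that is precomputed for a set number of steps. It is not the actual content of lr_generator.py: the polynomial-decay form, the function names `poly_decay_lr` and `lr_at`, and the learning-rate values are all assumptions made for the example. The point is only that a schedule built for 30000 steps has no entry for step 30000 and beyond, so a longer run needs a correspondingly longer schedule.

```python
def poly_decay_lr(lr_init, lr_end, total_steps, power=1.0):
    """Build one learning rate per step, decaying from lr_init toward lr_end."""
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    rates = []
    for step in range(total_steps):
        remaining = 1.0 - step / total_steps  # 1.0 at step 0, near 0 at the end
        rates.append((lr_init - lr_end) * remaining ** power + lr_end)
    return rates


def lr_at(schedule, step):
    """Look up the rate for a 0-based step, failing loudly past the end."""
    if step < 0 or step >= len(schedule):
        raise IndexError(
            f"step {step} is outside the schedule of {len(schedule)} steps"
        )
    return schedule[step]


# Hypothetical values; the real defaults live in the model's own scripts.
schedule = poly_decay_lr(lr_init=1e-3, lr_end=1e-5, total_steps=30000)
print(len(schedule), lr_at(schedule, 0), lr_at(schedule, 29999))

# Extending training means regenerating the schedule with more steps.
longer = poly_decay_lr(lr_init=1e-3, lr_end=1e-5, total_steps=60000)
print(lr_at(longer, 30000))
```

Under this assumption, the fix is to make the number of steps covered by the learning_rate and damping schedules match the number of steps actually trained.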
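Additional questions:
1. How can I download the wikimedia 20200101 data? The oldest dump currently available on dump-wiki is 20201001.
A: dump-wiki is indeed updated continuously. You can download the latest wikimedia corpus for your experiments.
2. When running pretrain_eval.py, how should the parameters DATA_FILE and FINETUNE_CKPT be set? With export FINETUNE_CKPT=***?
A: You can set the parameters DATA_FILE and FINETUNE_CKPT in the script bert_thor/src/evaluation_config.py.
3. How is the json file referenced by SCHEMA_DIR generated?
A: You can ignore this parameter and simply pass "" for it.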