
mmdet3d + waymo: pitfalls and a workflow for verifying environment correctness

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Processing the new version of the waymo dataset was already a slog, and on top of that the eval results and training loss kept coming out badly. At first I assumed it was just a model problem, but it turned out the environment had a major pitfall as well. After fixing that, evaluate started throwing an error at the end of every run, even though the results themselves were correct. The whole series of issues took 20 days of nearly nonstop work to sort out; exhausting.

When setting up mmdet3d earlier, I had used the latest mmdet3d v1.0.0rc2, and with the official configs and models the eval and train results on nuscenes were both wrong; things only worked after switching to the package versions from a classmate's environment. At that point, however, testing on waymo started to fail. After a long bug hunt, I found that the new CUDA version conflicts to some degree with the older torch and tensorflow, so problems arise when they use the GPU together in the same process.

Reproducing the waymo evaluate error

I'll file an issue when I find time.
Environment:

mmcv-full                 1.4.0            
mmdet                     2.19.1     
mmdet3d                   0.17.3 
tensorflow                2.6.0
torch                     1.10.2+cu113
waymo-open-dataset-tf-2-6-0 1.4.7
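Since version mismatches caused most of the pain above, this is the kind of check I would now run before blaming the model. A minimal sketch (the helper name and the pin dict are my own, not part of mmdet3d; the pins mirror the environment list above):

```python
# Hypothetical helper: compare installed package versions against known-good pins.
from importlib.metadata import version, PackageNotFoundError

KNOWN_GOOD = {
    "mmcv-full": "1.4.0",
    "mmdet": "2.19.1",
    "mmdet3d": "0.17.3",
    "tensorflow": "2.6.0",
    "torch": "1.10.2+cu113",
}

def check_pins(pins):
    """Return {package: (installed, expected)} for every mismatch or missing package."""
    mismatches = {}
    for pkg, expected in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            installed = None  # package not installed at all
        if installed != expected:
            mismatches[pkg] = (installed, expected)
    return mismatches

if __name__ == "__main__":
    for pkg, (got, want) in check_pins(KNOWN_GOOD).items():
        print(f"{pkg}: installed={got}, expected={want}")
```

An exact-match check is deliberately strict here; with the cuda/torch/tensorflow interplay described above, even "compatible" version drift was enough to break things.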

The problem appears with bash tools/dist_train.sh or bash tools/dist_test.sh.
Once waymo_dataset.evaluate has been called, the program reports the following error on exit:
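For context, the launches follow mmdet3d's standard distributed scripts; a sketch of the invocations (config and checkpoint paths are placeholders for whatever waymo model you are running):

```shell
# Distributed training: config file + number of GPUs
bash tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM}

# Distributed testing with waymo-protocol evaluation;
# the --eval waymo step is what triggers waymo_dataset.evaluate
bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval waymo
```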

terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: unspecified launch failure
Exception raised from create_event_internal at …/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f27dd18ed62 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c5f3 (0x7f282083b5f3 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f282083c002 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f27dd178314 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29eb89 (0x7f28a3b62b89 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xadfbe1 (0x7f28a43a3be1 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f28a43a3ee2 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #61: PyRun_SimpleFileExFlags + 0x1bf (0x56231eaba54f in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #62: Py_RunMain + 0x3a9 (0x56231eabaa29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #63: Py_BytesMain + 0x39 (0x56231eabac29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57875 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57876 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 57874) of binary: /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/zhengliangta
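The crash happens when torch tears down its CUDA context at exit after tensorflow has also touched the GPU in the same process. One direction that fits this diagnosis (my own sketch, not mmdet3d's actual fix) is to run the TensorFlow-based metric computation in a child process, so the two CUDA contexts never share a lifetime:

```python
# Sketch (an assumption, not mmdet3d's built-in behavior): run an evaluation
# callable in a child process so any CUDA state it creates is torn down
# independently of the parent process.
import multiprocessing as mp

def evaluate_isolated(eval_fn, *args):
    """Run eval_fn(*args) in a child process and return its result."""
    ctx = mp.get_context("fork")  # "spawn" is safer once the parent has touched CUDA
    queue = ctx.Queue()

    def _worker():
        queue.put(eval_fn(*args))

    proc = ctx.Process(target=_worker)
    proc.start()
    result = queue.get()  # read before join() to avoid blocking on a full pipe
    proc.join()
    return result
```

Whether this fully avoids the c10::CUDAError depends on which library initializes CUDA first in each process, so treat it as something to try rather than a confirmed fix.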
