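Processing the new version of the Waymo dataset was already laborious, and on top of that the eval results and the training loss were consistently poor. I first assumed the model was at fault, but later found that the environment had serious pitfalls too. After fixing that, an error started appearing at the end of evaluate: the results were all correct, yet an error was raised anyway. The whole chain of problems took 20 days to sort out, with almost no rest in between; it was exhausting.
Earlier, when setting up mmdet3d, I used the latest release, mmdet3d v1.0.0rc2, and as a result the eval and train results on the nuScenes dataset were wrong even with the official config and model. Switching to the versions from a classmate's environment fixed that, but testing on Waymo then raised an error. After a long bug hunt, I found that the newer CUDA version, the older torch version, and tensorflow conflict with one another to some degree, so problems appear when they use the GPU together.
I'll file an issue when I have time.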
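One way to probe the "sharing the GPU" hypothesis is to run the TensorFlow-side work in a child process that cannot see any CUDA device. The sketch below is only a diagnostic idea, not a fix confirmed by this post: the helper name `run_with_gpus_hidden` is hypothetical, and it assumes the TensorFlow-dependent step can be moved into a separate Python snippet or script.

```python
import os
import subprocess
import sys


def run_with_gpus_hidden(code: str) -> str:
    """Run a Python snippet in a child process with all CUDA devices hidden.

    Hypothetical helper for illustration; it is not part of mmdet3d or
    waymo-open-dataset.
    """
    env = dict(os.environ)
    # An empty CUDA_VISIBLE_DEVICES hides every GPU from CUDA-based libraries
    # started in the child process; the parent process is unaffected.
    env["CUDA_VISIBLE_DEVICES"] = ""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Show what the child process sees for CUDA_VISIBLE_DEVICES.
    print(run_with_gpus_hidden(
        "import os; print(repr(os.environ['CUDA_VISIBLE_DEVICES']))"
    ))
```

If the error at exit disappears once TensorFlow no longer touches the GPU, that would support the conflict explanation; if it persists, the cause is probably elsewhere.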
Environment:
mmcv-full 1.4.0
mmdet 2.19.1
mmdet3d 0.17.3
tensorflow 2.6.0
torch 1.10.2+cu113
waymo-open-dataset-tf-2-6-0 1.4.7
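Since the problems came from version mismatches, it helps to dump the installed versions before comparing two environments. A minimal sketch using only the standard library (Python 3.8+); the function name `installed_versions` is made up for illustration, and the package names are taken from the list above:

```python
from importlib import metadata

# Distribution names as listed in the environment above.
PACKAGES = [
    "mmcv-full",
    "mmdet",
    "mmdet3d",
    "tensorflow",
    "torch",
    "waymo-open-dataset-tf-2-6-0",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:32s} {version or 'not installed'}")
```

Running this in both the working and the broken environment and diffing the output makes a mismatched package easy to spot.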
The problem occurs with bash tools/dist_train.sh or bash tools/dist_test.sh.
Once waymo_dataset.evaluate has been called, the program always reports the following error when it exits:
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: unspecified launch failure
Exception raised from create_event_internal at …/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f27dd18ed62 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1c5f3 (0x7f282083b5f3 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f282083c002 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f27dd178314 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x29eb89 (0x7f28a3b62b89 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xadfbe1 (0x7f28a43a3be1 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f28a43a3ee2 in /home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #61: PyRun_SimpleFileExFlags + 0x1bf (0x56231eaba54f in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #62: Py_RunMain + 0x3a9 (0x56231eabaa29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
frame #63: Py_BytesMain + 0x39 (0x56231eabac29 in /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57875 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 57876 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 57874) of binary: /home/zhengliangtao/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhengliangtao/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
    main()
  File "/home/zhengliangta