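While training a model with PyTorch, I ran into a strange error.
The message comes from THCTensorScatterGather.cu, so at first glance the failure is in a scatter or gather operation; the kernel name, THCudaTensor_scatterFillKernel, points to a scatter with a scalar fill value. In other words, an index is out of bounds, as the asserted condition shows: indexValue >= 0 && indexValue < tensor.sizes[dim]
The full error output is as follows: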
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [1,0,0], thread: [32,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [1,0,0], thread: [33,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [1,0,0], thread: [34,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
/opt/conda/conda-
…
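The asserted condition can be reproduced in isolation. The following is a minimal sketch, assuming a small made-up tensor on the CPU, where the same violation is expected to surface as an ordinary RuntimeError rather than a device-side assert. The tensor shapes and values are illustrative and do not come from the original training code.

```python
import torch

x = torch.zeros(2, 4)

# Every index must satisfy 0 <= index < x.size(dim); here dim=1 and x.size(1) == 4.
good = torch.tensor([[0], [3]])
bad = torch.tensor([[0], [4]])  # 4 == x.size(1), so the condition is violated

y = x.scatter(1, good, 1.0)  # in range: succeeds

try:
    x.scatter(1, bad, 1.0)
    raised = False
except RuntimeError as err:
    raised = True
    print("scatter rejected the index:", err)
```

On the GPU the same check runs inside the kernel, which is why it shows up as the assertion lines in the log above instead of a Python exception at the call site.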
I also searched online for this error but found no detailed explanation, so I still could not understand why my code triggers it.
Here is my scatter code:
idx = x.topk(mask, dim=1)[1]
y = x.scatter(1, idx, 1e-6)
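Clearly, I use topk to take the indices from tensor x, and then scatter into x with those same indices. Logically, these indices can never go out of bounds, so under normal circumstances this error should not occur.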
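That reasoning can be checked on well-behaved input. The sketch below assumes a random, finite x of shape (4, 36, 1) and k = 3; these shapes are guesses inferred from the printed indices and the `Dims = 3` in the log, not values taken from the original code. It runs on the CPU.

```python
import torch

torch.manual_seed(0)

# Assumed shapes: batch of 4, 36 candidates along dim=1, trailing dim of size 1.
x = torch.randn(4, 36, 1)
k = 3  # stands in for `mask` in the original snippet

# topk returns (values, indices); element [1] is the index tensor, shape (4, 3, 1).
idx = x.topk(k, dim=1)[1]

# For finite input, every index lies in [0, x.size(1)).
in_range = bool((idx >= 0).all() and (idx < x.size(1)).all())
print("indices in range:", in_range)

# Out-of-place scatter: fill the top-k positions with a small constant.
y = x.scatter(1, idx, 1e-6)
```

With input like this the indices stay in range and scatter succeeds, which matches the expectation stated above.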
So I fixed the seed, trained again, and printed the indices returned by topk, shown below.
This is a normal index tensor:
tensor([[[35], [30], [21]], [[19], [21], [26]], [[ 2], [33], [ 0]], ..., [[35], [27], [26]], [[23], [15], [22]], [[ 0], [33], [13]]], device='cuda:0')
Then, suddenly, an out-of-bounds index tensor appears, followed by the error:
tensor([[[3615207938365080325], [4248763550642949534], [3615207938384285133]], [[3615207938372244161], [3997259572840221815], [3615207938369952754]], [[4172588079432698792], [3615207938209860727], [4265188145389882487]], ..., [[9223372034707292159], [9223372034707292159], [9223372034707292159]], [[9223372034707292159], [9223372034707292159], [9223372034707292159]], [[9223372034707292159], [9223372034707292159], [9223372034707292159]]], device='cuda:0')
Traceback (most recent call last):
  File "main.py", line 117, in <module>
    run_margin(model, train_loader, optimizer, tracker, train=True, prefix='train', epoch=epoch)
  File "/home/vqa/bottom-up-attention-vqa/train/train_margin.py", line 106, in run_margin
    total_loss.backward()
  File "/home/share/anaconda3/envs/py3_torch_v1.4/lib/python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/share/anaconda3/envs/py3_torch_v1.4/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
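The post does not establish why topk returned these values. One way to narrow it down is to validate both the input and the indices in Python before calling scatter, so that the failure is raised at the offending line instead of later inside backward(). The sketch below is an assumption-laden debugging aid, not the author's fix: `checked_scatter` is a hypothetical helper, and the check for NaN or Inf in x is only one hypothesis worth ruling out. The shapes are again inferred, and the example runs on the CPU.

```python
import torch

def checked_scatter(x, idx, value, dim=1):
    """Hypothetical helper: validate x and idx, then fill x at idx along dim."""
    # Hypothesis to rule out (not confirmed by the post): non-finite values in x.
    if not torch.isfinite(x).all():
        raise ValueError("x contains NaN or Inf")
    # .item() copies the result to the host, so a bad index is reported here.
    lo, hi = idx.min().item(), idx.max().item()
    if lo < 0 or hi >= x.size(dim):
        raise IndexError(f"index range [{lo}, {hi}] is outside [0, {x.size(dim)})")
    return x.scatter(dim, idx, value)

torch.manual_seed(0)
x = torch.randn(4, 36, 1)          # assumed shape
idx = x.topk(3, dim=1)[1]
y = checked_scatter(x, idx, 1e-6)  # valid indices pass through unchanged

bad_idx = idx.clone()
bad_idx[0, 0, 0] = 9223372034707292159  # one of the values from the printed tensor
try:
    checked_scatter(x, bad_idx, 1e-6)
except IndexError as err:
    message = str(err)
    print(message)
```

Because CUDA kernels are launched asynchronously, the Python traceback can point at a later call such as total_loss.backward(). Running with the environment variable CUDA_LAUNCH_BLOCKING=1 is another way to make the traceback point at the call that actually failed.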