
Speeding up the DataLoader in PyTorch


Project scenario:

While training a model, GPU utilization is low: the training step has to wait until the num_workers worker processes have finished loading data before it can proceed. In other words, the GPU spends a lot of time sitting idle waiting on the CPU, which drags its utilization down.


Desired solution:

Keep training the model while the next batch of data is being loaded.


Cause analysis:

When per-sample preprocessing takes a long time, the training loop blocks until the DataLoader has the next batch ready. With the default setup each batch is assembled by a single worker process, so one slow batch stalls the GPU even while the other workers are busy preparing later batches.


Solutions:

Option 1. Preprocess the data ahead of time and save it in HDF5 format. This reportedly errors out once num_workers exceeds 2 (typically because an h5py file handle opened in __init__ cannot safely be shared across forked worker processes); a common workaround is sketched after the class below.
import h5py
import torch


class dataset_h5(torch.utils.data.Dataset):
    def __init__(self, in_file):
        super(dataset_h5, self).__init__()
        # Open the HDF5 file once and cache the dataset shape.
        self.file = h5py.File(in_file, 'r')
        self.n_images, self.nx, self.ny = self.file['images'].shape

    def __getitem__(self, index):
        # Read one image slice and hand it to the model as float32.
        image = self.file['images'][index, :, :]
        return image.astype('float32')

    def __len__(self):
        return self.n_images
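
For the multi-worker error mentioned above, a common workaround is to open the HDF5 file lazily inside each worker rather than in __init__, so every worker process ends up with its own handle. The variant below is a minimal sketch under that assumption (the class name is mine, not from the original post):

import h5py
import torch


class dataset_h5_lazy(torch.utils.data.Dataset):
    """Variant that opens the HDF5 file per worker (hypothetical workaround)."""

    def __init__(self, in_file):
        super().__init__()
        self.in_file = in_file
        self.file = None
        # Read the shape once with a temporary handle; do not keep it open here.
        with h5py.File(in_file, 'r') as f:
            self.n_images, self.nx, self.ny = f['images'].shape

    def __getitem__(self, index):
        # Opened on first access inside the worker process, so the handle is
        # never shared across forked workers.
        if self.file is None:
            self.file = h5py.File(self.in_file, 'r')
        return self.file['images'][index, :, :].astype('float32')

    def __len__(self):
        return self.n_images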

Option 2. NVIDIA DALI (https://github.com/NVIDIA/DALI). Here is a description from the PyTorch forum:

It’s got simple-to-use PyTorch integration.
I was running into the same problems with the PyTorch dataloader. On ImageNet, I couldn’t seem to get above about 250 images/sec. On a Google Cloud instance with 12 cores & a V100, I could get just over 2000 images/sec with DALI. However, in cases where the dataloader isn’t the bottleneck, I found that using DALI would impact performance 5-10%. This makes sense I think, as you’re using the GPU for some of the decoding & preprocessing.
Edit: DALI also has a CPU-only mode, meaning no GPU performance hit.

This option tackles how long the data loading itself takes rather than overlapping loading with training, so it is not quite the solution we were hoping for; still, if it improves throughput it is worth considering.
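
For reference, a minimal DALI training-pipeline sketch (my own addition, assuming DALI >= 1.0, GPU decoding, and an ImageFolder-style directory; the path, batch size, and image size are placeholders):

from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator


@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def train_pipeline(data_dir):
    # Read JPEGs on the CPU, decode on the GPU, then resize and normalize.
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels


pipe = train_pipeline("/path/to/train")  # placeholder directory
pipe.build()
loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")

for batch in loader:
    images = batch[0]["data"]    # already a CUDA tensor
    labels = batch[0]["label"]
    # forward / backward as usual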



Option 3. An interesting implementation using torch.cuda.Stream: a prefetcher that copies the next batch to the GPU on a side stream while the current batch is being trained on.

import time

import torch


class data_prefetcher():
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.mean = torch.tensor([0.485 * 255, 0.456 * 255, 0.406 * 255]).cuda().view(1,3,1,1)
        self.std = torch.tensor([0.229 * 255, 0.224 * 255, 0.225 * 255]).cuda().view(1,3,1,1)
        # With Amp, it isn't necessary to manually convert data to half.
        # if args.fp16:
        #     self.mean = self.mean.half()
        #     self.std = self.std.half()
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input = None
            self.next_target = None
            return
        with torch.cuda.stream(self.stream):
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)
            # With Amp, it isn't necessary to manually convert data to half.
            # if args.fp16:
            #     self.next_input = self.next_input.half()
            # else:
            self.next_input = self.next_input.float()
            self.next_input = self.next_input.sub_(self.mean).div_(self.std)
            
    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        input = self.next_input
        target = self.next_target
        self.preload()
        return input, target

train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
        num_workers=args.workers, pin_memory=True, sampler=train_sampler, collate_fn=fast_collate)
        
def train(train_loader, model, criterion, optimizer, epoch):
    # switch to train mode
    model.train()
    end = time.time()

    prefetcher = data_prefetcher(train_loader)
    input, target = prefetcher.next()
    i = 0
    while input is not None:
        i += 1

        # forward pass, loss, backward pass and optimizer step go here

        input, target = prefetcher.next()
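
Note that the snippet passes a fast_collate to the DataLoader without defining it. A sketch along the lines of the NVIDIA Apex ImageNet example (an assumption on my part, not spelled out in the original post): it stacks PIL images into a uint8 NCHW tensor without normalizing, which is why the prefetcher above converts to float and applies the 255-scaled mean/std on the GPU.

import numpy as np
import torch


def fast_collate(batch):
    # Stack (PIL image, label) pairs into a uint8 NCHW tensor plus a label tensor;
    # normalization is deliberately left to the GPU-side prefetcher.
    imgs = [sample[0] for sample in batch]
    targets = torch.tensor([sample[1] for sample in batch], dtype=torch.int64)
    w, h = imgs[0].size
    tensor = torch.zeros((len(imgs), 3, h, w), dtype=torch.uint8)
    for i, img in enumerate(imgs):
        arr = np.asarray(img, dtype=np.uint8)
        if arr.ndim < 3:
            arr = np.expand_dims(arr, axis=-1)   # grayscale -> HWC with one channel
        arr = np.rollaxis(arr, 2)                # HWC -> CHW
        tensor[i] += torch.from_numpy(arr.copy())
    return tensor, targets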

CUDA streams: a CUDA stream is a queue of GPU operations that execute in the order they were added to the stream. A stream can be thought of as one task on the GPU, and different streams can run concurrently. To benefit from this, the device must support device overlap: a GPU with this capability can copy data between host and device while a kernel is executing.
This overlap capability matters because it can noticeably raise GPU utilization. Host memory is usually much larger than GPU memory, so a large dataset cannot be transferred to the GPU in one go and has to be moved in chunks. If the GPU can keep running kernels while each chunk is being transferred, this kind of asynchronous operation, enabled by device overlap, improves overall throughput.
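
To make the overlap concrete, here is a tiny sketch (my own example, with made-up tensor sizes): an asynchronous host-to-device copy is queued on a side stream while a matmul runs on the default stream, and the default stream then waits on the copy before using the data.

import torch

copy_stream = torch.cuda.Stream()

cpu_batch = torch.randn(64, 3, 224, 224).pin_memory()   # pinned memory enables async copies
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(copy_stream):
    gpu_batch = cpu_batch.cuda(non_blocking=True)        # queued on copy_stream

c = a @ b                                                # default stream; may overlap with the copy

torch.cuda.current_stream().wait_stream(copy_stream)     # order the default stream after the copy
gpu_batch.record_stream(torch.cuda.current_stream())     # tell the caching allocator who uses gpu_batch
print(gpu_batch.shape, c.shape)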


References:
https://discuss.pytorch.org/t/how-to-speed-up-the-data-loader/13740/21
https://blog.csdn.net/dcrmg/article/details/55107518
