
Running out of GPU memory when training a deep learning model: what can you do?


Author: 游客26024@知乎 (republished with permission)

Source: https://www.zhihu.com/question/461811359/answer/2492822726

Editor: 极市平台

极市 Editor's Note

This post uses AlexNet as the network architecture (it takes 227x227x3 input images), CIFAR10 as the dataset, AdamW as the optimizer, and ReduceLROnPlateau as the learning-rate schedule. Its goal is to show how to speed up training of a network model, not to explain the underlying theory.

As an aside, why write this post at all? Because I'm broke! Renting a multi-GPU server burns through money in no time (often with GPU memory sitting idle), so I badly needed a trick to cut memory use and speed training up.

Back to the topic: with a large dataset and a deep network, training gets slow. To speed it up we can use PyTorch's AMP (autocast and GradScaler). That is exactly what this post is about: accelerating model training with PyTorch's automatic mixed precision, comparing autocast alone against autocast plus GradScaler.

Note: PyTorch 1.6+ already ships torch.cuda.amp built in, so NVIDIA's apex library (half-precision acceleration) is no longer required. For convenience we skip apex (it is a hassle to install) and use torch.cuda.amp instead.
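
A quick sanity check that the built-in AMP is available in your environment (my own snippet, not from the original post):

    import torch

    print(torch.__version__)                    # needs to be 1.6 or newer
    print(hasattr(torch.cuda.amp, 'autocast'))  # True when the built-in AMP is present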

AMP (Automatic Mixed Precision): so what exactly is automatic mixed precision?

A little history first: NVIDIA's apex came first; NVIDIA developers then contributed it upstream, and in PyTorch 1.6+ it became torch.cuda.amp. [This is my own reconstruction of events and may be inaccurate; corrections welcome in the comments.]

In more detail: by default, most deep learning frameworks train in 32-bit floating point. In 2017, NVIDIA developed a mixed-precision training method (apex) that combines single precision (FP32) with half precision (FP16) during training; with the same hyperparameters it achieves nearly the same accuracy as pure FP32, and runs considerably faster.

Then came the AMP era (meaning torch.cuda.amp specifically), with two keywords: automatic and mixed precision (torch.cuda.amp in PyTorch 1.6+). Automatic means tensor dtypes change automatically: the framework adjusts each tensor's dtype as needed (though some places may still require manual intervention). Mixed precision means tensors of more than one precision are in play: torch.FloatTensor and torch.HalfTensor. And as the name torch.cuda.amp says, the feature only works on CUDA.
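
A minimal sketch of the "automatic" part (my own illustration; it assumes a CUDA device is available): inside an autocast() region, matmul-like ops run as torch.HalfTensor (FP16) even though their inputs are FP32, while ops outside the region stay in FP32.

    import torch
    from torch.cuda.amp import autocast

    a = torch.randn(8, 8, device='cuda')  # torch.float32
    b = torch.randn(8, 8, device='cuda')  # torch.float32
    with autocast():
        c = a @ b                          # matmul is autocast to half precision
    print(c.dtype)                         # torch.float16
    print((a + b).dtype)                   # torch.float32 outside the region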

Why use AMP (automatic mixed precision)?

1. Lower memory use (an FP16 advantage).

2. Faster training and inference (an FP16 advantage).

3. Tensor Cores are now widespread (NVIDIA Tensor Cores are built for low-precision math, again favoring FP16).

4. Mixed precision mitigates the rounding-error problem (a weakness of FP16 that FP32 avoids; mixing the two keeps the error-sensitive parts in FP32).

5. Loss scaling: even with mixed precision, training can fail to converge because small activation gradients underflow in FP16; torch.cuda.amp.GradScaler scales the loss up to prevent this gradient underflow, as the sketch below shows.
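
Point 5 is easy to see numerically; a tiny sketch (my own illustration) of why small gradients underflow in FP16 and why multiplying by a scale factor rescues them:

    import torch

    g = torch.tensor(1e-8)     # a typical tiny activation-gradient magnitude
    print(g.half())            # prints 0.: 1e-8 is below FP16's smallest subnormal (~6e-8)
    print((g * 1024).half())   # roughly 1e-5: after scaling, the value is representable

GradScaler automates exactly this: it multiplies the loss by a scale factor before backward (so all gradients are scaled too), then unscales the gradients before the optimizer step.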

To restate: this post is about how to make training faster, not about the underlying theory. The running example is AlexNet (which takes 227x227x3 input images) on CIFAR10, with AdamW as the optimizer and ReduceLROnPlateau as the learning-rate schedule. The machine is a Legion laptop with an RTX 2060; modest, but good enough for these tests.

The post walks through three setups: 1. training and evaluation without DDP or DP (then with AMP added); 2. DP training and evaluation (then with AMP added); 3. single-process multi-GPU DDP training and evaluation (then with AMP added).

Project layout when running these scripts:

    D:/PycharmProject/Simple-CV-Pytorch-master
    |
    |----AMP (train_without.py, train_DP.py, train_autocast.py, train_GradScaler.py,
    |         eval_XXX.py and so on; the alexnet.py added later also lives here)
    |
    |----tensorboard (folder for the tensorboard logs)
    |
    |----checkpoint (folder for the saved models)
    |
    |----data (folder for the dataset)

1. Training and evaluation without DDP or DP

The run without DDP or DP serves as the baseline for our experiments.

(1) Training and evaluation code for the original model:

Training code:

Note: this code is deliberately bare-bones, a rough skeleton; it only needs to get the idea across!

train_without.py

    import time
    import torch
    import torchvision
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision.models import alexnet
    from torchvision import transforms
    from torch.utils.tensorboard import SummaryWriter
    import numpy as np
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='CV Train')
        parser.add_mutually_exclusive_group()
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
        parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
        parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
        parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)

    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))

    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))

    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)

    # 5.Create model
    model = alexnet()
    if args.cuda == cuda:
        model = model.cuda()

    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()

    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)

    # 8.Set some parameters to control the loop
    iter = 0  # global iteration counter
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            outputs = model(imgs)
            loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
            optim.zero_grad()
            loss_train.backward()
            optim.step()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()

Run results:

[screenshots]

TensorBoard view:

[screenshots]

Evaluation code:

This code is particularly crude, especially the device handling and the accuracy computation; it is for reference only, so don't imitate it! (A tidier sketch follows at the end of this subsection.)

eval_without.py

    import torch
    import torchvision
    from torch.utils.data import DataLoader
    from torchvision.transforms import transforms
    from alexnet import alexnet
    import argparse

    # eval
    def parse_args():
        parser = argparse.ArgumentParser(description='CV Evaluation')
        parser.add_mutually_exclusive_group()
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    # 1.Create model
    model = alexnet()

    # 2.Ready Dataset
    if args.dataset == 'CIFAR10':
        test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                    transform=transforms.Compose(
                                                        [transforms.Resize(args.img_size),
                                                         transforms.ToTensor()]),
                                                    download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    # 3.Length
    test_dataset_size = len(test_dataset)
    print("the test dataset size is {}".format(test_dataset_size))

    # 4.DataLoader
    test_dataloader = DataLoader(dataset=test_dataset, batch_size=args.batch_size)

    # 5.Set some parameters for testing the network
    total_accuracy = 0

    # test
    model.eval()
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            device = torch.device('cpu')
            imgs, targets = imgs.to(device), targets.to(device)
            model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
            model.load_state_dict(model_load)
            outputs = model(imgs)
            outputs = outputs.to(device)
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy

    accuracy = total_accuracy / test_dataset_size
    print("the total accuracy is {}".format(accuracy))

Run results:

[screenshot]

Analysis:

The original model took 22m22s to train for 20 epochs and reached an accuracy of 0.8191.
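
As promised above, here is a tidier evaluation sketch (my own rewrite, not the author's code): it loads the checkpoint once instead of once per batch, picks the device automatically, and accumulates correct predictions as a plain int.

    import torch

    def evaluate(model, dataloader, checkpoint_path, dataset_size):
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        model.load_state_dict(torch.load(checkpoint_path, map_location=device))  # load once
        model.to(device)
        model.eval()
        correct = 0
        with torch.no_grad():
            for imgs, targets in dataloader:
                imgs, targets = imgs.to(device), targets.to(device)
                correct += (model(imgs).argmax(1) == targets).sum().item()
        return correct / dataset_size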

(2) Training and evaluation code for the original model with autocast:

Training code:

Rough outline of the training flow:

    from torch.cuda.amp import autocast as autocast

    ...
    # Create model, default torch.FloatTensor
    model = Net().cuda()
    # SGD, Adam, AdamW, ...
    optim = optim.XXX(model.parameters(), ...)
    ...
    for imgs, targets in dataloader:
        imgs, targets = imgs.cuda(), targets.cuda()
        ...
        # the forward pass and the loss go inside the autocast region
        with autocast():
            outputs = model(imgs)
            loss = loss_fn(outputs, targets)
        ...
        # backward and the optimizer step stay outside it
        optim.zero_grad()
        loss.backward()
        optim.step()
        ...

train_autocast_without.py

    import time
    import torch
    import torchvision
    from torch import nn
    from torch.cuda.amp import autocast
    from torchvision import transforms
    from torchvision.models import alexnet
    from torch.utils.data import DataLoader
    from torch.utils.tensorboard import SummaryWriter
    import numpy as np
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='CV Train')
        parser.add_mutually_exclusive_group()
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
        parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
        parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
        parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)

    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))

    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))

    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)

    # 5.Create model
    model = alexnet()
    if args.cuda == cuda:
        model = model.cuda()

    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()

    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)

    # 8.Set some parameters to control the loop
    iter = 0  # global iteration counter
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():  # forward pass and loss run under mixed precision
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
            optim.zero_grad()
            loss_train.backward()
            optim.step()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()

Run results:

[screenshots]

TensorBoard view:

[screenshots]

Evaluation code:

eval_without.py, the same as in 1.(1)

Run results:

[screenshot]

Analysis:

The original model took 22m22s for 20 epochs; with autocast added it takes 21m21s, so training got faster, and the accuracy also rose from 0.8191 to 0.8403.

(3) Training and evaluation code for the original model with autocast and GradScaler:

torch.cuda.amp.GradScaler scales the loss up so that small gradients do not underflow.

Training code:

Rough outline of the training flow:

    from torch.cuda.amp import autocast as autocast
    from torch.cuda.amp import GradScaler as GradScaler

    ...
    # Create model, default torch.FloatTensor
    model = Net().cuda()
    # SGD, Adam, AdamW, ...
    optim = optim.XXX(model.parameters(), ...)
    scaler = GradScaler()
    ...
    for imgs, targets in dataloader:
        imgs, targets = imgs.cuda(), targets.cuda()
        ...
        optim.zero_grad()
        ...
        with autocast():
            outputs = model(imgs)
            loss = loss_fn(outputs, targets)
        scaler.scale(loss).backward()   # backward on the scaled loss
        scaler.step(optim)              # unscales gradients, skips the step on inf/NaN
        scaler.update()                 # adjusts the scale factor for the next iteration
        ...

train_GradScaler_without.py

    import time
    import torch
    import torchvision
    from torch import nn
    from torch.cuda.amp import autocast, GradScaler
    from torchvision import transforms
    from torchvision.models import alexnet
    from torch.utils.data import DataLoader
    from torch.utils.tensorboard import SummaryWriter
    import numpy as np
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='CV Train')
        parser.add_mutually_exclusive_group()
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
        parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
        parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
        parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)

    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))

    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))

    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)

    # 5.Create model
    model = alexnet()
    if args.cuda == cuda:
        model = model.cuda()

    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()

    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
    scaler = GradScaler()

    # 8.Set some parameters to control the loop
    iter = 0  # global iteration counter
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            optim.zero_grad()
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
            scaler.scale(loss_train).backward()
            scaler.step(optim)
            scaler.update()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()

Run results:

[screenshots]

TensorBoard view:

[screenshots]

Evaluation code:

eval_without.py, the same as in 1.(1)

Run results:

[screenshot]

Analysis:

Why did 20 epochs now take 27m27s, more than even the original model without any AMP (22m22s)?

Part of it is the extra per-step work GradScaler adds around the loss scaling, and another likely reason is that my GPU is simply too small for the mixed-precision speedup to outweigh that overhead.
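
As an aside on measuring speed: time.time() around CUDA work can mislead because kernels execute asynchronously. A minimal sketch of safer timing (my own addition, not from the original code):

    import time
    import torch

    torch.cuda.synchronize()        # wait for pending kernels before reading the clock
    t_start = time.time()
    # ... run one training epoch here ...
    torch.cuda.synchronize()        # wait again so the epoch's kernels are included
    print("epoch took {:.1f}s".format(time.time() - t_start))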

2. DP (DataParallel) training and evaluation

(1) Training and evaluation code for the original model with DP:

Training code:

train_DP.py

    import time
    import torch
    import torchvision
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision.models import alexnet
    from torchvision import transforms
    from torch.utils.tensorboard import SummaryWriter
    import numpy as np
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='CV Train')
        parser.add_mutually_exclusive_group()
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
        parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
        parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
        parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)

    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))

    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))

    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)

    # 5.Create model
    model = alexnet()
    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.DataParallel(model).cuda()
    else:
        model = torch.nn.DataParallel(model)

    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()

    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)

    # 8.Set some parameters to control the loop
    iter = 0  # global iteration counter
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            outputs = model(imgs)
            loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
            optim.zero_grad()
            loss_train.backward()
            optim.step()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()

Run results:

[screenshots]

TensorBoard view:

[screenshots]

Evaluation code:

eval_DP.py

    import torch
    import torchvision
    from torch.utils.data import DataLoader
    from torchvision.transforms import transforms
    from alexnet import alexnet
    import argparse

    # eval
    def parse_args():
        parser = argparse.ArgumentParser(description='CV Evaluation')
        parser.add_mutually_exclusive_group()
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    # 1.Create model
    model = alexnet()
    model = torch.nn.DataParallel(model)

    # 2.Ready Dataset
    if args.dataset == 'CIFAR10':
        test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                    transform=transforms.Compose(
                                                        [transforms.Resize(args.img_size),
                                                         transforms.ToTensor()]),
                                                    download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    # 3.Length
    test_dataset_size = len(test_dataset)
    print("the test dataset size is {}".format(test_dataset_size))

    # 4.DataLoader
    test_dataloader = DataLoader(dataset=test_dataset, batch_size=args.batch_size)

    # 5.Set some parameters for testing the network
    total_accuracy = 0

    # test
    model.eval()
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            device = torch.device('cpu')
            imgs, targets = imgs.to(device), targets.to(device)
            model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
            model.load_state_dict(model_load)
            outputs = model(imgs)
            outputs = outputs.to(device)
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy

    accuracy = total_accuracy / test_dataset_size
    print("the total accuracy is {}".format(accuracy))

Run results:

[screenshot]

(2) DP training and evaluation with autocast:

Training code:

If you write the code like this, autocast will have no effect!!! (torch.nn.DataParallel runs each replica's forward pass in its own thread, and autocast state is thread-local, so an autocast region opened in the main thread does not cover the replicas.)

    ...
    model = Model()
    model = torch.nn.DataParallel(model)
    ...
    with autocast():
        output = model(imgs)
        loss = loss_fn(output)

The correct way, in rough outline, is either of the following:

    # Option 1: decorate forward with @autocast()
    class Model(nn.Module):
        @autocast()
        def forward(self, input):
            ...

    # Option 2: open the autocast region inside forward
    class Model(nn.Module):
        def forward(self, input):
            with autocast():
                ...

Either option works; after that:

    ...
    model = Model()
    model = torch.nn.DataParallel(model)
    with autocast():
        output = model(imgs)
        loss = loss_fn(output)
    ...

Model:

You must either add @autocast() above the forward function or open a with autocast(): block at the very top of forward:

alexnet.py

    import torch
    import torch.nn as nn
    from torchvision.models.utils import load_state_dict_from_url
    from torch.cuda.amp import autocast
    from typing import Any

    __all__ = ['AlexNet', 'alexnet']

    model_urls = {
        'alexnet': 'https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth',
    }

    class AlexNet(nn.Module):
        def __init__(self, num_classes: int = 1000) -> None:
            super(AlexNet, self).__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(64, 192, kernel_size=5, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(192, 384, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(384, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
            )
            self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
            self.classifier = nn.Sequential(
                nn.Dropout(),
                nn.Linear(256 * 6 * 6, 4096),
                nn.ReLU(inplace=True),
                nn.Dropout(),
                nn.Linear(4096, 4096),
                nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),
            )

        @autocast()  # each DataParallel replica now enables autocast in its own thread
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.features(x)
            x = self.avgpool(x)
            x = torch.flatten(x, 1)
            x = self.classifier(x)
            return x

    def alexnet(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> AlexNet:
        r"""AlexNet model architecture from the
        `"One weird trick..." <https://arxiv.org/abs/1404.5997>`_ paper.

        Args:
            pretrained (bool): If True, returns a model pre-trained on ImageNet
            progress (bool): If True, displays a progress bar of the download to stderr
        """
        model = AlexNet(**kwargs)
        if pretrained:
            state_dict = load_state_dict_from_url(model_urls["alexnet"], progress=progress)
            model.load_state_dict(state_dict)
        return model

train_DP_autocast.py, importing the alexnet.py above

    import time
    import torch
    from alexnet import alexnet
    import torchvision
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import transforms
    from torch.cuda.amp import autocast as autocast
    from torch.utils.tensorboard import SummaryWriter
    import numpy as np
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='CV Train')
        parser.add_mutually_exclusive_group()
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
        parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
        parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
        parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)

    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))

    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))

    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)

    # 5.Create model
    model = alexnet()
    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.DataParallel(model).cuda()
    else:
        model = torch.nn.DataParallel(model)

    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()

    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)

    # 8.Set some parameters to control the loop
    iter = 0  # global iteration counter
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
            optim.zero_grad()
            loss_train.backward()
            optim.step()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()

Run results:

[screenshots]

TensorBoard view:

[screenshots]

Evaluation code:

eval_DP.py, the same as in 2.(1) except that it imports the alexnet.py above

Run results:

[screenshot]

Analysis:

DP with autocast finishes 20 epochs in 21m21s, 1m1s faster than DP without it (22m22s).

DP without AMP reached an accuracy of 0.8216, which drops to 0.8188 here, so mixed-precision acceleration does cost some accuracy. One could later increase batch_size so that the runtime matches the original while the accuracy rises, to offset this.

(3) DP training and evaluation with autocast and GradScaler:

Training code:

train_DP_GradScaler.py, importing the alexnet.py above

    import time
    import torch
    from alexnet import alexnet
    import torchvision
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import transforms
    from torch.cuda.amp import autocast as autocast
    from torch.cuda.amp import GradScaler as GradScaler
    from torch.utils.tensorboard import SummaryWriter
    import numpy as np
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='CV Train')
        parser.add_mutually_exclusive_group()
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
        parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
        parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
        parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)

    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")

    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))

    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))

    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)

    # 5.Create model
    model = alexnet()
    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.DataParallel(model).cuda()
    else:
        model = torch.nn.DataParallel(model)

    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()

    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
    scaler = GradScaler()

    # 8.Set some parameters to control the loop
    iter = 0  # global iteration counter
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            optim.zero_grad()
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
            scaler.scale(loss_train).backward()
            scaler.step(optim)
            scaler.update()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))

    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()

Run results:

[screenshots]

TensorBoard view:

[screenshots]

Evaluation code:

eval_DP.py, the same as in 2.(1) except that it imports the alexnet.py above

Run results:

[screenshot]

Analysis:

As before, GradScaler's loss scaling slows DP training down.

With autocast plus GradScaler, DP reaches an accuracy of 0.8409, up from 0.8188 with autocast alone, and also clearly above DP without AMP (0.8216).

3. Single-process multi-GPU DDP training and evaluation

(1) Training and evaluation code for the original model with DDP:

Training code:

train_DDP.py

    import time
    import torch
    from torchvision.models.alexnet import alexnet
    import torchvision
    from torch import nn
    import torch.distributed as dist
    from torchvision import transforms
    from torch.utils.data import DataLoader
    from torch.utils.tensorboard import SummaryWriter
    import numpy as np
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='CV Train')
        parser.add_mutually_exclusive_group()
        parser.add_argument("--rank", type=int, default=0)
        parser.add_argument("--world_size", type=int, default=1)
        parser.add_argument("--master_addr", type=str, default="127.0.0.1")
        parser.add_argument("--master_port", type=str, default="12355")
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
        parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
        parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
        parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    def train():
        dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                                rank=args.rank,
                                world_size=args.world_size)
        # 1.Create SummaryWriter
        if args.tensorboard:
            writer = SummaryWriter(args.tensorboard_log)
        # 2.Ready dataset
        if args.dataset == 'CIFAR10':
            train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True,
                                                         transform=transforms.Compose(
                                                             [transforms.Resize(args.img_size),
                                                              transforms.ToTensor()]),
                                                         download=True)
        else:
            raise ValueError("Dataset is not CIFAR10")
        cuda = torch.cuda.is_available()
        print('CUDA available: {}'.format(cuda))
        # 3.Length
        train_dataset_size = len(train_dataset)
        print("the train dataset size is {}".format(train_dataset_size))
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
        # 4.DataLoader
        train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size, sampler=train_sampler,
                                      num_workers=2,
                                      pin_memory=True)
        # 5.Create model
        model = alexnet()
        if args.cuda == cuda:
            model = model.cuda()
            model = torch.nn.parallel.DistributedDataParallel(model).cuda()
        else:
            model = torch.nn.parallel.DistributedDataParallel(model)
        # 6.Create loss
        cross_entropy_loss = nn.CrossEntropyLoss()
        # 7.Optimizer
        optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
        # 8.Set some parameters to control the loop
        iter = 0  # global iteration counter
        t0 = time.time()
        for epoch in range(args.epochs):
            t1 = time.time()
            print(" -----------------the {} number of training epoch --------------".format(epoch))
            model.train()
            for data in train_dataloader:
                loss = 0
                imgs, targets = data
                if args.cuda == cuda:
                    cross_entropy_loss = cross_entropy_loss.cuda()
                    imgs, targets = imgs.cuda(), targets.cuda()
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
                loss = loss_train.item() + loss
                if args.tensorboard:
                    writer.add_scalar("train_loss", loss_train.item(), iter)
                optim.zero_grad()
                loss_train.backward()
                optim.step()
                iter = iter + 1
                if iter % 100 == 0:
                    print(
                        "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                        .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                                np.mean(loss)))
            if args.tensorboard:
                writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
            scheduler.step(np.mean(loss))
            t2 = time.time()
            h = (t2 - t1) // 3600
            m = ((t2 - t1) % 3600) // 60
            s = ((t2 - t1) % 3600) % 60
            print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
            if epoch % 1 == 0:
                print("Save state, iter: {} ".format(epoch))
                torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
        torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
        t3 = time.time()
        h_t = (t3 - t0) // 3600
        m_t = ((t3 - t0) % 3600) // 60
        s_t = ((t3 - t0) % 3600) % 60
        print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
        if args.tensorboard:
            writer.close()

    if __name__ == "__main__":
        local_size = torch.cuda.device_count()
        print("local_size: {}".format(local_size))
        train()

Run results:

[screenshots]

TensorBoard view:

[screenshots]

Evaluation code:

eval_DDP.py

    import torch
    import torchvision
    import torch.distributed as dist
    from torch.utils.data import DataLoader
    from torchvision.transforms import transforms
    # from alexnet import alexnet
    from torchvision.models.alexnet import alexnet
    import argparse

    # eval
    def parse_args():
        parser = argparse.ArgumentParser(description='CV Evaluation')
        parser.add_mutually_exclusive_group()
        parser.add_argument("--rank", type=int, default=0)
        parser.add_argument("--world_size", type=int, default=1)
        parser.add_argument("--master_addr", type=str, default="127.0.0.1")
        parser.add_argument("--master_port", type=str, default="12355")
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    def eval():
        dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                                rank=args.rank,
                                world_size=args.world_size)
        # 1.Create model
        model = alexnet()
        model = torch.nn.parallel.DistributedDataParallel(model)
        # 2.Ready Dataset
        if args.dataset == 'CIFAR10':
            test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                        transform=transforms.Compose(
                                                            [transforms.Resize(args.img_size),
                                                             transforms.ToTensor()]),
                                                        download=True)
        else:
            raise ValueError("Dataset is not CIFAR10")
        # 3.Length
        test_dataset_size = len(test_dataset)
        print("the test dataset size is {}".format(test_dataset_size))
        test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
        # 4.DataLoader
        test_dataloader = DataLoader(dataset=test_dataset, sampler=test_sampler, batch_size=args.batch_size,
                                     num_workers=2,
                                     pin_memory=True)
        # 5.Set some parameters for testing the network
        total_accuracy = 0
        # test
        model.eval()
        with torch.no_grad():
            for data in test_dataloader:
                imgs, targets = data
                device = torch.device('cpu')
                imgs, targets = imgs.to(device), targets.to(device)
                model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
                model.load_state_dict(model_load)
                outputs = model(imgs)
                outputs = outputs.to(device)
                accuracy = (outputs.argmax(1) == targets).sum()
                total_accuracy = total_accuracy + accuracy
        accuracy = total_accuracy / test_dataset_size
        print("the total accuracy is {}".format(accuracy))

    if __name__ == "__main__":
        local_size = torch.cuda.device_count()
        print("local_size: {}".format(local_size))
        eval()

Run results:

[screenshot]

(2) DDP training and evaluation with autocast:

Training code:

train_DDP_autocast.py, importing the alexnet.py above

    import time
    import torch
    from alexnet import alexnet
    import torchvision
    from torch import nn
    import torch.distributed as dist
    from torchvision import transforms
    from torch.utils.data import DataLoader
    from torch.cuda.amp import autocast as autocast
    from torch.utils.tensorboard import SummaryWriter
    import numpy as np
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description='CV Train')
        parser.add_mutually_exclusive_group()
        parser.add_argument("--rank", type=int, default=0)
        parser.add_argument("--world_size", type=int, default=1)
        parser.add_argument("--master_addr", type=str, default="127.0.0.1")
        parser.add_argument("--master_port", type=str, default="12355")
        parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
        parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
        parser.add_argument('--img_size', type=int, default=227, help='image size')
        parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
        parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
        parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
        parser.add_argument('--batch_size', type=int, default=64, help='batch size')
        parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
        parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
        parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
        return parser.parse_args()

    args = parse_args()

    def train():
        dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                                rank=args.rank,
                                world_size=args.world_size)
        # 1.Create SummaryWriter
        if args.tensorboard:
            writer = SummaryWriter(args.tensorboard_log)
        # 2.Ready dataset
        if args.dataset == 'CIFAR10':
            train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True,
                                                         transform=transforms.Compose(
                                                             [transforms.Resize(args.img_size),
                                                              transforms.ToTensor()]),
                                                         download=True)
        else:
            raise ValueError("Dataset is not CIFAR10")
        cuda = torch.cuda.is_available()
        print('CUDA available: {}'.format(cuda))
        # 3.Length
        train_dataset_size = len(train_dataset)
        print("the train dataset size is {}".format(train_dataset_size))
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
        # 4.DataLoader
        train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size, sampler=train_sampler,
                                      num_workers=2,
                                      pin_memory=True)
        # 5.Create model
        model = alexnet()
        if args.cuda == cuda:
            model = model.cuda()
            model = torch.nn.parallel.DistributedDataParallel(model).cuda()
        else:
            model = torch.nn.parallel.DistributedDataParallel(model)
        # 6.Create loss
        cross_entropy_loss = nn.CrossEntropyLoss()
        # 7.Optimizer
        optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
        # 8.Set some parameters to control the loop
        iter = 0  # global iteration counter
        t0 = time.time()
        for epoch in range(args.epochs):
            t1 = time.time()
            print(" -----------------the {} number of training epoch --------------".format(epoch))
            model.train()
            for data in train_dataloader:
                loss = 0
                imgs, targets = data
                if args.cuda == cuda:
                    cross_entropy_loss = cross_entropy_loss.cuda()
                    imgs, targets = imgs.cuda(), targets.cuda()
                with autocast():
                    outputs = model(imgs)
                    loss_train = cross_entropy_loss(outputs, targets)
                loss = loss_train.item() + loss
                if args.tensorboard:
                    writer.add_scalar("train_loss", loss_train.item(), iter)
                optim.zero_grad()
                loss_train.backward()
                optim.step()
                iter = iter + 1
                if iter % 100 == 0:
                    print(
                        "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                        .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                                np.mean(loss)))
            if args.tensorboard:
                writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
            scheduler.step(np.mean(loss))
            t2 = time.time()
            h = (t2 - t1) // 3600
            m = ((t2 - t1) % 3600) // 60
            s = ((t2 - t1) % 3600) % 60
            print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
            if epoch % 1 == 0:
                print("Save state, iter: {} ".format(epoch))
                torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
        torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
        t3 = time.time()
        h_t = (t3 - t0) // 3600
        m_t = ((t3 - t0) % 3600) // 60
        s_t = ((t3 - t0) % 3600) % 60
        print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
        if args.tensorboard:
            writer.close()

    if __name__ == "__main__":
        local_size = torch.cuda.device_count()
        print("local_size: {}".format(local_size))
        train()

Run results:

[screenshots]

TensorBoard view:

[screenshots]

Evaluation code:

eval_DDP.py, importing the alexnet.py above

import torch
import torchvision
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchvision.transforms import transforms
from alexnet import alexnet
# from torchvision.models.alexnet import alexnet
import argparse


# eval
def parse_args():
    parser = argparse.ArgumentParser(description='CV Evaluation')
    parser.add_mutually_exclusive_group()
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()


args = parse_args()


def eval():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create model
    model = alexnet()
    model = torch.nn.parallel.DistributedDataParallel(model)
    # 2.Ready Dataset
    if args.dataset == 'CIFAR10':
        test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                    transform=transforms.Compose(
                                                        [transforms.Resize(args.img_size),
                                                         transforms.ToTensor()]),
                                                    download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")
    # 3.Length
    test_dataset_size = len(test_dataset)
    print("the test dataset size is {}".format(test_dataset_size))
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
    # 4.DataLoader
    test_dataloader = DataLoader(dataset=test_dataset, sampler=test_sampler, batch_size=args.batch_size,
                                 num_workers=2,
                                 pin_memory=True)
    # 5. Set some parameters for testing the network
    total_accuracy = 0
    # test: load the checkpoint once, before the evaluation loop
    device = torch.device('cpu')
    model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
    model.load_state_dict(model_load)
    model.eval()
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            imgs, targets = imgs.to(device), targets.to(device)
            outputs = model(imgs)
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy
    accuracy = total_accuracy / test_dataset_size
    print("the total accuracy is {}".format(accuracy))


if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: {}".format(local_size))
    eval()
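
Two notes on this script. First, since world_size defaults to 1 and the process group uses the gloo backend on CPU, it can be launched directly as a single process (python eval_DDP.py) for a quick check. Second, the model is deliberately wrapped in DistributedDataParallel before the weights are loaded: checkpoints saved during DDP training have their keys prefixed with "module.", so they load cleanly into the wrapped model but would fail on a bare alexnet().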

Running results:

[screenshot: console output of the evaluation run]

Analysis:

DDP without AMP took 21 minutes 21 seconds, while DDP with autocast took 20 minutes 20 seconds, so autocast does speed training up.

However, DDP without AMP reached an accuracy of 0.8224, whereas with autocast the accuracy dropped to 0.8162.
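
That drop is the motivation for adding GradScaler. Before the full script, here is a minimal sketch of the training step it uses: the loss is scaled before backward so that small FP16 gradients do not underflow. This is a simplified illustration, assuming model, optim, cross_entropy_loss, and train_dataloader are already set up as in the scripts above; it is not a drop-in replacement.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # maintains a dynamic loss-scale factor

for imgs, targets in train_dataloader:
    imgs, targets = imgs.cuda(), targets.cuda()
    optim.zero_grad()
    with autocast():  # forward pass runs in mixed precision
        outputs = model(imgs)
        loss_train = cross_entropy_loss(outputs, targets)
    scaler.scale(loss_train).backward()  # backward on the scaled loss to avoid FP16 gradient underflow
    scaler.step(optim)  # unscales the gradients, skipping the step if they contain inf/NaN
    scaler.update()  # adjusts the scale factor for the next iteration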

(3) Training and evaluation source code for DDP with autocast plus GradScaler

Training source code:

train_DDP_GradScaler.py (it imports our own alexnet.py):

import time
import torch
from alexnet import alexnet
import torchvision
from torch import nn
import torch.distributed as dist
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()


args = parse_args()


def train():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)
    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")
    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))
    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size, sampler=train_sampler,
                                  num_workers=2,
                                  pin_memory=True)
    # 5.Create model
    model = alexnet()
    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model).cuda()
    else:
        model = torch.nn.parallel.DistributedDataParallel(model)
    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()
    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
    scaler = GradScaler()  # maintains the dynamic loss scale for mixed-precision training
    # 8. Set some parameters to control loop
    # epoch
    iter = 0
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            optim.zero_grad()
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():  # forward pass in mixed precision
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
            scaler.scale(loss_train).backward()  # backward on the scaled loss to avoid FP16 underflow
            scaler.step(optim)  # unscales gradients and skips the step if they contain inf/NaN
            scaler.update()  # adjusts the scale factor for the next iteration
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        # note: loss is reset at every batch, so this is effectively the last batch's loss
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
        if epoch % 1 == 0:
            print("Save state, epoch: {}".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
            # also keep a "latest" copy, which is what the evaluation script loads
            torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) % 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()


if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: {}".format(local_size))
    train()

Running results:

[screenshots: console output of the training run]

Tensorboard view:

[screenshots: the train_loss and lr curves]

Evaluation source code:

eval_DDP.py is the same as in 3.(2) above (it imports our own alexnet.py).

Running results:

[screenshot: console output of the evaluation run]

Analysis:

It runs, and it is noticeably faster than DDP without AMP: 20 minutes 20 seconds versus 21 minutes 21 seconds. DDP without AMP previously reached an accuracy of 0.8224; with autocast plus GradScaler the accuracy reaches 0.8252, an improvement over both earlier runs.
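
One practical addition worth making when checkpointing AMP training (my own suggestion, not part of the scripts above): GradScaler keeps a dynamic scale factor and exposes state_dict()/load_state_dict(), so saving it alongside the model lets a resumed run continue with a consistent scale instead of re-warming it. A minimal sketch, assuming model, optim, scaler, and args exist as in the training script above:

# saving, e.g. at the end of an epoch
checkpoint = {
    "model": model.state_dict(),
    "optim": optim.state_dict(),
    "scaler": scaler.state_dict(),  # preserves the dynamic loss scale
}
torch.save(checkpoint, "{}/AlexNet_amp.pth".format(args.checkpoint))

# resuming
checkpoint = torch.load("{}/AlexNet_amp.pth".format(args.checkpoint))
model.load_state_dict(checkpoint["model"])
optim.load_state_dict(checkpoint["optim"])
scaler.load_state_dict(checkpoint["scaler"])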

References:

1. PyTorch automatic mixed precision (AMP) training: https://blog.csdn.net/ytusdc/article/details/122152244

2. PyTorch distributed training basics (using DDP): https://zhuanlan.zhihu.com/p/358974461
