References: the official documentation for multi_gpu_model, plus Google.
Keras now supports training a network on several GPUs at once, and it is easy to do, but the following line alone will not get you there.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
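If you monitor GPU usage (nvidia-smi -l 1), you will see that although none of the GPUs looks free, only one is actually computing; the others are merely occupied and sit idle. In other words, if your machine has several graphics cards, Keras (through its TensorFlow backend) will by default occupy every GPU it can detect, whether or not you include the line above. That line is meant for the case where you need only one GPU: it hides the other GPUs in the machine from Keras. Suppose you have three cards in total, each with its own index (0, 1, 2). To avoid getting in other people's way you use just one of them, say the card with index 1, so you write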
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
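Monitor GPU usage again (nvidia-smi -l 1) and you will see that only one card is occupied and the rest are free. So this is a common misconception about multi-GPU use in Keras: setting CUDA_VISIBLE_DEVICES only controls which GPUs are visible; it does not make Keras train on several GPUs at once.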
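The single-GPU case above can be sketched as follows. This is a minimal sketch, assuming the variable is set before Keras or TensorFlow is imported, since CUDA reads it when it initializes. The helper name restrict_visible_gpus is hypothetical and not part of Keras.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Hide every GPU except the listed physical indices (hypothetical helper)."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before importing keras / tensorflow.
restrict_visible_gpus([1])
```

Inside the process the visible cards are renumbered from 0, so the physical card 1 selected here shows up as GPU 0 to the framework.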
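Why train on several GPUs at once?
The memory of a single card is too small -> the batch size cannot be set very large, and sometimes even batch_size=1 runs out of memory (OUT OF MEMORY).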
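In my experience running deep networks, a larger batch_size tends to work better: every backward pass that updates the weights sees more samples, so the network does not overfit in a different direction at each iteration (see "Don't Decay the Learning Rate, Increase the Batch Size"). I have also read papers saying it should not be set too large, for reasons I do not know, and I have never had the chance to try it anyway. My suggestion is a batch_size somewhere in the range 64~256, which generally causes no problems.
But as networks keep getting deeper, their GPU memory requirements keep growing too. For many newcomers the biggest problem is not the code itself: their GPU is too weak to run the code they copied from GitHub, so they have to lower batch_size and in the end cannot reproduce the reported results.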
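There are two solutions: one is to buy a single extremely powerful GPU with a huge amount of memory; the other is to buy several ordinary GPUs and use them together.
The first option does not work, because at present even the best NVIDIA cards have at most a bit over ten GB of memory, which still fails once the network gets deep, and a single top-end card is poor value for money. So learning to use several GPUs under Keras is the more reliable choice.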
cite: parallel_model.py
import tensorflow as tf
import keras.backend as K
import keras.layers as KL
import keras.models as KM


class ParallelModel(KM.Model):
    """Subclasses the standard Keras Model and adds multi-GPU support.
    It works by creating a copy of the model on each GPU. Then it slices
    the inputs and sends a slice to each copy of the model, and then
    merges the outputs together and applies the loss on the combined
    outputs.
    """

    def __init__(self, keras_model, gpu_count):
        """Class constructor.
        keras_model: The Keras model to parallelize
        gpu_count: Number of GPUs. Must be > 1
        """
        self.inner_model = keras_model
        self.gpu_count = gpu_count
        merged_outputs = self.make_parallel()
        super(ParallelModel, self).__init__(inputs=self.inner_model.inputs,
                                            outputs=merged_outputs)

    def __getattribute__(self, attrname):
        """Redirect loading and saving methods to the inner model. That's where
        the weights are stored."""
        if 'load' in attrname or 'save' in attrname:
            return getattr(self.inner_model, attrname)
        return super(ParallelModel, self).__getattribute__(attrname)

    def summary(self, *args, **kwargs):
        """Override summary() to display summaries of both, the wrapper
        and inner models."""
        super(ParallelModel, self).summary(*args, **kwargs)
        self.inner_model.summary(*args, **kwargs)

    def make_parallel(self):
        """Creates a new wrapper model that consists of multiple replicas of
        the original model placed on different GPUs.
        """
        # Slice inputs. Slice inputs on the CPU to avoid sending a copy
        # of the full inputs to all GPUs. Saves on bandwidth and memory.
        input_slices = {name: tf.split(x, self.gpu_count)
                        for name, x in zip(self.inner_model.input_names,
                                           self.inner_model.inputs)}

        output_names = self.inner_model.output_names
        outputs_all = []
        for i in range(len(self.inner_model.outputs)):
            outputs_all.append([])

        # Run the model call() on each GPU to place the ops there
        for i in range(self.gpu_count):
            with tf.device('/gpu:%d' % i):
                with tf.name_scope('tower_%d' % i):
                    # Run a slice of inputs through this replica
                    zipped_inputs = zip(self.inner_model.input_names,
                                        self.inner_model.inputs)
                    inputs = [
                        KL.Lambda(lambda s: input_slices[name][i],
                                  output_shape=lambda s: (None,) + s[1:])(tensor)
                        for name, tensor in zipped_inputs]
                    # Create the model replica and get the outputs
                    outputs = self.inner_model(inputs)
                    if not isinstance(outputs, list):
                        outputs = [outputs]
                    # Save the outputs for merging back together later
                    for l, o in enumerate(outputs):
                        outputs_all[l].append(o)

        # Merge outputs on CPU
        with tf.device('/cpu:0'):
            merged = []
            for outputs, name in zip(outputs_all, output_names):
                # If outputs are numbers without dimensions, add a batch dim.
                def add_dim(tensor):
                    """Add a dimension to tensors that don't have any."""
                    if K.int_shape(tensor) == ():
                        return KL.Lambda(lambda t: K.reshape(t, [1, 1]))(tensor)
                    return tensor
                outputs = list(map(add_dim, outputs))

                # Concatenate
                merged.append(KL.Concatenate(axis=0, name=name)(outputs))
        return merged
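The class docstring describes a slice, replicate, merge pattern. Stripped of TensorFlow, the idea can be sketched in plain Python as below. This is only an illustration, not the class's actual code path: the names split_evenly, run_data_parallel and toy_model are hypothetical, and lists stand in for tensors.

```python
def split_evenly(batch, n_parts):
    """Split a list into n_parts equal slices (stand-in for tf.split with an integer count)."""
    if len(batch) % n_parts != 0:
        raise ValueError("batch size must be divisible by the number of replicas")
    size = len(batch) // n_parts
    return [batch[i * size:(i + 1) * size] for i in range(n_parts)]


def run_data_parallel(model_fn, batch, n_replicas):
    """Apply model_fn to each slice, then concatenate the outputs in order."""
    slices = split_evenly(batch, n_replicas)
    # In ParallelModel each of these calls is placed on its own GPU.
    outputs = [model_fn(s) for s in slices]
    merged = []
    for out in outputs:
        merged.extend(out)  # stand-in for Concatenate(axis=0)
    return merged


def toy_model(samples):
    """Hypothetical per-sample model: doubles every input."""
    return [2 * x for x in samples]


batch = [1, 2, 3, 4, 5, 6]
parallel_result = run_data_parallel(toy_model, batch, n_replicas=3)
```

Because every replica shares the same weights and each sample is processed independently, the merged output matches what a single replica would produce for the whole batch.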
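import keras
from keras.optimizers import Adam

# X_train, y_train, X_valid, y_valid, batch_size and nb_epoch are assumed to be defined elsewhere.
GPU_COUNT = 3  # use 3 GPUs at the same time
model = keras.applications.densenet.DenseNet201()  # for example, DenseNet-201
model = ParallelModel(model, GPU_COUNT)
model.compile(optimizer=Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
# batch_size is the per-GPU batch; each GPU receives one slice of the combined batch.
model.fit(X_train, y_train,
          batch_size=batch_size * GPU_COUNT,
          epochs=nb_epoch, verbose=0, shuffle=True,
          validation_data=(X_valid, y_valid))
model.save_weights('/path/to/save/model.h5')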
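One practical detail of batch_size*GPU_COUNT: fit() hands over a smaller final batch when the number of samples is not a multiple of the batch size, and as far as I know tf.split with an integer count needs the batch dimension to divide evenly, so that last batch may fail. A cautious sketch is to trim the data to a whole number of combined batches. The helper names and the numbers below are hypothetical.

```python
def global_batch_size(per_gpu_batch, gpu_count):
    """Combined batch passed to fit(): one per-GPU batch for every GPU."""
    return per_gpu_batch * gpu_count


def usable_sample_count(n_samples, per_gpu_batch, gpu_count):
    """Largest sample count that is a whole number of combined batches."""
    step = global_batch_size(per_gpu_batch, gpu_count)
    return (n_samples // step) * step


# Illustrative numbers only: 1000 samples, 32 per GPU, 3 GPUs.
n_keep = usable_sample_count(n_samples=1000, per_gpu_batch=32, gpu_count=3)
```

You would then train on X_train[:n_keep] and y_train[:n_keep], and trim the validation data the same way.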