赞
踩
本文分享自华为云社区《CNN-VIT 视频动态手势识别【玩转华为云】》,作者: HouYanSong。
人工智能的发展日新月异,也深刻的影响到人机交互领域的发展。手势动作作为一种自然、快捷的交互方式,在智能驾驶、虚拟现实等领域有着广泛的应用。手势识别的任务是,当操作者做出某个手势动作后,计算机能够快速准确的判断出该手势的类型。本文将使用ModelArts开发训练一个视频动态手势识别的算法模型,对上滑、下滑、左滑、右滑、打开、关闭等动态手势类别进行检测,实现类似华为手机隔空手势的功能。
CNN-VIT 视频动态手势识别算法首先使用预训练网络InceptionResNetV2逐帧提取视频动作片段特征,然后输入Transformer Encoder进行分类。我们使用动态手势识别样例数据集对算法进行测试,总共包含108段视频,数据集包含无效手势、上滑、下滑、左滑、右滑、打开、关闭等7种手势的视频,具体操作流程如下:
首先我们将采集的视频文件解码抽取关键帧,每隔4帧保存一次,然后对图像进行中心裁剪和预处理,代码如下:
- def load_video(file_name):
- cap = cv2.VideoCapture(file_name)
- # 每隔多少帧抽取一次
- frame_interval = 4
- frames = []
- count = 0
- while True:
- ret, frame = cap.read()
- if not ret:
- break
-
- # 每隔frame_interval帧保存一次
- if count % frame_interval == 0:
- # 中心裁剪
- frame = crop_center_square(frame)
- # 缩放
- frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
- # BGR -> RGB [0,1,2] -> [2,1,0]
- frame = frame[:, :, [2, 1, 0]]
- frames.append(frame)
- count += 1
-
- return np.array(frames)
然后我们创建图像特征提取器,使用预训练模型InceptionResNetV2提取图像特征,代码如下:
- def get_feature_extractor():
- feature_extractor = keras.applications.inception_resnet_v2.InceptionResNetV2(
- weights = 'imagenet',
- include_top = False,
- pooling = 'avg',
- input_shape = (IMG_SIZE, IMG_SIZE, 3)
- )
-
- preprocess_input = keras.applications.inception_resnet_v2.preprocess_input
-
- inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
- preprocessed = preprocess_input(inputs)
- outputs = feature_extractor(preprocessed)
-
- model = keras.Model(inputs, outputs, name = 'feature_extractor')
-
- return model
接着提取视频特征向量,如果视频不足40帧就创建全0数组进行补白:
- def load_data(videos, labels):
-
- video_features = []
-
- for video in tqdm(videos):
- frames = load_video(video)
- counts = len(frames)
- # 如果帧数小于MAX_SEQUENCE_LENGTH
- if counts < MAX_SEQUENCE_LENGTH:
- # 补白
- diff = MAX_SEQUENCE_LENGTH - counts
- # 创建全0的numpy数组
- padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
- # 数组拼接
- frames = np.concatenate((frames, padding))
- # 获取前MAX_SEQUENCE_LENGTH帧画面
- frames = frames[:MAX_SEQUENCE_LENGTH, :]
- # 批量提取特征
- video_feature = feature_extractor.predict(frames)
- video_features.append(video_feature)
-
- return np.array(video_features), np.array(labels)
最后创建VIT Model,代码如下:
- # 位置编码
- class PositionalEmbedding(layers.Layer):
- def __init__(self, seq_length, output_dim):
- super().__init__()
- # 构造从0~MAX_SEQUENCE_LENGTH的列表
- self.positions = tf.range(0, limit=MAX_SEQUENCE_LENGTH)
- self.positional_embedding = layers.Embedding(input_dim=seq_length, output_dim=output_dim)
-
- def call(self,x):
- # 位置编码
- positions_embedding = self.positional_embedding(self.positions)
- # 输入相加
- return x + positions_embedding
-
- # 编码器
- class TransformerEncoder(layers.Layer):
-
- def __init__(self, num_heads, embed_dim):
- super().__init__()
- self.p_embedding = PositionalEmbedding(MAX_SEQUENCE_LENGTH, NUM_FEATURES)
- self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim, dropout=0.1)
- self.layernorm = layers.LayerNormalization()
-
- def call(self,x):
- # positional embedding
- positional_embedding = self.p_embedding(x)
- # self attention
- attention_out = self.attention(
- query = positional_embedding,
- value = positional_embedding,
- key = positional_embedding,
- attention_mask = None
- )
- # layer norm with residual connection
- output = self.layernorm(positional_embedding + attention_out)
- return output
-
- def video_cls_model(class_vocab):
- # 类别数量
- classes_num = len(class_vocab)
- # 定义模型
- model = keras.Sequential([
- layers.InputLayer(input_shape=(MAX_SEQUENCE_LENGTH, NUM_FEATURES)),
- TransformerEncoder(2, NUM_FEATURES),
- layers.GlobalMaxPooling1D(),
- layers.Dropout(0.1),
- layers.Dense(classes_num, activation="softmax")
- ])
- # 编译模型
- model.compile(optimizer = keras.optimizers.Adam(1e-5),
- loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False),
- metrics = ['accuracy']
- )
- return model
完整体验可以点击Run in ModelArts一键运行我发布的Notebook:
最终模型在整个数据集上的准确率达到87%,即在小数据集上训练取得了较为不错的结果。
首先加载VIT Model,获取视频类别索引标签:
- import random
- # 加载模型
- model = tf.keras.models.load_model('saved_model')
- # 类别标签
- label_to_name = {0:'无效手势', 1:'上滑', 2:'下滑', 3:'左滑', 4:'右滑', 5:'打开', 6:'关闭', 7:'放大', 8:'缩小'}
然后使用图像特征提取器InceptionResNetV2提取视频特征:
- # 获取视频特征
- def getVideoFeat(frames):
-
- frames_count = len(frames)
-
- # 如果帧数小于MAX_SEQUENCE_LENGTH
- if frames_count < MAX_SEQUENCE_LENGTH:
- # 补白
- diff = MAX_SEQUENCE_LENGTH - frames_count
- # 创建全0的numpy数组
- padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
- # 数组拼接
- frames = np.concatenate((frames, padding))
-
- # 取前MAX_SEQ_LENGTH帧
- frames = frames[:MAX_SEQUENCE_LENGTH,:]
- # 计算视频特征 N, 1536
- video_feat = feature_extractor.predict(frames)
-
- return video_feat
最后将视频序列的特征向量输入Transformer Encoder进行预测:
- # 视频预测
- def testVideo():
- test_file = random.sample(videos, 1)[0]
- label = test_file.split('_')[-2]
-
- print('文件名:{}'.format(test_file) )
- print('真实类别:{}'.format(label_to_name.get(int(label))) )
-
- # 读取视频每一帧
- frames = load_video(test_file)
- # 挑选前帧MAX_SEQUENCE_LENGTH显示
- frames = frames[:MAX_SEQUENCE_LENGTH].astype(np.uint8)
- # 保存为GIF
- imageio.mimsave('animation.gif', frames, duration=10)
- # 获取特征
- feat = getVideoFeat(frames)
- # 模型推理
- prob = model.predict(tf.expand_dims(feat, axis=0))[0]
-
- print('预测类别:')
- for i in np.argsort(prob)[::-1][:5]:
- print('{}: {}%'.format(label_to_name[i], round(prob[i]*100, 2)))
-
- return display(Image(open('animation.gif', 'rb').read()))
模型预测结果:
- 文件名:hand_gesture/woman_014_0_7.mp4
- 真实类别:无效手势
- 预测类别:
- 无效手势: 99.82%
- 下滑: 0.12%
- 关闭: 0.04%
- 左滑: 0.01%
- 打开: 0.01%
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。