An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition (CVPR 2019)
Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, Tieniu Tan
Notes
For the skeleton sequence, the 3D coordinates of each joint are first mapped into a high-dimensional feature space using a linear layer and an LSTM layer.
The linear layer encodes the coordinates of each joint into a 256-dim vector as the position feature Pti. At the same time, the frame-difference feature Vti between two consecutive frames provides dynamic information for AGC-LSTM. To take advantage of both, the concatenation of the two features serves as an augmented feature that enriches the feature information.
However, concatenating the position feature Pti and the frame-difference feature Vti introduces a scale variance between the two feature vectors. Therefore, an LSTM layer is adopted to dispel the scale variance between them:
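A minimal numpy sketch of this feature augmentation (the shapes, the 256-dim size, and the random weights are illustrative assumptions; the scale-dispelling LSTM is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): T frames, N joints, 3-D coordinates.
T, N, D, F = 20, 25, 3, 256
joints = rng.standard_normal((T, N, D))   # raw 3D joint coordinates

# Linear layer: encode each joint's coordinates into a 256-dim position feature P_ti.
W_pos = rng.standard_normal((D, F)) * 0.1
P = joints @ W_pos                        # (T, N, 256)

# Frame-difference feature V_ti between consecutive frames (first frame -> zeros).
V = np.zeros_like(P)
V[1:] = P[1:] - P[:-1]

# Augmented feature: concatenation of position and frame-difference features.
# In the paper, an LSTM layer then dispels the scale variance between the parts.
E = np.concatenate([P, V], axis=-1)       # (T, N, 512)
print(E.shape)                            # (20, 25, 512)
```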
Various LSTM-based models have been employed to learn the temporal dynamics of skeleton sequences. However, because of the fully connected operators inside an LSTM, these models ignore the spatial correlations that matter for skeleton-based action recognition. Compared with a plain LSTM, AGC-LSTM can not only capture discriminative features in the spatial configuration and temporal dynamics, but also explore the co-occurrence relationship between the spatial and temporal domains.
where ⊛ denotes the graph convolution operator and fatt(·) is an attention network that selects discriminative information of key nodes. The sum fatt(Ht) + Ht serves as the output, which strengthens the information of key nodes without weakening the information of non-focused nodes.
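The idea of replacing the LSTM's fully connected operators with graph convolutions can be sketched as follows. This is a simplified single-step cell over a normalized adjacency matrix; the attention term and the exact gate wiring of the paper are not reproduced, and all names and shapes are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalize_adj(A):
    # Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2.
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gc_lstm_step(x, h, c, A_norm, Wx, Wh, b):
    """One time step over all N joints at once.

    x: (N, Din) node inputs; h, c: (N, H) hidden/cell states.
    Each gate uses a graph convolution A_norm @ (.) @ W instead of the
    fully connected operator of a vanilla LSTM, so spatial correlations
    between neighboring joints are kept.
    """
    z = A_norm @ (x @ Wx) + A_norm @ (h @ Wh) + b     # (N, 4H)
    i, f, o, g = np.split(z, 4, axis=1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)               # attention acts on this in AGC-LSTM
    return h_new, c_new

# Tiny demo on a 4-joint chain graph (all numbers illustrative).
rng = np.random.default_rng(1)
N, Din, H = 4, 8, 16
A = np.zeros((N, N)); A[0, 1] = A[1, 2] = A[2, 3] = 1; A += A.T
A_norm = normalize_adj(A)
Wx = rng.standard_normal((Din, 4 * H)) * 0.1
Wh = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
h = np.zeros((N, H)); c = np.zeros((N, H))
for x in rng.standard_normal((5, N, Din)):            # 5 time steps
    h, c = gc_lstm_step(x, h, c, A_norm, Wx, Wh, b)
print(h.shape)                                        # (4, 16)
```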
The attention network is employed to adaptively focus on key joints with a soft attention mechanism that can automatically measure the importance of joints. The intermediate hidden state of AGC-LSTM contains rich spatial structural information and temporal dynamics that are beneficial in guiding the selection of key joints. So we first aggregate the information of all nodes as a query feature:
where W is the learnable parameter matrix. Then the attention scores of all nodes can be calculated as:
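A hedged sketch of such a soft attention over joints follows: a query feature aggregates all nodes, then each joint is scored against the query with a sigmoid. The exact form of the paper's attention network is not reproduced, and all weight names and sizes are assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_attention(H, W_q, W_h, W_c, u):
    """Soft attention scores for N joints.

    H: (N, d) hidden states of all joints at one time step.
    Returns one importance weight per joint in (0, 1).
    """
    q = relu(H.sum(axis=0) @ W_q)             # (dq,) query aggregating all nodes
    s = np.tanh(H @ W_h + q @ W_c)            # (N, dq) combined score features
    return sigmoid(s @ u)                     # (N,) per-joint attention weights

rng = np.random.default_rng(2)
N, d, dq = 25, 32, 16
H = rng.standard_normal((N, d))
alpha = joint_attention(H,
                        rng.standard_normal((d, dq)) * 0.1,
                        rng.standard_normal((d, dq)) * 0.1,
                        rng.standard_normal((dq, dq)) * 0.1,
                        rng.standard_normal(dq) * 0.1)
print(alpha.shape)                            # (25,)
```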
Note that, at the last AGC-LSTM layer, the aggregation of all node features serves as a global feature Ftg, and the weighted sum of the focused nodes serves as a local feature Ftl:
Finally, the global feature Ftg and local feature Ftl of each time step are transformed into the scores otg and otl for the C classes, where ot = (ot1, ot2, ..., otC). The predicted probability of the i-th class is then obtained as:

ŷti = e^(oti) / Σ_(j=1..C) e^(otj)
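The classification head described above can be sketched like this. The feature sizes, the weights, and the use of a plain sum for the global aggregation are assumptions for illustration:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

rng = np.random.default_rng(3)
N, d, C = 25, 32, 60                         # 60 classes, as in NTU RGB+D (illustrative)
H = rng.standard_normal((N, d))              # node features at one time step
alpha = 1 / (1 + np.exp(-rng.standard_normal(N)))  # per-joint attention weights

F_global = H.sum(axis=0)                     # aggregation of all node features
F_local = (alpha[:, None] * H).sum(axis=0)   # weighted sum of focused nodes

W_g = rng.standard_normal((d, C)) * 0.1
W_l = rng.standard_normal((d, C)) * 0.1
p_global = softmax(F_global @ W_g)           # (C,) class probabilities, global stream
p_local = softmax(F_local @ W_l)             # (C,) class probabilities, local stream

# Prediction: argmax of the summed probabilities of both streams.
pred = int(np.argmax(p_global + p_local))
print(pred)
```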
During training, considering that the hidden state of each time step at the top AGC-LSTM layer contains short-term dynamics, the model is supervised with the following loss:

where y is the ground-truth label and T denotes the number of time steps of the top AGC-LSTM layer. The third term encourages equal attention to be paid to different joints, and the last term limits the number of interested nodes. λ and β are weight-decay coefficients. Note that only the sum of the probabilities ŷTg and ŷTl at the last time step is used to predict the class of the human action.
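A hedged sketch of a loss with this shape: per-time-step cross-entropy for both streams, plus two attention regularizers. The exact forms of the regularizers are assumptions reconstructed from the description (one spreads attention equally over joints, one penalizes attending to many joints):

```python
import numpy as np

def cross_entropy(p, y):
    # p: (C,) predicted probabilities, y: integer class label.
    return -np.log(p[y] + 1e-12)

def agc_lstm_loss(p_global, p_local, alphas, y, lam=0.01, beta=0.001):
    """Hedged sketch of the multi-term training loss.

    p_global, p_local: (T, C) per-time-step class probabilities.
    alphas: (T, N) per-time-step attention weights over N joints.
    Terms 1-2: cross-entropy at every time step for both streams.
    Term 3 (assumption): push each joint's time-averaged attention toward 1,
    i.e. pay equal attention to different joints.
    Term 4 (assumption): an L2 penalty limiting the number of attended joints.
    """
    ce = sum(cross_entropy(p, y) for p in p_global) \
       + sum(cross_entropy(p, y) for p in p_local)
    equal_attention = np.sum((1.0 - alphas.mean(axis=0)) ** 2)
    sparsity = np.sum(alphas ** 2)
    return ce + lam * equal_attention + beta * sparsity

rng = np.random.default_rng(4)
T, N, C = 5, 25, 60
p_g = np.exp(rng.standard_normal((T, C))); p_g /= p_g.sum(axis=1, keepdims=True)
p_l = np.exp(rng.standard_normal((T, C))); p_l /= p_l.sum(axis=1, keepdims=True)
alphas = 1 / (1 + np.exp(-rng.standard_normal((T, N))))
loss = agc_lstm_loss(p_g, p_l, alphas, y=3)
print(float(loss))
```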
According to the physical structure of the human body, the skeleton can be divided into several parts. Similar to the joint-based AGC-LSTM network, the part-based variant first captures part features with a linear layer and a shared LSTM layer. The part features, as node representations, are then fed into three AGC-LSTM layers to model the spatio-temporal characteristics.
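A small sketch of grouping joints into body parts by mean pooling. The 25-joint layout and the part index lists below are hypothetical examples, not the paper's exact split:

```python
import numpy as np

# Hypothetical grouping of 25 joints into 5 body parts (indices illustrative).
PARTS = {
    "torso":     [0, 1, 2, 3, 20],
    "left_arm":  [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def joints_to_parts(X):
    """X: (T, 25, D) joint features -> (T, P, D) part features by mean pooling.

    Each pooled part feature then acts as one node of the part-based
    AGC-LSTM graph.
    """
    return np.stack([X[:, idx, :].mean(axis=1) for idx in PARTS.values()], axis=1)

rng = np.random.default_rng(5)
X = rng.standard_normal((10, 25, 64))
parts = joints_to_parts(X)
print(parts.shape)                        # (10, 5, 64)
```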