Paper reading: An attention enhanced graph convolutional lstm network for skeleton-based action recognition

An attention enhanced graph convolutional lstm network for skeleton-based action recognition

(2019 CVPR)

Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, Tieniu Tan

Notes

 

Contributions

  1. The proposed AGC-LSTM effectively captures discriminative spatiotemporal features. More specifically, an attention mechanism is employed to enhance the features of key nodes, which helps improve the spatiotemporal representations.
  2. A temporal hierarchical architecture is proposed to boost the ability to learn high-level spatiotemporal semantic features and significantly reduce the computational cost.

 


 

Method

Joints Feature Representation

For the skeleton sequence, we first map the 3D coordinate of each joint into a high-dimensional feature space using a linear layer and an LSTM layer.

       The first linear layer encodes the coordinates of each joint into a 256-dim vector that serves as the position feature $P_{ti}$. In addition, the frame difference feature $V_{ti}$ between two consecutive frames provides dynamic information to AGC-LSTM. To exploit both, the two features are concatenated into an augmented feature that enriches the feature information.

       However, the position feature $P_{ti}$ and the frame difference feature $V_{ti}$ differ in scale, so their concatenation mixes features of different scales. Therefore, we adopt an LSTM layer to dispel the scale variance between the two features:
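
The feature augmentation described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `augmented_joint_features`, the random weights, and the choice to take the frame difference on the position features (with a zero difference for the first frame) are assumptions. The LSTM layer that removes the scale variance is omitted.

```python
import numpy as np

def augmented_joint_features(coords, W_p, b_p):
    """Build position + frame-difference features.

    coords: (T, N, 3) array of 3D joint coordinates (T frames, N joints).
    W_p, b_p: weights of the linear layer (hypothetical names).
    Returns an array of shape (T, N, 2 * D).
    """
    # Position feature P_ti: linear map of the 3D coordinates.
    P = coords @ W_p + b_p                      # (T, N, D)
    # Frame difference V_ti = P_ti - P_(t-1)i; assumed zero for the first frame.
    V = np.zeros_like(P)
    V[1:] = P[1:] - P[:-1]
    # Augmented feature: concatenation of both along the channel axis.
    return np.concatenate([P, V], axis=-1)

rng = np.random.default_rng(0)
T, N, D = 4, 25, 256
coords = rng.normal(size=(T, N, 3))
W_p = rng.normal(scale=0.1, size=(3, D))
b_p = np.zeros(D)

feats = augmented_joint_features(coords, W_p, b_p)
print(feats.shape)  # (4, 25, 512)
```

In the notes, this concatenated feature is then passed through an LSTM layer before entering the AGC-LSTM layers.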

 

 

Attention Enhanced Graph Convolutional LSTM (AGC-LSTM)

Various LSTM-based models are employed to learn the temporal dynamics of skeleton sequences. However, because of the fully connected operators within LSTM, these models ignore the spatial correlation among joints. Compared with LSTM, AGC-LSTM can not only capture discriminative features in spatial configuration and temporal dynamics, but also explore the co-occurrence relationship between the spatial and temporal domains.

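where $*_{\mathcal{G}}$ denotes the graph convolution operator and $f_{att}(\cdot)$ is an attention network that can select discriminative information of key nodes. The output is the sum of $f_{att}(\hat{H}_t)$ and the intermediate hidden state $\hat{H}_t$, that is, $H_t = f_{att}(\hat{H}_t) + \hat{H}_t$, which strengthens the information of key nodes without weakening the information of non-focused nodes.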
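
To make the idea of replacing the fully connected operator with a graph convolution concrete, here is a minimal sketch of one LSTM gate computed with graph convolutions. It is not the paper's exact operator: a symmetrically normalised adjacency (the common GCN form) is swapped in for simplicity, and the names `normalize_adjacency`, `graph_conv`, and `gate` as well as the 3-joint toy skeleton are assumptions.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalisation D^{-1/2} (A + I) D^{-1/2} (assumed form)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(X, A_norm, W):
    """Graph convolution: mix features over neighbouring joints, then project.

    X: (N, C_in) node features, W: (C_in, C_out) weights.
    """
    return A_norm @ X @ W

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(X_t, H_prev, A_norm, W_x, W_h, b):
    """One LSTM gate where both linear maps are graph convolutions."""
    return sigmoid(graph_conv(X_t, A_norm, W_x) + graph_conv(H_prev, A_norm, W_h) + b)

# Toy skeleton: a chain of 3 joints (0 - 1 - 2).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A_norm = normalize_adjacency(A)

rng = np.random.default_rng(0)
X_t = rng.normal(size=(3, 8))                   # input features per joint
H_prev = np.zeros((3, 4))                       # previous hidden state
W_x = rng.normal(scale=0.1, size=(8, 4))
W_h = rng.normal(scale=0.1, size=(4, 4))
b = np.zeros(4)

i_t = gate(X_t, H_prev, A_norm, W_x, W_h, b)
print(i_t.shape)  # (3, 4)
```

Because every gate is computed per joint while mixing information from neighbouring joints, the hidden state keeps the graph structure instead of flattening the skeleton into one vector.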

       The attention network adaptively focuses on key joints with a soft attention mechanism that automatically measures the importance of each joint. The intermediate hidden state $\hat{H}_t$ of AGC-LSTM contains rich spatial structural information and temporal dynamics, which help guide the selection of key joints. So we first aggregate the information of all nodes as a query feature:

where W is the learnable parameter matrix. Then the attention scores of all nodes can be calculated as:

The attention-enhanced hidden state is then fed into the next AGC-LSTM layer. Note that, at the last AGC-LSTM layer, the aggregation of all node features serves as a global feature $F_t^g$, and the weighted sum of focused nodes serves as a local feature $F_t^l$:
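
The attention step and the two pooled features can be sketched as below. The exact formulas are not preserved in these notes, so the additive form of the score, the ReLU on the query, the sigmoid on the scores, and all parameter names (`W`, `W_h`, `W_q`, `U_s`, `b_s`, `b_u`) are assumptions made for illustration. The attention output is assumed to be the score-weighted hidden state, so the residual sum becomes $(1 + \alpha) \cdot \hat{H}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_enhance(H_hat, W, W_h, W_q, U_s, b_s, b_u):
    """Soft attention over joints (illustrative form, not the paper's code).

    H_hat: (N, D) intermediate hidden state of one time step.
    Returns scores alpha (N,), enhanced state H (N, D),
    global feature F_g (D,), and local feature F_l (D,).
    """
    # Query: aggregate the information of all nodes.
    q = np.maximum(0.0, (H_hat @ W).sum(axis=0))          # (D,)
    # Per-joint score; a sigmoid lets several joints be "key" at once.
    e = np.tanh(H_hat @ W_h + q @ W_q + b_s)              # (N, D_a)
    alpha = sigmoid(e @ U_s + b_u)                        # (N,)
    # Residual enhancement: f_att(H_hat) + H_hat = (1 + alpha) * H_hat.
    H = (1.0 + alpha)[:, None] * H_hat
    F_g = H.sum(axis=0)                                   # global feature
    F_l = (alpha[:, None] * H_hat).sum(axis=0)            # local feature
    return alpha, H, F_g, F_l

rng = np.random.default_rng(0)
N, D, D_a = 5, 8, 6
H_hat = rng.normal(size=(N, D))
W = rng.normal(scale=0.1, size=(D, D))
W_h = rng.normal(scale=0.1, size=(D, D_a))
W_q = rng.normal(scale=0.1, size=(D, D_a))
U_s = rng.normal(scale=0.1, size=D_a)
b_s = np.zeros(D_a)
b_u = 0.0

alpha, H, F_g, F_l = attention_enhance(H_hat, W, W_h, W_q, U_s, b_s, b_u)
print(alpha.shape, H.shape, F_g.shape, F_l.shape)  # (5,) (5, 8) (8,) (8,)
```

Under these assumptions the global feature equals the plain sum of node features plus the local feature, which shows how the residual form keeps non-focused joints in play.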

 

 

Learning AGC-LSTM

Finally, the global feature $F_t^g$ and the local feature $F_t^l$ of each time step are transformed into the scores $o_t^g$ and $o_t^l$ for C classes, where $o_t = (o_{t1}, \dots, o_{tC})$. The predicted probability of the $i$-th class is then obtained as:
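
A short sketch of this scoring step, assuming a linear classifier followed by a softmax (the notes only say the features are "transformed into scores", so the linear layer and the names `W_c`, `b_c`, and `class_probabilities` are assumptions):

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over the last axis."""
    z = o - o.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def class_probabilities(F, W_c, b_c):
    """Map per-time-step features F (T, D) to class probabilities (T, C)."""
    return softmax(F @ W_c + b_c)

rng = np.random.default_rng(0)
T, D, C = 3, 8, 4
F_g = rng.normal(size=(T, D))                   # global features per time step
F_l = rng.normal(size=(T, D))                   # local features per time step
W_g, b_g = rng.normal(size=(D, C)), np.zeros(C)
W_l, b_l = rng.normal(size=(D, C)), np.zeros(C)

y_hat_g = class_probabilities(F_g, W_g, b_g)
y_hat_l = class_probabilities(F_l, W_l, b_l)

# At test time, the two probabilities at the last time step are summed.
predicted_class = int(np.argmax(y_hat_g[-1] + y_hat_l[-1]))
print(y_hat_g.shape)  # (3, 4)
```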

During training, considering that the hidden state of each time step on the top AGC-LSTM layer contains short-term dynamics, we supervise our model with the following loss:

where $y$ is the ground-truth label and $T_j$ denotes the number of time steps on the $j$-th AGC-LSTM layer. The third term encourages the model to pay equal attention to different joints. The last term limits the number of nodes of interest. λ and β are weight decaying coefficients. Note that only the sum of the probabilities $\hat{y}^g$ and $\hat{y}^l$ at the last time step is used to predict the class of the human action.
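
The four terms described above can be sketched as follows. Only the roles of the terms come from the notes; the concrete form of the two attention regularisers, the function name `agc_lstm_loss`, and the default values of `lam` and `beta` are assumptions chosen for illustration.

```python
import numpy as np

def agc_lstm_loss(y_hat_g, y_hat_l, label, alphas, lam=0.01, beta=0.001):
    """Illustrative training loss with four terms.

    y_hat_g, y_hat_l: (T, C) class probabilities per time step on the top layer.
    label: integer index of the ground-truth class.
    alphas: list with one (T_j, N) array of attention scores per AGC-LSTM layer.
    lam, beta: weighting coefficients (placeholder values).
    """
    eps = 1e-12
    # Terms 1 and 2: cross-entropy on global and local predictions, every time step.
    ce_g = -np.log(y_hat_g[:, label] + eps).sum()
    ce_l = -np.log(y_hat_l[:, label] + eps).sum()
    # Term 3: push each joint's time-averaged attention towards 1 (equal attention).
    equal_attention = sum(((1.0 - a.mean(axis=0)) ** 2).sum() for a in alphas)
    # Term 4: penalise the total attention per time step (few nodes of interest).
    sparsity = sum((a.sum(axis=1) ** 2).mean() for a in alphas)
    return ce_g + ce_l + lam * equal_attention + beta * sparsity

# Toy check: perfect predictions, one layer, two time steps, three joints.
y_hat = np.array([[1.0, 0.0], [1.0, 0.0]])
loss_zero_attention = agc_lstm_loss(y_hat, y_hat, 0, [np.zeros((2, 3))], lam=1.0, beta=1.0)
loss_full_attention = agc_lstm_loss(y_hat, y_hat, 0, [np.ones((2, 1))], lam=1.0, beta=1.0)
```

With zero attention everywhere only the equal-attention term is active, and with full attention on a single joint only the sparsity term is active, which shows how the two regularisers pull in opposite directions.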

 

 

Joint Model

According to the physical structure of the human body, the skeleton can be divided into several parts. Similar to the joint-based AGC-LSTM network, we first capture part features with a linear layer and a shared LSTM layer. The part features then serve as node representations and are fed into three AGC-LSTM layers to model spatial-temporal characteristics.
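
As a rough illustration of how part-level inputs might be formed before the linear layer, the sketch below groups joints into parts and concatenates their coordinates. The partition `PARTS`, the 6-joint toy skeleton, and the concatenation itself are hypothetical; the notes do not specify how the parts are defined or encoded.

```python
import numpy as np

# Hypothetical partition of a 6-joint toy skeleton into body parts.
PARTS = {
    "torso": [0, 1],
    "left_arm": [2, 3],
    "right_arm": [4, 5],
}

def part_inputs(coords, parts):
    """Concatenate the coordinates of the joints belonging to each part.

    coords: (T, N, 3) joint coordinates.
    Returns a dict mapping part name to an array of shape (T, 3 * joints_in_part).
    """
    T = coords.shape[0]
    return {name: coords[:, idx, :].reshape(T, -1) for name, idx in parts.items()}

rng = np.random.default_rng(0)
coords = rng.normal(size=(4, 6, 3))
inputs = part_inputs(coords, PARTS)
print(inputs["torso"].shape)  # (4, 6)
```

Each part input would then go through its own linear layer and the shared LSTM layer, and the resulting part features act as graph nodes in the same way the joint features do.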

 


 

Results

 
