Temporal Relation based Attentive Prototype Network for Few-shot Action Recognition

Guangge Wang (Xiamen University); Haihui Ye (Xiamen University); Xiao Wang (Xiamen University); Weirong Ye (Xiamen University); Hanzi Wang (Xiamen University)*


Few-shot action recognition aims to recognize novel action classes from only a small number of labeled video samples. We propose a temporal relation based attentive prototype network (TRAPN) for few-shot action recognition. Concretely, we tackle this challenging task from three aspects. First, we propose a spatio-temporal motion enhancement (STME) module to highlight object motions in videos; it exploits cues from content displacements across frames to enhance the motion-related regions of the features. Second, our temporal relation (TR) module learns the core common action transformations by capturing temporal relations at short-term to long-term time scales. The learned temporal relations are encoded into descriptors that constitute sample-level features, so an abstract action transformation is described by multiple groups of temporal relation descriptors. Third, since a vanilla prototype of a support class (e.g., the mean of its samples) cannot fit different query samples well, we construct a query-specific prototype from the temporal relation descriptors of the support samples, which gives more weight to discriminative samples. We evaluate TRAPN on the real-world few-shot datasets Kinetics, UCF101 and HMDB51, where it achieves state-of-the-art performance among competing methods.
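As a rough illustration of the last two components, the sketch below computes multi-scale temporal relation descriptors from per-frame features and an attention-weighted, query-specific prototype over support descriptors. This is a minimal sketch under stated assumptions, not the paper's method: the function names, the subset-averaging form of the relations, and the use of cosine-similarity softmax attention are all assumptions on our part, whereas TRAPN's actual modules are learned neural networks.

```python
import numpy as np
from itertools import combinations

def temporal_relation_descriptors(frames, scales=(2, 3)):
    """Hypothetical sketch of multi-scale temporal relations: for each
    scale n, average the concatenated features of every ordered n-frame
    subset of the video. frames: (T, D) per-frame features."""
    T, _ = frames.shape
    descriptors = []
    for n in scales:
        rel = np.mean([np.concatenate([frames[i] for i in idx])
                       for idx in combinations(range(T), n)], axis=0)
        descriptors.append(rel)  # one descriptor of size n*D per time scale
    return descriptors

def query_specific_prototype(support, query):
    """Attention-weighted prototype over support descriptors (sketch).
    support: (K, D) descriptors of the support samples; query: (D,).
    Softmax over cosine similarity puts more weight on support samples
    that are discriminative for this particular query."""
    sims = support @ query / (np.linalg.norm(support, axis=1)
                              * np.linalg.norm(query) + 1e-8)
    w = np.exp(sims - sims.max())
    w /= w.sum()                 # attention weights over support samples
    return w @ support           # query-specific prototype, shape (D,)
```

For instance, with two orthogonal support descriptors and a query aligned with the first one, the prototype leans toward that first support sample instead of the plain class mean, which is the intended behavior of an attentive prototype.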