Video Action Recognition with Neural Architecture Search

Yuanding Zhou (Dalian University of Technology); Baopu Li (BAIDU USA LLC)*; Zhihui Wang (Dalian University of Technology); Haojie Li (Dalian University of Technology)


Recently, deep convolutional neural networks have been widely used in the field of video action recognition. Current approaches tend to concentrate on structure design for different backbone networks, but despite encouraging progress, it remains an open question which network structures can process videos both accurately and efficiently. With the help of neural architecture search (NAS), we search for three hyperparameters of the video processing network: the number of input frames, the number of layers in each residual stage, and the number of channels in each layer. We relax the discrete search space into a continuous one and search for a set of network architectures that balance accuracy and computational efficiency, treating accuracy as the primary optimization goal and computational complexity as the secondary one. We conduct experiments on the UCF101 and Kinetics400 datasets, where the proposed NAS-based scheme achieves new state-of-the-art results for video action recognition.
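To make the idea of the continuous relaxation concrete, the sketch below shows one plausible reading of it for a single searched hyperparameter (the number of input frames). All names, candidate values, FLOP estimates, and the trade-off coefficient `lam` are illustrative assumptions, not details taken from the paper: a softmax over learnable logits turns the discrete choice into a differentiable mixture, and a softmax-weighted cost term serves as the secondary objective.

```python
import math

# Hypothetical candidate values for one searched hyperparameter,
# e.g. the number of input frames (illustrative, not from the paper).
frame_candidates = [8, 16, 32, 64]

# Learnable architecture logits, one per candidate (initialized uniform).
alpha = [0.0] * len(frame_candidates)

def softmax(xs):
    # Numerically stable softmax: relaxes the discrete choice
    # into a differentiable probability distribution.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def expected_cost(logits, costs):
    # Differentiable surrogate for computational complexity:
    # the softmax-weighted average cost over all candidates.
    return sum(w * c for w, c in zip(softmax(logits), costs))

# Illustrative per-candidate cost estimates (cost grows with frame count).
flops = [float(f) for f in frame_candidates]

def total_loss(acc_loss, logits, lam=0.01):
    # Accuracy loss is the primary objective; the complexity term is
    # secondary, traded off by a small coefficient lam (assumed value).
    return acc_loss + lam * expected_cost(logits, flops)

# After the search, the discrete architecture is recovered by argmax
# over the learned logits.
best = frame_candidates[max(range(len(alpha)), key=lambda i: alpha[i])]
```

In an actual search, `alpha` would be updated by gradient descent jointly with the network weights; this sketch only illustrates how relaxing the space makes a mixed accuracy/complexity objective differentiable.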