Detecting Actions in Videos via Graph Convolutional Networks - Chen Zhao
posted on 30 November, 2021


Abstract: Detecting when an action occurs in a video, i.e., its temporal extent, is an important and fundamental task in video understanding. It is a key problem for various applications, such as extracting highlights in sports and identifying anomalous behavior in surveillance videos, and it is also an important subtask for other tasks such as video language grounding and video captioning. Recent years have witnessed rapid progress in performance as various methods based on convolutional neural networks (CNNs) have been proposed. However, CNNs face some limitations when exploring the properties of videos: correlations between non-consecutive frames cannot be directly exploited, and large variations in action temporal duration are not effectively handled. To address these challenges, we introduce graph convolutional networks (GCNs), which were previously used mostly for non-Euclidean data, to the task of temporal action detection. In this talk, I will present several methods for modeling video data as graphs and for aggregating information from different parts of a video via graph convolution. I will demonstrate how GCNs effectively model correlations between long-distance frames and lead to better detection performance, and show how GCNs enable establishing multi-scale correlations and thereby benefit the detection of short actions. I will also discuss the potential of applying our GCN models to other tasks in video understanding.
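
To make the core idea concrete, below is a minimal sketch (not the talk's actual models) of graph convolution over video snippet features, assuming PyTorch. Snippets are treated as graph nodes, and edges connect each snippet to its most similar snippets regardless of temporal distance, so a single graph-convolution step can aggregate information from non-consecutive frames. The names `knn_graph` and `SnippetGCNLayer`, and the toy dimensions, are illustrative assumptions.

```python
# Hedged sketch: graph convolution over video snippet features (PyTorch assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F


def knn_graph(feats: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Build a (T, T) normalized adjacency matrix that links each snippet
    to its k most similar snippets by cosine similarity (plus a self-loop),
    so edges may connect temporally distant snippets."""
    normed = F.normalize(feats, dim=1)
    sim = normed @ normed.t()                       # (T, T) cosine similarities
    idx = sim.topk(k + 1, dim=1).indices            # top-k neighbors, self included
    adj = torch.zeros_like(sim)
    adj.scatter_(1, idx, 1.0)
    # Symmetric normalization: D^{-1/2} A D^{-1/2}
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)


class SnippetGCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_norm H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return F.relu(adj @ self.linear(feats))


if __name__ == "__main__":
    T, D = 100, 256                  # 100 snippets with 256-d features (toy numbers)
    feats = torch.randn(T, D)        # e.g., features from a pretrained CNN backbone
    adj = knn_graph(feats, k=4)      # edges may link non-consecutive snippets
    out = SnippetGCNLayer(D, 128)(feats, adj)
    print(out.shape)                 # torch.Size([100, 128])
```

Building the adjacency from feature similarity rather than temporal adjacency is what lets each node receive information from long-distance frames in a single layer; stacking layers or building graphs at several temporal resolutions would be one way to capture multi-scale correlations.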