Visitors from KAUST - Mattia Soldan, Mengmeng Xu and Humam Alwassel
posted on 29 July, 2022


Talk 1: Humam Alwassel
Title: Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Abstract: Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize both the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
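To make the cross-modal supervision idea concrete, here is a minimal PyTorch-style sketch of one training step in the spirit of XDC. The tiny encoders, the number of clusters, and the per-batch clustering are simplifying assumptions for illustration (the actual method clusters features over the whole dataset with deep video and audio backbones), not the authors' implementation.

```python
# Toy sketch of cross-modal pseudo-labelling in the spirit of XDC.
# All shapes, encoders, and the per-batch k-means are illustrative
# assumptions, not the published training pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

K = 8  # number of pseudo-label clusters (hypothetical choice)

video_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 512))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(1 * 40 * 100, 512))
video_head = nn.Linear(512, K)  # predicts clusters derived from audio
audio_head = nn.Linear(512, K)  # predicts clusters derived from video


def cluster_labels(features):
    """k-means on one modality's (detached) features -> pseudo-labels."""
    km = KMeans(n_clusters=K, n_init=10).fit(features.detach().cpu().numpy())
    return torch.as_tensor(km.labels_, dtype=torch.long)


def xdc_step(video_clips, audio_specs, optimizer):
    v_feat = video_encoder(video_clips)
    a_feat = audio_encoder(audio_specs)
    # Cross-modal supervision: each branch is trained to predict the
    # cluster assignments computed from the *other* modality.
    a_labels = cluster_labels(a_feat)  # supervises the video branch
    v_labels = cluster_labels(v_feat)  # supervises the audio branch
    loss = F.cross_entropy(video_head(v_feat), a_labels) \
         + F.cross_entropy(audio_head(a_feat), v_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


params = (list(video_encoder.parameters()) + list(audio_encoder.parameters())
          + list(video_head.parameters()) + list(audio_head.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-3)
# 16 paired video clips (3x8x32x32) and log-mel spectrograms (1x40x100), random for the demo.
print(xdc_step(torch.randn(16, 3, 8, 32, 32), torch.randn(16, 1, 40, 100), optimizer))
```

Note that k-means cluster identities are only defined up to permutation, so a real training loop would re-cluster periodically and handle changing label assignments; the sketch ignores this for brevity.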

Talk 2: Mattia Soldan
Title: Connecting Language and Video to Enable Semantic Video Search
Abstract: This talk will review the importance that video content has gained in the digital era and give an intuition about the relevance of video search systems. The Video Language Grounding (VLG) task will be proposed as a research direction for building a human-friendly, language-based search algorithm able to understand the semantic content of both language and video data and reason about their interactions. We will discuss VLG-Net, a recent deep learning architecture that leverages Graph Neural Networks to bridge the two modalities and learn fine-grained video-language alignment. Furthermore, I will introduce MAD, a novel large-scale dataset and benchmark for the VLG task, which addresses the shortcomings of legacy datasets while introducing a new long-form setup, opening up new challenges and opportunities for the research community.
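As a rough illustration of the "bridge the two modalities with a graph" idea, the sketch below builds a single fully connected graph over video-snippet nodes and word nodes, runs one round of message passing, and then scores snippets against the query. Every name, dimension, and the graph construction itself are assumptions made for this toy example; VLG-Net's actual graph layers, matching modules, and moment prediction are more involved.

```python
# Toy cross-modal graph message passing, loosely inspired by the idea of
# bridging video and language with graph neural networks. Not VLG-Net:
# the graph, layers, and scoring here are simplified assumptions.
import torch
import torch.nn as nn

D = 256
snippet_proj = nn.Linear(512, D)  # hypothetical video-snippet features -> D
word_proj = nn.Linear(300, D)     # hypothetical word embeddings -> D
message_layer = nn.Linear(D, D)
scorer = nn.Linear(D, 1)


def ground_sentence(snippets, words):
    """snippets: (N, 512), words: (M, 300) -> per-snippet relevance scores (N,)."""
    nodes = torch.cat([snippet_proj(snippets), word_proj(words)], dim=0)  # (N+M, D)
    # One round of attention-style message passing over a fully connected
    # cross-modal graph (a simplification of learned graph structure).
    adj = torch.softmax(nodes @ nodes.t() / D ** 0.5, dim=-1)
    nodes = torch.relu(message_layer(adj @ nodes)) + nodes
    # Score each video snippet after it has exchanged messages with the words.
    return scorer(nodes[: snippets.shape[0]]).squeeze(-1)


# 100 snippet features for one video, 12 word embeddings for the query (random for the demo).
scores = ground_sentence(torch.randn(100, 512), torch.randn(12, 300))
print(scores.topk(5).indices)  # snippets most relevant to the sentence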

Talk 3: Mengmeng (Frost) Xu
Title: End-to-End Video Encoder Pre-training for Temporal Action Localization
Abstract: Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder with action-classification supervision. This creates a task discrepancy for the video encoder: it is trained for action classification but used for TAL. Intuitively, end-to-end model optimization is the natural solution. However, it is not feasible for TAL under typical GPU memory constraints, due to the prohibitive computational cost of processing long untrimmed videos. In this talk, I will introduce a novel low-fidelity (LoFi) pre-training method and a proposal/gradient sampling method. Instead of always using the full training configuration, these methods reduce the mini-batch composition so that end-to-end optimization of the video encoder becomes feasible on mid-range hardware. Crucially, this allows gradients from a TAL loss to flow back through the video encoder, alleviating the task discrepancy problem and yielding more effective feature representations.
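The sketch below illustrates only the low-fidelity intuition: shrink the clip's temporal and spatial resolution before the memory-heavy encoder so that the localization loss can back-propagate into it. The tiny encoder, the binary start/end head, and the reduction factors are placeholders chosen for this example, not the paper's configuration.

```python
# Toy end-to-end TAL step with low-fidelity inputs. Encoder, head, target
# format, and reduction factors are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

video_encoder = nn.Conv3d(3, 64, kernel_size=3, padding=1)  # stand-in for a 3D CNN backbone
tal_head = nn.Linear(64, 2)                                 # stand-in for a localization head


def lofi_step(full_res_clip, targets, optimizer, t_factor=4, s_factor=2):
    """full_res_clip: (B, 3, T, H, W); targets: (B, T // t_factor, 2) toy start/end labels."""
    # Low-fidelity reduction: subsample frames and pool spatially before the
    # expensive encoder sees the clip, so the whole graph fits in GPU memory.
    clip = full_res_clip[:, :, ::t_factor]
    clip = F.avg_pool3d(clip, kernel_size=(1, s_factor, s_factor))
    feats = video_encoder(clip).mean(dim=(3, 4)).transpose(1, 2)  # (B, T', 64)
    loss = F.binary_cross_entropy_with_logits(tal_head(feats), targets.float())
    optimizer.zero_grad()
    loss.backward()   # gradients from the TAL loss now reach the video encoder
    optimizer.step()
    return loss.item()


params = list(video_encoder.parameters()) + list(tal_head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)
clip = torch.randn(2, 3, 64, 112, 112)      # two "full-fidelity" clips (random for the demo)
targets = torch.randint(0, 2, (2, 16, 2))   # per-snippet start/end presence (toy labels)
print(lofi_step(clip, targets, optimizer))
```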