Learning Sight, Sound, and Language from Videos - Andrew Rouditchenko
posted on 5 October 2021


Abstract: In this talk, I will describe our recent progress in multimodal learning from video, audio, and text. First, I will introduce a model for audio-visual learning from instructional videos which can relate spoken words and sounds to visual content. We propose a cascaded approach to learning multilingual representations by leveraging a model trained on English HowTo100M videos and applying it to Japanese cooking videos. This improves retrieval performance nearly 10x compared to training solely on the Japanese videos. Next, I will present a model for jointly learning from video, audio, and text that enforces a grouping of semantically similar instances in addition to sharing representations across different modalities. The training pipeline extends instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even on unseen datasets and in different domains.
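To give a feel for the kind of training objective the abstract describes, below is a minimal, hypothetical sketch of instance-level contrastive learning across three modalities combined with a clustering step. All function names (`pairwise_nce`, `kmeans_pseudo_labels`, `joint_loss`), the use of a small k-means for pseudo-labels, and the hyperparameters are illustrative assumptions, not the speaker's actual model or training code.

```python
# Illustrative sketch only: multimodal contrastive learning plus a clustering
# term, assuming hypothetical encoders that produce per-clip embeddings.
import torch
import torch.nn.functional as F


def pairwise_nce(x, y, temperature=0.07):
    """Instance-level contrastive (InfoNCE-style) loss between two modalities.

    x, y: (batch, dim) L2-normalized embeddings of the same video clips.
    Matching rows are positives; all other rows in the batch are negatives.
    """
    logits = x @ y.t() / temperature                  # (batch, batch) similarities
    targets = torch.arange(x.size(0), device=x.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def kmeans_pseudo_labels(features, num_clusters=64, iters=10):
    """Tiny k-means (cosine similarity) that assigns pseudo-labels to pooled features."""
    idx = torch.randperm(features.size(0))[:num_clusters]
    centroids = features[idx].clone()
    for _ in range(iters):
        assign = (features @ centroids.t()).argmax(dim=1)   # nearest centroid
        for k in range(centroids.size(0)):
            members = features[assign == k]
            if members.numel() > 0:
                centroids[k] = F.normalize(members.mean(dim=0), dim=0)
    return assign


def joint_loss(video_emb, audio_emb, text_emb):
    """Combine pairwise contrastive terms with a cluster-consistency term."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    t = F.normalize(text_emb, dim=1)

    # Instance-level contrastive losses between every pair of modalities,
    # which share the joint embedding space.
    contrastive = pairwise_nce(v, a) + pairwise_nce(v, t) + pairwise_nce(a, t)

    # Cluster the pooled multimodal features and pull each instance toward its
    # cluster centroid, encouraging semantically similar instances to group together.
    pooled = F.normalize((v + a + t) / 3, dim=1)
    labels = kmeans_pseudo_labels(pooled.detach())
    centroids = torch.stack([
        F.normalize(pooled[labels == k].mean(dim=0), dim=0) if (labels == k).any()
        else torch.zeros(pooled.size(1), device=pooled.device)
        for k in range(int(labels.max().item()) + 1)
    ])
    cluster_term = (1 - (pooled * centroids[labels]).sum(dim=1)).mean()

    return contrastive + cluster_term
```

In this sketch, the contrastive terms align matching video, audio, and text clips at the instance level, while the clustering term groups semantically similar instances regardless of modality; a shared embedding space of this kind is what makes cross-modal retrieval (e.g., ranking videos by similarity to a text or audio query) possible.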