Multimodal Video Understanding and Generation - Dr Mike Z. Shou (National University of Singapore)
posted on 5 July 2024


Abstract: Exciting progress has been made in multimodal video intelligence across its two pillars: understanding and generation. Despite this promise, several key challenges remain. In this talk, I will introduce our attempts to address some of them. (1) For understanding, I will share All-in-one, which employs a single unified network for efficient video-language modeling, and EgoVLP, the first video-language pre-trained model for egocentric video. (2) For generation, I will introduce our work on efficient video diffusion models (i.e., Tune-A-Video, 4K GitHub stars) and long video generation (MagicAnimate, 10K GitHub stars). (3) Finally, I will discuss our recent explorations in unifying understanding and generation, from both the data and modeling perspectives.