Abstract: Our real life is an ongoing multimodal stream involving modalities such as video, audio, and text. Conventional multimodal video understanding and reasoning methods rely on a three-stage paradigm to build an AI agent that understands multiple modalities and reasons over them: modality-specific pretraining of feature extractors, multimodal pretext training to learn to fuse information from different modalities, and fine-tuning on multimodal downstream tasks. However, this paradigm suffers from two major limitations: modality-specific pretraining depends on supervised training or low-quality weak supervision, and multimodal pretext training usually requires millions of image/video-text pairs and enormous computational resources. In this talk, we will introduce our latest research on how to break these two limitations of the existing paradigm for multimodal video understanding and reasoning, with the help of rich knowledge from either textual data or the internal knowledge of pretrained text models.
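
To make the three-stage paradigm concrete, the sketch below shows a toy version of it in PyTorch. The encoders, feature dimensions, contrastive pretext loss, and classification head are hypothetical placeholders chosen to illustrate the generic pipeline; they are not the specific models or methods presented in the talk.

```python
# A minimal, illustrative sketch of the conventional three-stage paradigm.
# All module names, dimensions, and losses are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1: modality-specific encoders, normally pretrained separately
# (e.g., a video backbone and a text encoder) with supervised or weak labels.
video_encoder = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 256))
text_encoder = nn.Sequential(nn.Linear(768, 512), nn.ReLU(), nn.Linear(512, 256))

def pretext_contrastive_step(video_feats, text_feats, temperature=0.07):
    """Stage 2: multimodal pretext training, here a simple video-text
    contrastive alignment over a batch of paired clips and captions."""
    v = F.normalize(video_encoder(video_feats), dim=-1)
    t = F.normalize(text_encoder(text_feats), dim=-1)
    logits = v @ t.T / temperature            # pairwise clip-caption similarities
    targets = torch.arange(v.size(0))         # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Stage 3: fine-tuning a small head on a multimodal downstream task,
# e.g., video question answering treated here as 10-way classification.
task_head = nn.Linear(256 + 256, 10)

def downstream_logits(video_feats, text_feats):
    fused = torch.cat([video_encoder(video_feats), text_encoder(text_feats)], dim=-1)
    return task_head(fused)

if __name__ == "__main__":
    video = torch.randn(8, 2048)   # placeholder clip-level video features
    text = torch.randn(8, 768)     # placeholder caption/question embeddings
    print("pretext loss:", pretext_contrastive_step(video, text).item())
    print("downstream logits:", downstream_logits(video, text).shape)
```

The cost structure criticized in the abstract is visible even in this toy version: Stage 1 assumes labeled or weakly labeled data for each encoder, and Stage 2 only becomes useful when the contrastive step is run over very large collections of paired video-text data.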