Abstract: With the advent of models such as OpenAI CLIP and DALL-E, transformer-based vision-language pre-training has become an increasingly hot research topic. In this talk, I will share some of our recent work in this field, published at NeurIPS 2020, ECCV 2020, and EMNLP 2020. Specifically, I will answer the following questions. First, how can we perform vision-language pre-training? Second, how can we understand what has been learned in the pre-trained models? Third, how can we enhance the performance of pre-trained models via adversarial training? And finally, how can we extend image-text pre-training to video-text pre-training? Accordingly, I will present UNITER, VALUE, VILLA, and HERO to answer these four questions. Finally, I will also briefly discuss the challenges and future directions for vision-language pre-training.