Abstract: Watching and creating videos is a multimodal experience. To understand video, one needs to reason about the movements on screen, the meaning of speech, and the sounds of objects. In this talk, Fabian will discuss two recent works addressing inherently multimodal video problems. First, he will present novel architectures for speaker identification in videos that leverage spatio-temporal audiovisual context. Then, he will share his quest and latest work on learning the anatomy of video editing. To conclude, he will open up the discussion on exciting frontiers in this rapidly evolving research space.