MaVi Seminar Series


22/10/24 2pm Room: Wills Memorial Building, 1.11, Old Council Chambers: Learning to Model the World (and Yourself) from Vision - Vincent Sitzmann, Assistant Professor at MIT EECS

Abstract: In this talk, I will discuss recent publications from my group that attempt to learn models of the world, and of the effect of an agent's actions within that world, in a self-supervised manner, solely via interaction. In particular, I will discuss the potential and challenges of sequence generative models as a candidate for such a world model, the role of inductive biases, using our recent work that discovers the kinematics of a robot as an example, and finally a new research direction in which we attempt to discover the physical rules underlying our world without any inductive biases whatsoever.

17/09/24 2pm Room: QTIC Break Out Space, 1st Floor, 1 Cathedral Square: Data-centric AI at test time - Liang Zheng, Associate Professor (Australian National University)

Abstract: From a complementary perspective to model development, data-centric AI aims to improve and analyse data to better understand AI systems. While significant efforts have been made in understanding training data, this talk will introduce some attempts from my group to analyse test data. Specifically, I will talk about how to evaluate the difficulty of the test data, or in other words, the model accuracy, in an unsupervised way, where certain measurements of model responses are very useful for characterising model performance. Then, I will introduce a new video format from which motions can be efficiently captured by existing action recognition networks. Finally, I will discuss a new way of prompting large language models, which is zero-shot, task-agnostic, and prompt-specific. I will conclude with perspectives on data-centric problems and AI workflows.
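
A minimal illustration of one such response-based measurement (assuming a PyTorch image classifier and an unlabelled test loader; the talk's actual measures may differ): average maximum-softmax confidence over the test set, which tends to correlate with accuracy.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def average_confidence(model, unlabeled_loader, device="cpu"):
    """Average max-softmax confidence over an unlabelled test set.

    Higher values tend to correlate with higher accuracy, so the score can
    serve as a rough, label-free proxy for test-set difficulty.
    """
    model.eval().to(device)
    scores = []
    for images in unlabeled_loader:          # loader yields image batches only
        logits = model(images.to(device))
        probs = F.softmax(logits, dim=-1)
        scores.append(probs.max(dim=-1).values)
    return torch.cat(scores).mean().item()
```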

03/09/24 2pm Room: G.09, Fry Building, Woodland Road, Bristol, BS8 1UG: Computer Vision Technologies for Observing What You Want to See - Hideo Saito, Professor, Keio University, Japan

Abstract: Computer vision (CV) has a broad range of applications that help us overcome the spatial and temporal constraints limiting human observation. One technology that addresses spatial constraints is free-viewpoint image generation, which creates images as if captured from positions where a camera cannot physically be placed. This technique uses 3D models reconstructed through 3D vision technologies, allowing images to be rendered from any viewpoint using computer graphics (CG). The emergence of NeRF has significantly revitalized research in this area. On the other hand, overcoming temporal constraints involves sensing and recognizing rapidly changing objects and accurately presenting their temporal dynamics. Sensing technologies for moving objects are crucial in this context, with recent focus on advanced cameras such as event cameras. In this talk, I will explore the history and latest developments in free-viewpoint image generation based on 3D vision technologies, as well as CV technologies that utilize event cameras for capturing high-speed phenomena, while also presenting examples from my recent research in these fields.

30/07/24 2pm Room: LG.02, Fry Building: Towards Open-world Long Video Understanding - Weidi Xie (Shanghai Jiao Tong University)

Abstract: Understanding videos has long been of great interest to the vision community. Compared to the analysis of static images, the extra time axis introduces both challenges and opportunities. In this talk, I will discuss some of our group's recent works on long video understanding, for example, visual-language alignment on instructional videos, grounded visual question answering on egocentric videos, retrieval-augmented video understanding, open-world instance tracking within videos, etc. For more information, please check the papers here: https://weidixie.github.io/research.htm

05/07/24 2pm Room: Fry Building, G.09, Woodland Rd, Bristol BS8 1UG: Multimodal Video Understanding and Generation - Dr Mike Z. Shou (National University of Singapore)

Abstract: Exciting progress has been made in multimodal video intelligence, including both understanding and generation, these two pillars in video. Despite being promising, several key challenges still remain. In this talk, I will introduce our attempts to address some of them. (1) For understanding, I will share All-in-one, which employs one single unified network for efficient video-language modeling, and EgoVLP, which is the first video-language pre-trained model for egocentric video. (2) For generation, I will introduce our study of efficient video diffusion models (i.e., Tune-A-Video, 4K GitHub stars) and long video generation (MagicAnimate, 10K GitHub stars). (3) Finally, I would like to discuss our recent explorations in unifying understanding and generation, from both the data and modeling perspectives.

21/05/24 2pm Room: 1.11 (Old Council Chamber), Wills Memorial Building, Queen’s Rd, Bristol BS8 1RJ: Learning in Fine-Grained Visual Domains - Oisin Mac Aodha (University of Edinburgh)

Abstract: The visual concepts depicted in images from domains such as medicine, biodiversity monitoring, and biological imaging are inherently “fine-grained”. By this, we mean that distinct visual concepts may appear very similar to the untrained eye. Learning these subtle differences can be very challenging in low data regimes where expert annotations are not easily available. In this talk, I will provide an overview of recent research from my group on this topic. I will present work on learning 3D representations of images without requiring explicit 3D supervision, new methods for automatically discovering visual concepts in data, models for estimating the spatial distribution of fine-grained categories, and ongoing work to develop a new dataset for text-based retrieval of fine-grained visual concepts.

30/04/24 2pm Room: 1.11 (Old Council Chamber), Wills Memorial Building, Queen’s Rd, Bristol BS8 1RJ: Improving model generalization with Generative AI and test-time training - Yannis Kalantidis (NAVER LABS Europe)

Abstract: Creating models that can effectively generalise across various tasks and adapt to test-time domain shifts is crucial. In my talk, I will introduce some of my latest work in improving generalization through Generative AI and test-time training. I’ll explore the intriguing question: “Do we still need real images for learning transferable visual representations?” and present our work that studies the use of synthetic data by training models using only images generated by Generative AI models. Additionally, I’ll demonstrate ways of effectively using these models to simulate test-time shifts such as changes in season, weather, or time of day, particularly for visual localization tasks, in order to improve model robustness to such known test-time shifts. The talk will conclude with our recent work on enhancing the robustness of Video Object Segmentation against test-time distribution shifts through test-time training.

11/04/24 12 noon Room: 1.11 (Old Council Chamber), Wills Memorial Building, Queen’s Rd, Bristol BS8 1RJ: Scalable AI - Giuseppe Fiameni (NVIDIA)

Abstract: Deep Neural Networks (DNNs) have witnessed remarkable advancements across diverse domains, continually scaling in size and training on increasingly expansive datasets. This progression has bestowed upon them the remarkable ability to swiftly adapt to novel tasks with minimal training data. However, the journey of training models comprising tens to hundreds of billions of parameters on vast datasets is anything but straightforward. This lecture is designed to offer comprehensive insights into the intricacies of training and deploying the most expansive neural networks.

27/02/24 2pm Room: online: Mining the Latent. A Tuning-Free Paradigm for Versatile Applications with Diffusion Models - Ye Zhu (Princeton University)

Abstract: Diffusion Models, with their core design of modelling the distribution transition as a stochastic Markovian process, have become state-of-the-art generative models for data synthesis in computer vision. Despite their impressive generation performance, their high training cost has limited the number of research groups able to participate and contribute to this line of work, consequently hindering downstream applications. In this talk, I will present a novel methodological paradigm for leveraging pre-trained diffusion models for versatile applications through a deep understanding of their latent spaces from both theoretical and empirical perspectives. Specifically, we propose several tuning-free methods for data semantic editing [Zhu et al., NeurIPS 2023], data customization [Wang et al., ICLR 2024], and generalized unseen data synthesis [Zhu et al., arXiv 2024], all by mining the unique properties of the latent spaces, showing great potential for versatile applications in an efficient and robust tuning-free manner.

30/01/24 5pm Room: online: Visual Intelligence for Autonomous Driving and Robotics - Yue Wang (University of Southern California & Nvidia)

Abstract: Deep learning has demonstrated considerable success embedding images and more general 2D representations into compact feature spaces for downstream tasks like recognition, registration, and generation. Visual intelligence, however, is the missing piece needed for embodied agents to interact with their surrounding environments. To bridge the gap, my present efforts focus on building better visual intelligence with minimal supervision by leveraging geometry, appearance, motion, and any other cues that are naturally available in sensory inputs. In this talk, I will discuss my opinions and experiences towards building visual intelligence for autonomous driving and robotics. First, I will cover our recent work on designing efficient scene representations with geometry, motion, and semantics via self-supervision. Then, I will talk about our work on denoising Vision Transformer features. Finally, I will discuss our recent efforts using LLMs to inform continuous decision making.

21/11/23 2pm Room: QTIC breakout space, 1st floor, 1 Cathedral Square, Trinity St, Bristol BS1 5TE: Beyond Labels. From self- to language-supervised learning, and the 3D World - Iro Laina (University of Oxford)

Abstract: Over the past few years, significant progress has been made to limit the amount of manual annotations required to train models for computer vision tasks. Self-supervised models and generalist (foundation) models have proven extremely powerful on multiple existing benchmarks and have paved the way towards new applications. In this talk, I will discuss our past and current efforts in the domains of self-supervised and language-supervised learning, focusing on understanding how visual concepts are represented in such models and how we can leverage these to extract meaningful information from images, for example segmenting objects. Finally, I will discuss how semantic information and priors in the 2D domain can be lifted to the 3D domain via neural rendering.

31/10/23 2pm Room: online: Bridging Neural and Symbolic Methods for Spatial AI - Krishna Murthy Jatavallabhula (MIT CSAIL)

Abstract: Modern AI approaches have been on a scaling route: bigger models, datasets, and compute infrastructure have resulted in impressive performance in image and text processing. What do these mean for spatial perception, particularly in a robotics context? In this talk, I will (broadly) address two questions: (a) how do we leverage the advanced capabilities offered by large vision and language models for robot perception? (b) Is scaling enough? What other ingredients are we possibly missing? Much of the talk will revolve around (a), covering our recent work on building open-vocabulary 3D maps that may generally be employed across a broad class of robotics applications. Towards the end, we will also discuss (b), looking at promising avenues in the near- and long-term.

05/09/23 2pm Room: Bill Brown Design suite, side a, Queen’s Building, BS8 1TH: The Need for Budgeted Computation in Continual Learning - Adel Bibi (University of Oxford)

Abstract: The continual learning literature focuses on learning from streams under limited access to previously seen data, with no restriction on the computational budget. In contrast, in this talk we study continual learning under budgeted computation in both offline and online settings. In offline settings, we study, at scale, various previously proposed components, e.g., distillation, sampling strategies, and novel loss functions, for when the computational budget is restricted per time step. Moreover, in the online setting, we consider the computational budget through delayed real-time evaluations. That is to say, continual learners that are twice as expensive to train end up having their model parameters updated half as many times while being evaluated on every stream sample. Our experiments suggest that the majority of current evaluations were not carried out fairly, as they do not account for normalised computation. Surprisingly, simple efficient methods outperform the majority of recently proposed, but computationally involved, algorithms in both online and offline settings. This observation holds across several datasets and experimental settings, i.e., class-incremental, data-incremental, and time-distributed settings. This suggests that evaluations that do not factor in the relative computation between methods can mislead to incorrect conclusions about performance.
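
A minimal sketch of the compute-normalised online protocol described above (the learner interface, the relative_cost factor, and next-batch evaluation are illustrative assumptions, not the paper's code): a learner that is twice as expensive per update only updates on every other stream batch, yet is evaluated on every batch.

```python
def run_stream(stream, learner, relative_cost=2):
    """Online continual learning under a normalised compute budget.

    A learner whose update is `relative_cost` times more expensive than the
    baseline only gets to update once every `relative_cost` batches, but it is
    still evaluated (next-batch prediction) on every batch of the stream.
    Assumes `learner` exposes predict(x) and update(x, y).
    """
    correct, total = 0, 0
    for step, (x, y) in enumerate(stream):
        correct += (learner.predict(x) == y).sum()   # evaluate before training
        total += len(y)
        if step % relative_cost == 0:                 # budgeted update
            learner.update(x, y)
    return correct / total
```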

13/06/23 2pm Room: 1.11 OCC, Wills Memorial Building, BS8 1RJ: The Explanatory Multiverse. Maximising User Agency in Automated Systems - Edward Small (Royal Melbourne Institute of Technology)

Abstract: eXplainable Artificial Intelligence (XAI) is a new but fast-moving area in machine learning. Ultimately, the goal is to improve the experience of using black-box automated systems by allowing users to probe models for explanations of outcomes on three possible levels: explanation of the instance, explanation of the local behaviour, or explanation of the global behaviour. However, what constitutes a good and complete explanation is still an open problem. How do we extract these explanations? What information should be shown to the user, and what should be hidden? And how can we communicate this information effectively to give users true agency within the system? To this end, we are investigating two aspects of XAI. First, are the currently available tools in XAI fit for the layperson, and which types of people are susceptible to bad explanations that can initiate unwarranted trust in a system? (https://arxiv.org/pdf/2303.00934.pdf) Second, given that changing an outcome takes time and effort on the part of the user, how can we maximise the likelihood that a user can achieve a counterfactual, given that a path defined at time t=0 may become infeasible or more challenging at t>0? (https://arxiv.org/pdf/2306.02786.pdf) We therefore introduce the concept of an explanatory multiverse that attempts to capture all possible paths to a desired outcome (or all desired outcomes). We introduce a framework to directly compare the geometry of these paths in order to generate additional paths that maximise user agency under (potentially) imperfect information at t=0.

06/06/23 2pm Room: online: Machine Learning for 3D Content Creation - Jun Gao (University of Toronto)

Abstract: With the increasing demand for creating large-scale 3D virtual worlds in many industries, there is an immense need for diverse and high-quality 3D content. Machine learning is essential to enabling this quest. In this talk, I will discuss how combining differentiable iso-surfacing with differentiable rendering could enable 3D content creation at scale and make real-world impact. Towards this end, we first introduce a differentiable 3D representation based on a tetrahedral grid that enables high-quality recovery of 3D meshes with arbitrary topology. By incorporating differentiable rendering, we further design a generative model capable of producing 3D shapes with complex textures and materials for mesh generation. Our framework further paves the way for innovative high-quality 3D mesh creation from text prompts leveraging 2D diffusion models, which democratizes 3D content creation for novice users.

09/05/23 2pm Room: 1.11 Old Council Chamber, Wills Memorial Building: Inhabiting the virtual - Siyu Tang (ETH Zürich)

Abstract: Simulating human behavior and interactions within various environments is crucial for numerous applications, including generating training data for machine learning algorithms, creating autonomous agents for interactive applications like augmented and virtual reality (AR/VR) or computer games, guiding architectural design decisions, and more. In this talk, I will discuss our previous and ongoing research efforts dedicated to modeling and synthesizing digital humans, with the ultimate goal of enabling them to exhibit spontaneous behavior and move autonomously in a digital environment. A key aspect of our work, which I will highlight during the talk, is the development of the Guided Motion Diffusion model. This approach generates high-quality and diverse human motion based on textual prompts and spatial constraints, such as motion trajectories and obstacles. Through a detailed exploration of our research, I will illustrate how these techniques can be applied to various scenarios, ultimately enriching and enhancing the realism of digital human behaviors across multiple domains.

21/03/23 2pm Room: online: DreamBooth. Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation - Nataniel Ruiz (Boston University)

Abstract: We present a new approach for personalization of text-to-image diffusion models. Given few-shot inputs of a subject, we fine-tune a pretrained text-to-image model to bind a unique identifier with that specific subject such that we can synthesize fully-novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images. We apply our technique to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, appearance modification, and artistic rendering (all while preserving the subject’s key features). We also show diverse new applications of our work undertaken by users that span from creating personalized AI avatars to generating novel art pieces by guiding the network using finetuning.
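
A rough sketch of the prior-preservation idea (assuming a diffusers-style conditional UNet and pre-prepared noisy-latent batches; names such as dreambooth_step and lambda_prior are illustrative, not the released implementation): the usual denoising loss on the subject images, captioned with the unique identifier, is combined with the same loss on class images sampled from the frozen pretrained model.

```python
import torch.nn.functional as F

def dreambooth_step(unet, subject_batch, prior_batch, lambda_prior=1.0):
    """One fine-tuning step with class-specific prior preservation (simplified).

    Each batch is a tuple (noisy_latents, timesteps, text_emb, target_noise):
    subject_batch comes from the user's photos captioned with the rare
    identifier ("a photo of [V] dog"), prior_batch from images sampled by the
    frozen pretrained model with the plain class prompt ("a photo of a dog").
    """
    def denoise_loss(batch):
        noisy, t, text_emb, target = batch
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        return F.mse_loss(pred, target)

    # Subject reconstruction loss + prior-preservation loss on class images.
    return denoise_loss(subject_batch) + lambda_prior * denoise_loss(prior_batch)
```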

21/02/23 2pm Room: online: Learning to see in the wild. Should SSL be truly unsupervised? - Pedro Morgado (University of Wisconsin-Madison)

Abstract: Self-supervised learning (SSL) aims to eliminate one of the major bottlenecks in representation learning - the need for human annotations. As a result, SSL holds the promise of learning representations from data in the wild, i.e., without the need for curated and static datasets. However, can current self-supervised learning approaches be effective in this setup? In this talk, I will show that the answer is no. While learning in the wild, we expect to see a continuous stream of potentially non-IID data. Yet, state-of-the-art approaches struggle to learn from such data distributions. They are inefficient (both computationally and in terms of data complexity), exhibit signs of forgetting, are incapable of modeling dynamics, and result in inferior representation quality. The talk will introduce our recent efforts in tackling these issues.

09/02/23 3pm Room: G.02, 1 Cathedral Square: Self-supervised Learning from images, videos and augmentations - Yuki Asano (University of Amsterdam)

Abstract: In this talk I will discuss pushing the limits of what can be learnt without using any human annotations. After an overview of what self-supervised learning is, we will dive into how clustering can be combined with representation learning using optimal transport ([1] @ ICLR’20), a paradigm still relevant in current SoTA models like SwAV/DINO/MSN. Next, I will show how self-supervised clustering can be used for unsupervised segmentation in images ([2] @ CVPR’22) and for videos (unpublished research). Finally, we analyse one of the key ingredients of self-supervised learning: the augmentations. Here, I will show that it is possible to extrapolate to semantic classes such as those of ImageNet or Kinetics using just a single datum as visual input when combined with strong augmentations and a pretrained teacher ([3] @ ICLR’23).

24/01/23 2pm Room: G.02, 1 Cathedral Square: Getting the Most of Casual Visual Capture - Mohamed Sayed (University College London)

Abstract: When capturing images and video with a camera, there are many ways the capture could be ruined. The camera may be out of focus or be in the wrong place, the images may be blurry, or the subjects of interest could be out of frame. Not only do these errors result in footage of low aesthetic quality, but downstream vision tasks may suffer from reduced accuracy when trying to understand the world through subpar glasses. While the camera operator is usually to blame for these errors, they are not always voluntary. The user may be inexperienced, unable to set the correct settings and accurately position and orient their camera for optimal capture. In other cases, the user is simply preoccupied and unable to focus on deliberate and attention-consuming capture without compromising on another priority - their safety or enjoyment. In this work, we aim to make the most of subpar capture. We first tackle the problem of recovering object detection accuracy under ego-induced motion blur. We then take on the challenge of actively orienting the camera to frame actors for cinematographic filmmaking. Accurate 3D reconstruction requires diverse views, but these views are not always available, so for our third challenge we make the most of views from casual video for accurate depth estimation and mesh reconstruction.

13/12/22 2pm Room: online: Tracking Any Pixel in a Video - Adam Harley (Stanford University)

Abstract: Tracking pixels in videos is typically studied as an optical flow estimation problem, where every pixel is described with a displacement vector that locates it in the next frame. Even though wider temporal context is freely available, prior efforts to take this into account have yielded only small gains over 2-frame methods. In this talk, I will present our methods for re-attacking pixel-level motion estimation, based on the idea of treating pixels like “particles”. In this approach, we assume from the outset that pixels have long-range trajectories, and may change appearance over the course of tracking—much like tiny objects. Instead of producing dense flow fields for pairs of frames, our models produce sparse motion fields which travel across dozens of frames. I will demonstrate the advantages of our particle-based methods over optical flow and feature-matching methods, and show new state-of-the-art results for tracking arbitrary pixels across occlusions.

22/11/22 2pm Room: online: How does textual knowledge break the limitations of the current paradigm of multimodal video understanding and reasoning? - Xudong Lin (Columbia University)

Abstract: Our real life is an ongoing multimodal video that involves multiple modalities such as video, audio, and text. Conventional multimodal video understanding and reasoning methods rely on a three-stage paradigm to build an AI agent that understands multiple modalities and reasons on top of them: modality-specific pretraining for feature extractors, multimodal pretext training for the ability to fuse information from different modalities, and fine-tuning on multimodal downstream tasks. However, this paradigm is greatly limited by two facts: modality-specific pretraining is limited by supervised training or low-quality weak supervision, and multimodal pretext training usually requires millions of image/video-text pairs and huge computational resources. In this talk, we will introduce our latest research on how to break these two limitations of the existing paradigm for multimodal video understanding and reasoning, with the help of rich knowledge from either textual data or the internal knowledge of pretrained text models.

01/11/22 3pm Room: online: Seeing the unseen. Visual content recovery and creation from limited sensory data - Adriana Romero Soriano (Meta AI & McGill University)

Abstract: As humans we never fully observe the world around us, and yet we are able to build remarkably useful models of it from our limited sensory data. Machine learning systems are often required to operate in a similar setup, that is, inferring unobserved information from what is observed. For example, when inferring 3D shape from a single-view image of an object, when reconstructing high-fidelity MR images from a subset of frequency measurements, or when modelling a data distribution from a limited set of data points. These partial observations naturally induce data uncertainty, which may hinder the quality of the model predictions. In this talk, I will present our recent work on content recovery and creation from limited sensory data, which leverages active acquisition strategies and user guidance to improve the model outcomes. Finally, I will briefly present the findings of a systematic assessment of potential vulnerabilities and fairness risks of the models we develop.

20/10/22 3pm Room: OCC(1.11), WMB: Is Classification All You Need for Computer Vision? - Angela Yao (National University of Singapore)

Abstract: Classification and regression are two fundamental tasks of machine learning. The choice between the two usually depends on the categorical or continuous nature of the target output. Curiously, in computer vision, specifically with deep learning, regression-type problems such as depth estimation, age estimation, crowd-counting and pose estimation, often yield better performance when formulated as a classification task. The phenomenon of classification outperforming regression on inherently continuous estimation tasks naturally begs the question – why? In this talk, I will highlight some possible causes based on some task-specific investigations for pose estimation and crowd-counting related to label accuracy and strength of supervision. I will then introduce a more general comparison between classification and regression from a learning point of view. Our findings suggest that the key difference lies in the learned feature spaces from the different losses used in classification versus regression.
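
For concreteness, a generic sketch of reformulating a continuous target (e.g. depth or age) as classification over bins, with expected-value decoding; this is a common recipe rather than the speaker's experimental code, and the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def make_bins(y_min, y_max, num_bins):
    """Evenly spaced bin edges and centres over the target range."""
    edges = torch.linspace(y_min, y_max, num_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return edges, centres

def classification_loss(logits, y, edges):
    # Map each continuous target to its bin index, then use cross-entropy.
    labels = torch.bucketize(y, edges[1:-1])
    return F.cross_entropy(logits, labels)

def decode(logits, centres):
    # Expected value over bin centres recovers a continuous estimate.
    return F.softmax(logits, dim=-1) @ centres
```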

06/09/22 11am Room: OCC(1.11), WMB: Hyperbolic and Hyperspherical Visual Understanding - Pascal Mettes (University of Amsterdam)

Abstract: Visual recognition by deep learning thrives on examples but commonly ignores broader available knowledge about hierarchical relations between classes. My team focuses on the question of how to integrate hierarchical and broader inductive knowledge about categorization into deep networks. In this talk, I will dive into a few of our recent works that integrate knowledge through hyperbolic and hyperspherical geometry. As a starting point, I will briefly outline what hyperbolic geometry entails, as well as its potential for visual representation learning. I will then outline how to enable the use of hyperbolic geometry for video understanding with hierarchical prior knowledge [CVPR’20]. As a follow-up, I will discuss Hyperbolic Image Segmentation, where we generalize hyperbolic learning to the pixel level with hierarchical knowledge, which opens multiple new doors in segmentation [CVPR’22]. Beyond learning with hierarchical knowledge, I will also revisit a classical inductive bias, namely maximum separation between classes, and show that, contrary to recent literature, this inductive bias is not an optimization problem but has a closed-form hyperspherical solution [Preprint’22]. The solution takes the form of one fixed matrix and only requires a single line of code to add to your network, yet directly boosts categorization, long-tailed recognition, and open-set recognition. The talk concludes with a short overview of other related works from our team and the future potential of hyperbolic and hyperspherical learning for computer vision.
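
The fixed matrix can be illustrated with the standard regular-simplex construction below (a simplified sketch; the paper's own closed-form recursion may differ in detail): C unit prototypes in C-1 dimensions with pairwise cosine -1/(C-1), the maximum possible separation, used as frozen class vectors.

```python
import numpy as np

def max_separation_prototypes(num_classes: int) -> np.ndarray:
    """C maximally separated unit vectors in R^(C-1) (regular simplex vertices)."""
    C = num_classes
    # Centred one-hot vectors lie in the (C-1)-dim hyperplane sum(x) = 0.
    V = np.eye(C) - np.full((C, C), 1.0 / C)
    # Orthonormal basis of that hyperplane (singular vectors with non-zero values).
    U, _, _ = np.linalg.svd(V)
    P = V @ U[:, : C - 1]                      # coordinates of each class prototype
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    return P                                    # pairwise cosine = -1 / (C - 1)

prototypes = max_separation_prototypes(10)      # fixed, never trained
# logits = embeddings @ prototypes.T            # embeddings: (batch, C-1)
```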

29/07/22 2pm Room: 1CS G.02: Visitors from KAUST - Mattia Soldan, Mengmeng Xu and Humam Alwassel

Talk 1: Humam Alwassel Title: Self-Supervised Learning by Cross-Modal Audio-Video Clustering Abstract: Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
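
A heavily simplified sketch of the cross-modal supervision idea (illustrative only: the real XDC alternates deep clustering and training across both modalities; the encoder and classifier here are assumed callables and the cluster count is arbitrary):

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def xdc_pseudo_labels(audio_features, num_clusters=256):
    """Cluster one modality (audio) to obtain pseudo-labels for the other (video).

    audio_features: (num_clips, dim) CPU torch tensor of audio embeddings.
    """
    kmeans = KMeans(n_clusters=num_clusters, n_init=10)
    labels = kmeans.fit_predict(audio_features.numpy())
    return torch.as_tensor(labels).long()

def video_step(video_encoder, classifier, video_clips, pseudo_labels):
    """Supervise the video branch with the audio-derived cluster assignments."""
    logits = classifier(video_encoder(video_clips))
    return F.cross_entropy(logits, pseudo_labels)
```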

Talk 2: Mattia Soldan Title: Connecting Language and Video to enable Semantic Video Search Abstract: This talk will review the importance that video content has gained in the digital era and give an intuition about the relevance of video search systems. The Video Language Grounding (VLG) task will be proposed as a research direction for building a human-friendly language-based search algorithm able to understand the semantic content of both language and video data and reason about their interactions. We will discuss VLG-Net, a recent Deep Learning architecture that leverages Graph Neural Networks for bridging the two modalities and learning video-language fine-grained alignment. Furthermore, I will introduce MAD, a novel large-scale dataset and benchmark for the VLG task, which solves legacy datasets’ shortcomings while introducing a new long-form setup, opening up new challenges and opportunities for the research community.

Talk 3: Mengmeng (Frost) Xu Title: End-to-End Video Encoder Pre-training for Temporal Action Localization Abstract: Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder – trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this talk, I will introduce a novel low-fidelity (LoFi) pre-training method and a proposal/gradient sampling method. Instead of always using the full training configurations for TAL learning, those methods reduce the mini-batch composition so that end-to-end optimization for the video encoder becomes operable under the memory conditions of a mid-range hardware budget. Crucially, these enable the gradient to flow backward through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations.

19/04/22 Room: : Cross-domain Action Recognition from Multiple Information Channels - Chiara Plizzari

Abstract: Recognizing human actions from videos has been one of the most critical challenges in computer vision since its infancy. An open challenge is that video analysis systems heavily rely on the environment in which the activities are recorded, which inhibits their ability to recognize actions recorded in unfamiliar surroundings or under different lighting conditions. This problem is known in the literature as domain shift and has recently started to attract attention in the egocentric action classification community as well. Most researchers in the field have addressed this issue by reducing the problem to an unsupervised domain adaptation (UDA) setting, where an unlabeled set of samples from the target is available during training. However, the UDA scenario is not always realistic, because the target domain might not be known a priori or because accessing target data at training time might be costly (or plainly impossible). To this end, she will present a work which also aims to address the so-called Domain Generalization (DG) setting, which consists in learning a representation able to generalize to any unseen domain, regardless of the possibility of accessing target data at training time. Taking inspiration from recent works on self-supervised audio-video processing, she will show how to solve auxiliary tasks across various information channels from videos in a way that makes the solution of such tasks consistent across information channels and gains robustness from it. Additionally, she will also focus on the possibility of using event data in combination with the standard RGB modality. Indeed, event cameras are novel bio-inspired sensors which asynchronously capture pixel-level intensity changes in the form of “events”. Due to their sensing mechanism, event cameras have little to no motion blur, a very high temporal resolution, and require significantly less power and memory than traditional frame-based cameras. These characteristics make them a perfect fit for several real-world applications such as egocentric action recognition on wearable devices, where fast camera motion and limited power challenge traditional vision sensors. Finally, she will also present skeleton sequences as an alternative information stream for action recognition. She will present a work on skeleton-based action recognition, introducing a new model to extract effective information from joint motion patterns and their correlations. In particular, she will present the Spatial-Temporal Transformer network, a new architecture to model dependencies between joints using the Transformer self-attention operator.

29/03/22 Room: : Reconstructing Generic (hand-held or isolated) Objects - Shubham Tulsiani

Abstract: We observe and interact with a myriad of objects in our everyday lives, from cups and bottles to hammers and tennis rackets. In this talk, I will describe two recent projects aimed at inferring the 3D structure of such generic objects from a single RGB image. Towards reconstructing hand-held objects, I will describe a method that can leverage the cues provided by hand articulation, e.g., we grasp a pen differently from a bottle. By learning an implicit reconstruction network that infers pointwise SDFs conditioned on articulation-aware coordinates and pixel-aligned features, I will show that we can reconstruct arbitrary hand-held objects, going beyond the common assumption of known templates when understanding hand-object interaction. I will then focus on scaling 3D prediction to a large set of categories, and show how we can learn 3D prediction without 3D supervision. While recent approaches have striven to similarly learn 3D from category-level segmented image collections, they typically learn independent category-specific models from scratch, often relying on adversarial or template-based priors to regularize learning. I will present a simpler and more scalable alternative: learning a unified model across 150 categories while using synthetic 3D data on some categories to help regularize learning for others.

22/02/22 12pm Room: online: 3D object spatial mapping from videos - Kejie Li

Abstract: Localising objects and estimating their geometry in 3D is an important step towards high-level 3D scene understanding, which has many applications in Augmented Reality and Robotics. In this talk, I will present our approaches to 3D object mapping with different focuses. Assuming a relatively simple and static environment, FroDO (CVPR 2020) is a first-of-its-kind framework to localise and reconstruct the detailed shape of multiple objects in a scene given a posed RGB video sequence. We later developed MOLTR (RAL and ICRA 2021) to remove the assumption of a static environment. After realising that data association is the bottleneck in the pipeline, we proposed ODAM (ICCV 2021), which focuses on associating object detections from different frames using an attentional graph neural network. The volume and location of objects are further refined using multi-view optimisation.

08/02/22 2pm Room: online: Image Segmentation with Semantic Equivariance for Unsupervised Adaptation and Tracking - Nikita Araslanov

Abstract: The high accuracy of modern semantic segmentation models hinges on expensive high-quality dense annotation. Therefore, designing unsupervised objectives to learn semantic representations is of high practical relevance. This talk will focus on one principle towards this goal: semantic equivariance. The underlying idea is to exploit equivariance of the semantic maps to similarity transformations of the input image. We will consider specific implementations and extensions of this technique in three problem domains. First, we will take a look at unsupervised domain adaptation, where we adapt our model, trained on annotated synthetic data, to unlabelled real-world images. In the second example of leveraging equivariance, we will develop an approach to substantially improve model generalisation. In this setting, there is no target distribution available for model adaptation as before, but only a single datum from that distribution. A third example will present an unsupervised learning framework for extracting dense and semantically meaningful object-level correspondences from unlabelled videos. Here, we will exploit the equivariance to sidestep trivial solutions while learning dense semantic representations efficiently. We will highlight some of the limitations common to the discussed methods, and will conclude the presentation with an outlook on follow-up research directions.
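
The equivariance principle can be sketched in a few lines, here with a horizontal flip as the similarity transform (illustrative only; the discussed methods use richer transforms and additional machinery, and seg_model is an assumed segmentation network returning per-pixel class logits):

```python
import torch
import torch.nn.functional as F

def equivariance_loss(seg_model, images):
    """Encourage f(T(x)) to agree with T(f(x)) for a horizontal flip T."""
    with torch.no_grad():
        target = torch.flip(seg_model(images), dims=[-1])   # T(f(x)) as pseudo-target
    pred = seg_model(torch.flip(images, dims=[-1]))         # f(T(x))
    return F.kl_div(F.log_softmax(pred, dim=1),
                    F.softmax(target, dim=1), reduction="batchmean")
```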

18/01/22 2pm Room: online: Vision-language reasoning from coarse categories to fine-grained tokens using sparsely annotated and loosely aligned multimodal data - Bryan Plummer

Abstract: A key challenge in training more discriminative vision-language models is the lack of fully supervised datasets. Only a few language annotations are made for images or videos out of the nearly infinite variations that may be applicable, and many annotations may be missing or only loosely associated with their visual counterparts. In this talk I will discuss how we can learn from these datasets by identifying when and where language annotations are mutually exclusive, so that we can sample better negatives and train more discriminative models. I will also introduce novel cross-modal attention mechanisms that enable effective optimization of a contrastive loss during training. I will close by briefly discussing some challenges in training large multimodal models, especially as model size continues to grow.

14/12/21 2pm Room: online: Dynamic Neural Networks for Efficient Video Inference - Rameswar Panda

Abstract: Most existing deep neural networks for video understanding rely on one-size-fits-all models, where the exact same fixed set of features are extracted for all inputs or configurations, no matter their complexity. In contrast, humans dynamically allocate time and scrutiny for perception tasks - for example, a single glimpse is sufficient to recognize most simple actions (e.g., “Sleeping”), whereas more time and attention is required to clearly understand complex ones (e.g., “Pulling two ends of something so that it gets stretched”). In this talk, I will discuss some of our recent works on dynamic neural networks, specifically designed for efficient video understanding, which can adaptively adjust computation depending on the input videos. First, I will present a method that learns to select optimal precision conditioned on the input, while taking both accuracy and efficiency into account in recognizing complex actions. Second, I will show how a similar dynamic approach can be extended to make multimodal video recognition more computationally efficient. Finally, I will conclude the talk discussing other ongoing work on efficient vision transformers and few open research problems in dynamic neural networks.

30/11/21 2pm Room: online: Detecting Actions in Videos via Graph Convolutional Networks - Chen Zhao

Abstract: Detecting the time duration when an action happens in a video is an important and fundamental task in video understanding. It is the key problem for various applications, such as extracting highlights in sports and identifying anomalous behaviors in surveillance videos, and it is also an important subtask for other tasks such as video language grounding and video captioning. Recent years have witnessed rapid progress in performance as various methods based on convolutional neural networks (CNNs) have been proposed. However, CNNs face some limitations when exploring the properties of videos. For example, correlations between non-consecutive frames cannot be directly utilized, and large variations in action duration are not effectively handled. To address these challenges, we introduce graph convolutional networks (GCNs), which were previously mostly used for non-Euclidean data, to the task of temporal action detection. In this talk, I will present several methods to model video data as graphs and to aggregate information from different parts of the video data via graph convolution. I will demonstrate how GCNs effectively model correlations between long-distance frames and lead to better detection performance, and show how GCNs enable establishing multi-scale correlations and benefit short action detection. I will also discuss the potential of our GCN models to apply to different tasks in video understanding.
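
As a reminder of the basic operation involved, a generic graph-convolution sketch over video snippet features (not any specific model from the talk; the adjacency construction described in the comment is an assumption):

```python
import numpy as np

def graph_conv(H, A, W):
    """One GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).

    H: (num_snippets, feat_dim) snippet features
    A: (num_snippets, num_snippets) adjacency, e.g. temporal neighbours plus
       pairs of snippets with high feature similarity (long-range links)
    W: (feat_dim, out_dim) learnable weights
    """
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)
```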

19/10/21 2pm Room: online: Towards Robust Representation Learning and Beyond - Cihang Xie

Abstract: Deep learning has transformed computer vision in the past few years. As fueled by powerful computational resources and massive amounts of data, deep networks achieve compelling, sometimes even superhuman, performance on a wide range of visual benchmarks. Nonetheless, these success stories come with bitterness—deep networks are vulnerable to adversarial examples. The existence of adversarial examples reveals that the computations performed by the current deep networks are dramatically different from those by human brains, and, on the other hand, provides opportunities for understanding and improving these models. In this talk, I will first show that the vulnerability of deep networks is a much more severe issue than we thought—the threats from adversarial examples are ubiquitous and catastrophic. Then I will discuss how to equip deep networks with robust representations for defending against adversarial examples. We approach the solution from the perspective of neural architecture design, and show incorporating architectural elements like feature-level denoisers or smooth activation functions can effectively boost model robustness. The last part of this talk will rethink the value of adversarial examples. Rather than treating adversarial examples as a threat to deep networks, we take a further step on uncovering that adversarial examples can help deep networks substantially improve their generalization ability.
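
For readers unfamiliar with adversarial examples, a standard single-step FGSM sketch (a textbook construction, not the specific attacks or defences studied in the talk; epsilon is an arbitrary perturbation budget):

```python
import torch
import torch.nn.functional as F

def fgsm(model, images, labels, epsilon=8 / 255):
    """Fast Gradient Sign Method: a one-step L-infinity adversarial perturbation."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()     # step in the gradient direction
    return adv.clamp(0, 1).detach()                 # keep pixels in a valid range
```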

05/10/21 2pm Room: online: Learning Sight, Sound, and Language from Videos - Andrew Rouditchenko

Abstract: In this talk, I will describe our recent progress in multimodal learning from video, audio, and text. First, I will introduce a model for audio-visual learning from instructional videos which can relate spoken words and sounds to visual content. We propose a cascaded approach to learning multilingual representations by leveraging a model trained on English HowTo100M videos and applying it to Japanese cooking videos. This improves retrieval performance nearly 10x compared to training solely on the Japanese videos. Next, I will present a model for jointly learning from video, audio, and text that enforces a grouping of semantically similar instances in addition to sharing representations across different modalities. The training pipeline extends instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.

21/09/21 3pm Room: online: Multimodal Learning for Creative Video Applications - Fabian Caba Heilbron

Abstract: Watching and creating videos is a multimodal experience. To understand video, one needs to reason about the movements on screen, the meaning of a speech, and the sound of objects. In this talk, Fabian will discuss two recent works addressing inherently multimodal video problems. First, he will present novel architectures for speaker identification in videos that leverage Spatio-temporal audiovisual context. Then, he will share his quest and latest works on learning the anatomy of video editing. To conclude, he will open up the discussion with exciting frontiers in this rapidly evolving research space.

08/06/21 Room: online: Recent Advances in Vision-Language Pre-training - Zhe Gan

Abstract: With the advent of models such as OpenAI CLIP and DALL-E, transformer-based vision-language pre-training has become an increasingly hot research topic. In this talk, I will share some of our recent work in this field that is published in NeurIPS 2020, ECCV 2020 and EMNLP 2020. Specifically, I will answer the following questions. First, how to perform vision-language pre-training? Second, how to understand what has been learned in the pre-trained models? Third, how to enhance the performance of pre-trained models via adversarial training? And finally, how can we extend image-text pre-training to video-text pre-training? Accordingly, I will present UNITER, VALUE, VILLA and HERO to answer these four questions. At last, I will also briefly discuss the challenges and future directions for vision-language pre-training.

18/05/21 Room: online: Deep Photometric Stereo for Non-Lambertian Surfaces - Kenneth Wong

Abstract: In this talk, we will introduce our recently proposed deep neural networks for solving the photometric stereo problem for non-Lambertian surfaces. Traditional approaches often adopt simplified reflectance models and constrained setups to make the problem more tractable, but this greatly hinders their applications on real-world objects. We propose a deep fully convolutional network, named PS-FCN, that predicts a normal map of an object from an arbitrary number of images captured under different light directions. Compared with other learning based methods, PS-FCN does not depend on a pre-defined set of light directions and can handle multiple images in an order-agnostic manner. To tackle uncalibrated light directions, we propose a deep convolutional neural network, named LCNet, that predicts per-image light direction and intensity from both local and global features extracted from all input images. We analyze the features learned by LCNet and find they resemble attached shadows, shadings, and specular highlights, which are known to provide clues in resolving GBR ambiguity. Based on this insight, we propose a guided calibration network, named GCNet, that explicitly leverages object shape and shading information for improved lighting estimation. Experiments on synthetic and real data will be presented to demonstrate the effectiveness of our proposed networks.
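
The order-agnostic handling of an arbitrary number of inputs can be sketched as max-pooling over per-image features (a simplified illustration of the idea; encoder and regressor are assumed modules, not the released PS-FCN code):

```python
import torch
import torch.nn as nn

class OrderAgnosticFusion(nn.Module):
    """Encode each (image, light-direction) observation, then max-pool across them."""
    def __init__(self, encoder: nn.Module, regressor: nn.Module):
        super().__init__()
        self.encoder = encoder          # shared weights for every observation
        self.regressor = regressor      # maps fused features to a normal map

    def forward(self, observations):    # list of tensors, one per light direction
        feats = torch.stack([self.encoder(obs) for obs in observations])
        fused, _ = feats.max(dim=0)     # permutation-invariant aggregation
        normals = self.regressor(fused)
        return nn.functional.normalize(normals, dim=1)   # unit-length normals
```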

23/03/21 Room: online: Autonomous robot manipulation for planetary and terrestrial applications - Renaud Detry

Abstract: In this talk, I will first discuss the experimental validation of autonomous robot behaviors that support the exploration of Mars’ surface, lava tubes on Mars and the Moon, icy bodies and ocean worlds, and operations on orbit around the Earth. I will frame the presentation with the following questions: What new insights or limitations arise when applying algorithms to real-world data as opposed to benchmark datasets or simulations? How can we address the limitations of real-world environments—e.g., noisy or sparse data, non-i.i.d. sampling, etc.? What challenges exist at the frontiers of robotic exploration of unstructured and extreme environments? I will discuss our approach to validating autonomous machine-vision capabilities for the notional Mars Sample Return campaign, for autonomously navigating lava tubes, and for autonomously assembling modular structures on orbit. The talk will highlight the thought process that drove the decomposition of a validation need into a collection of tests conducted on off-the-shelf datasets, custom/application-specific datasets, and simulated or physical robot hardware, where each test addressed a different range of experimental parameters for sensing/actuation fidelity, breadth of environmental conditions, and breadth of jointly-tested robot functions. Next, I will present a task-oriented grasp model that encodes grasps that are configurationally compatible with a given task. The model consists of two independent agents: First, a geometric grasp model that computes, from a depth image, a distribution of 6D grasp poses for which the shape of the gripper matches the shape of the underlying surface. The model relies on a dictionary of geometric object parts annotated with workable gripper poses and preshape parameters. It is learned from experience via kinesthetic teaching. The second agent is a CNN-based semantic model that identifies grasp-suitable regions in a depth image, i.e., regions where a grasp will not impede the execution of the task. The semantic model allows us to encode relationships such as “grasp from the handle.” A key element of this work is to use a deep network to integrate contextual task cues, and defer the structured-output problem of gripper pose computation to an explicit (learned) geometric model. Jointly, these two models generate grasps that are mechanically fit, and that grip the object in a way that enables the intended task.

16/02/21 Room: online: From Interacting Hands to Expressive and Interacting Humans - Dimitris Tzionas

Abstract: A long-term goal of computer vision and artificial intelligence is to develop human-centred AI that perceives humans in their environments and helps them accomplish their tasks. For this, we need holistic 3D scene understanding, namely modelling how people and objects look, estimating their 3D shape and pose, and inferring their semantics and spatial relationships. For humans and animals this perceptual capability seems effortless, however, endowing computers with similar capabilities has proven to be hard. Fundamentally, the problem involves observing a scene through cameras, and inferring the configuration of humans and objects from images. Challenges exist at all levels of abstraction, from the ill-posed 3D inference from noisy 2D images, to the semantic interpretation of it. The talk will discuss several projects (IJCV’16, TOG’17, CVPR’19, ICCV’19, ECCV’20) that attempt to understand, formalize, and model increasingly complex cases of human-object interactions. These cases range from interacting hands to expressive and interacting whole-body humans. More specifically, the talk will present novel statistical models of the human hand and the whole body, and the usage of these models (1) to efficiently regularize 3D reconstruction from monocular 2D images and eventually (2) to build statistical models of interactions. The presented models are freely available for research purposes.

02/02/21 3pm Room: online: 3D Photography and Videography - Jia-Bin Huang

Abstract: Images and videos allow us to capture and share memorable moments of our lives. However, 2D images and videos appear flat due to the lack of depth perception. In this talk, I will present our recent efforts to overcome these limitations. Specifically, I will cover our recent work for creating compelling 3D photography, estimating consistent video depth for advanced video-based visual effects, and free-viewpoint videos. I will conclude the talk with some ongoing research and research challenges ahead.

Other relevant meetings:


BristolMaVi       uob-mavi-group@bristol.ac.uk
