Cross-domain Action Recognition from Multiple Information Channels - Chiara Plizzari
posted on 19 April, 2022


Abstract: Recognizing human actions from videos has been one of the core challenges of computer vision since the field's infancy. An open problem is that video analysis systems rely heavily on the environment in which activities are recorded, which hampers their ability to recognize actions filmed in unfamiliar surroundings or under different lighting conditions. This problem is known in the literature as domain shift, and it has recently started to attract attention in the egocentric action classification community as well. Most researchers in the field have addressed the issue by reducing it to an unsupervised domain adaptation (UDA) setting, where an unlabeled set of samples from the target domain is available during training. However, the UDA scenario is not always realistic, because the target domain might not be known a priori, or because accessing target data at training time might be costly or plainly impossible. To this end, she will present work that also addresses the so-called Domain Generalization (DG) setting, which consists in learning a representation able to generalize to any unseen domain, regardless of whether target data can be accessed at training time. Taking inspiration from recent work on self-supervised audio-video processing, she will show how to solve auxiliary tasks across the various information channels of a video in a way that keeps their solutions consistent across channels, and how robustness is gained from this consistency.

Additionally, she will focus on the possibility of using event data in combination with the standard RGB modality. Event cameras are novel bio-inspired sensors that asynchronously capture pixel-level intensity changes in the form of "events". Due to their sensing mechanism, event cameras exhibit little to no motion blur, offer very high temporal resolution, and require significantly less power and memory than traditional frame-based cameras. These characteristics make them a perfect fit for several real-world applications, such as egocentric action recognition on wearable devices, where fast camera motion and limited power challenge traditional vision sensors.

Finally, she will present skeleton sequences as an alternative information stream for action recognition, introducing a new model that extracts effective information from joint motion patterns and their correlations. In particular, she will present the Spatial-Temporal Transformer network, a new architecture that models dependencies between joints using the Transformer self-attention operator.
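To give a concrete feeling for the cross-channel idea, the sketch below shows one way such an objective could be set up: each modality solves the same self-supervised auxiliary task (here, a generic classification task such as predicting the temporal order of shuffled clips), and a consistency term pushes the two predictions to agree. This is a minimal illustration under assumed names and a KL-based coupling, not the exact formulation presented in the talk.

```python
import torch
import torch.nn.functional as F

def cross_channel_consistency(pred_rgb, pred_audio, labels):
    """Auxiliary-task losses for two information channels plus a
    consistency term tying their solutions together (illustrative).

    pred_rgb, pred_audio: (B, C) logits for the same auxiliary task,
    each computed from a single modality. labels: (B,) task labels.
    """
    # Each channel must solve the auxiliary task on its own ...
    loss_rgb = F.cross_entropy(pred_rgb, labels)
    loss_audio = F.cross_entropy(pred_audio, labels)
    # ... and their predictive distributions are pushed to agree,
    # so the learned features cannot lean on channel-specific,
    # domain-dependent shortcuts.
    consistency = F.kl_div(
        F.log_softmax(pred_rgb, dim=1),
        F.softmax(pred_audio, dim=1),
        reduction="batchmean",
    )
    return loss_rgb + loss_audio + consistency
```

A term of this kind needs no action labels, which is part of what makes such auxiliary objectives attractive when target annotations are scarce or unavailable.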
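Since an event stream is a sparse list of (timestamp, x, y, polarity) tuples rather than a sequence of frames, a common preprocessing step for combining events with RGB is to accumulate them into a voxel grid with a few temporal bins. The function below is a sketch of that standard practice, not necessarily the exact pipeline discussed in the talk.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate an event stream into a (num_bins, H, W) voxel grid.

    events: (N, 4) array of (t, x, y, polarity) rows, polarity in
    {-1, +1}. Timestamps are rescaled so each event contributes its
    polarity to one temporal bin.
    """
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    # Rescale timestamps to bin indices in [0, num_bins - 1].
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    bins = t_norm.astype(int)
    # Signed accumulation: ON and OFF events cancel where both occur.
    np.add.at(voxel, (bins, y, x), p)
    return voxel
```

Production pipelines often refine this with bilinear interpolation of each event across neighboring bins, but even this hard-binned (num_bins, H, W) tensor can be fed to a frame-based network alongside RGB input.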
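On the spatial side, the core of the Spatial-Temporal Transformer idea is that within each frame the joints play the role of tokens and attend to one another, so inter-joint dependencies are learned from data rather than fixed by the skeleton's kinematic graph. Below is a minimal sketch of such a spatial self-attention layer; the tensor layout, class name, and dimensions are illustrative assumptions rather than the published ST-TR code.

```python
import torch
import torch.nn as nn

class SpatialJointAttention(nn.Module):
    """Self-attention over the joints of each skeleton frame.

    Every joint attends to every other joint, so dependencies between
    body parts (e.g. hands and head in a drinking action) can be
    captured even when the joints are far apart on the skeleton graph.
    """
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (batch, frames, joints, channels) joint features
        b, t, v, c = x.shape
        tokens = x.reshape(b * t, v, c)   # joints become the tokens
        out, _ = self.attn(tokens, tokens, tokens)
        out = self.norm(tokens + out)     # residual + layer norm
        return out.reshape(b, t, v, c)
```

For example, with 25-joint skeletons and 64-channel joint features, `SpatialJointAttention(64)(torch.randn(2, 16, 25, 64))` returns a tensor of the same shape; `channels` must be divisible by `num_heads`.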