Dynamic Neural Networks for Efficient Video Inference - Rameswar Panda
posted on 14 December 2021


Abstract: Most existing deep neural networks for video understanding rely on one-size-fits-all models, where the exact same fixed set of features is extracted for all inputs and configurations, regardless of their complexity. In contrast, humans dynamically allocate time and scrutiny to perception tasks: for example, a single glimpse is sufficient to recognize most simple actions (e.g., “Sleeping”), whereas more time and attention are required to clearly understand complex ones (e.g., “Pulling two ends of something so that it gets stretched”). In this talk, I will discuss some of our recent work on dynamic neural networks designed specifically for efficient video understanding, which adaptively adjust their computation depending on the input video. First, I will present a method that learns to select the optimal precision conditioned on the input, taking both accuracy and efficiency into account when recognizing complex actions. Second, I will show how a similar dynamic approach can be extended to make multimodal video recognition more computationally efficient. Finally, I will conclude the talk by discussing other ongoing work on efficient vision transformers and a few open research problems in dynamic neural networks.
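To make the idea of input-conditioned computation concrete, below is a minimal PyTorch sketch, not the exact method from the talk: a lightweight policy network picks one of several precision branches per input using a straight-through Gumbel-softmax, and the training loss trades classification accuracy against the expected compute cost. All names here (PrecisionPolicy, DynamicPrecisionNet, the toy branches, and the relative cost values) are illustrative assumptions, not the talk's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrecisionPolicy(nn.Module):
    """Lightweight network that scores candidate precision levels per input."""
    def __init__(self, in_channels, num_precisions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, num_precisions)

    def forward(self, x):
        logits = self.head(self.features(x).flatten(1))
        # Hard one-hot choice in the forward pass, soft gradients in backward.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)

class DynamicPrecisionNet(nn.Module):
    """Routes each input to one branch; branches stand in for precision levels."""
    def __init__(self, backbones, costs):
        super().__init__()
        self.policy = PrecisionPolicy(in_channels=3, num_precisions=len(backbones))
        self.backbones = backbones                           # one branch per precision
        self.register_buffer("costs", torch.tensor(costs))   # relative compute cost

    def forward(self, x):
        choice = self.policy(x)                               # (B, P), one-hot rows
        # For clarity every branch is run and gated by the one-hot choice;
        # a deployed system would dispatch only the selected branch.
        outs = torch.stack([b(x) for b in self.backbones], dim=1)  # (B, P, C)
        logits = (choice.unsqueeze(-1) * outs).sum(dim=1)          # (B, C)
        expected_cost = (choice * self.costs).sum(dim=1).mean()    # scalar
        return logits, expected_cost

# Toy usage: two branches standing in for a full-precision and a low-bit model.
branches = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                  nn.Flatten(), nn.Linear(8, 10))
    for _ in range(2)
])
model = DynamicPrecisionNet(branches, costs=[1.0, 0.25])
x = torch.randn(4, 3, 32, 32)
logits, cost = model(x)
labels = torch.randint(0, 10, (4,))
loss = F.cross_entropy(logits, labels) + 0.1 * cost   # accuracy + efficiency
loss.backward()
```

In a real system the branches would be the same backbone quantized to different bit-widths rather than separate networks, and the same gating idea plausibly carries over to the multimodal setting, where the policy decides which modalities (e.g., RGB frames vs. audio) are worth processing for a given video.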