Abstract: Deep learning has demonstrated considerable success in embedding images and more general 2D representations into compact feature spaces for downstream tasks such as recognition, registration, and generation. Visual intelligence, however, remains the missing piece that embodied agents need in order to interact with their surrounding environments. To bridge this gap, my current efforts focus on building better visual intelligence with minimal supervision by leveraging geometry, appearance, motion, and other cues naturally available in sensory inputs. In this talk, I will share my perspective on and experience with building visual intelligence for autonomous driving and robotics. First, I will cover our recent work on designing efficient scene representations that capture geometry, motion, and semantics via self-supervision. Then, I will present our work on denoising Vision Transformer features. Finally, I will discuss our recent efforts to use LLMs to inform continuous decision making.