Abstract: Pixel-based generative models now excel at producing visually compelling images and videos, yet they often struggle to preserve underlying physical properties such as shape, motion, and material. Bridging visual and physical modeling could unlock tremendous opportunities across real-world engineering applications, from robotics, interactive VR, design, and manufacturing to scientific domains such as biological and medical analysis. Inferring physically grounded 3D motion and dynamics from pixels is thus a key stepping stone toward these goals. In this talk, I will present our recent efforts to develop data-driven methods for recovering 3D shape and motion from 2D images and videos, in both unsupervised and supervised settings. The resulting model instantly turns a single image of a natural object, such as wildlife, into an animatable 3D asset in a feed-forward fashion, enabling efficient and controllable 3D animation for entertainment and analysis.