Abstract: We observe and interact with a myriad of objects in our everyday lives, from cups and bottles to hammers and tennis rackets. In this talk, I will describe two recent projects aimed at inferring the 3D structure of such generic objects from a single RGB image. Towards reconstructing hand-held objects, I will describe a method that leverages the cues provided by hand articulation (e.g., we grasp a pen differently from a bottle). By learning an implicit reconstruction network that infers pointwise SDFs conditioned on articulation-aware coordinates and pixel-aligned features, I will show that we can reconstruct arbitrary hand-held objects, going beyond the common assumption of known templates when understanding hand-object interaction. I will then focus on scaling 3D prediction to a large set of categories, and show how we can learn 3D prediction without 3D supervision. While recent approaches have striven to similarly learn 3D from category-level segmented image collections, they typically learn independent category-specific models from scratch, often relying on adversarial or template-based priors to regularize learning. I will present a simpler and more scalable alternative: learning a unified model across 150 categories, while using synthetic 3D data for some categories to help regularize learning for others.
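
To make the conditioning idea concrete, below is a minimal, hypothetical PyTorch sketch of an implicit SDF head that takes, for each query point, its coordinates expressed in per-joint hand frames (an "articulation-aware" encoding) together with a pixel-aligned image feature sampled at the point's projection. This is not the authors' implementation; the class name, dimensions, and tensor layout are illustrative assumptions.

```python
# Hypothetical sketch of an articulation-conditioned implicit SDF network.
# All names (ArticulationAwareSDF, num_joints, feat_dim) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArticulationAwareSDF(nn.Module):
    def __init__(self, num_joints: int = 16, feat_dim: int = 64, hidden: int = 256):
        super().__init__()
        # Per query point: xyz expressed in each hand-joint frame (3 * num_joints)
        # concatenated with a pixel-aligned feature sampled from an image encoder.
        in_dim = 3 * num_joints + feat_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # signed distance to the object surface
        )

    def forward(self, points, joint_transforms, feat_map, uv):
        # points:           (B, N, 3)   query points in camera/world space
        # joint_transforms: (B, J, 4, 4) world-to-joint rigid transforms (hand pose)
        # feat_map:         (B, C, H, W) feature map from an image encoder
        # uv:               (B, N, 2)   projections of the points, in [-1, 1]
        B, N, _ = points.shape
        homo = torch.cat([points, torch.ones_like(points[..., :1])], dim=-1)  # (B, N, 4)
        # Express each point in every joint's local frame ("articulation-aware" coords).
        local = torch.einsum('bjkl,bnl->bnjk', joint_transforms, homo)[..., :3]  # (B, N, J, 3)
        local = local.reshape(B, N, -1)
        # Bilinearly sample pixel-aligned features at each point's 2D projection.
        pix = F.grid_sample(feat_map, uv.unsqueeze(2), align_corners=True)  # (B, C, N, 1)
        pix = pix.squeeze(-1).permute(0, 2, 1)  # (B, N, C)
        return self.mlp(torch.cat([local, pix], dim=-1)).squeeze(-1)  # (B, N) SDF values


# Example usage with random tensors.
model = ArticulationAwareSDF()
sdf = model(
    torch.randn(2, 1024, 3),
    torch.eye(4).repeat(2, 16, 1, 1),
    torch.randn(2, 64, 32, 32),
    torch.rand(2, 1024, 2) * 2 - 1,
)
print(sdf.shape)  # torch.Size([2, 1024])
```

Because the network only sees joint-relative coordinates and locally sampled image features, it does not need a category-specific object template, which is what allows reconstruction of arbitrary hand-held objects under this kind of conditioning.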