Vision-language reasoning from coarse categories to fine-grained tokens using sparsely annotated and loosely aligned multimodal data - Bryan Plummer
posted on 18 January, 2022


Abstract: A key challenge in training more discriminative vision-language models is the lack of fully supervised datasets. Of the nearly infinite variations in language that could describe an image or video, only a few annotations are ever made, and many of those are missing or only loosely associated with their visual counterparts. In this talk I will discuss how we can learn from these datasets by identifying when and where language annotations are known to be mutually exclusive, which lets us sample better negatives and train more discriminative models. I will also introduce novel cross-modal attention mechanisms that enable effective optimization of a contrastive loss during training. I will close by briefly discussing some challenges in training large multimodal models, especially as model size continues to grow.
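To make the negative-sampling idea in the abstract concrete, here is a minimal sketch (not the speaker's actual method) of a contrastive image-text loss in PyTorch where only annotation pairs known to be mutually exclusive are treated as negatives; the function name, the `exclusive_mask` input, and the temperature value are all assumptions for illustration.

```python
# Hypothetical illustration: a contrastive loss over image/text embeddings
# where only pairs verified to be mutually exclusive serve as negatives,
# so loosely aligned annotations are never pushed apart by mistake.
import torch
import torch.nn.functional as F

def masked_contrastive_loss(img_emb, txt_emb, exclusive_mask, temperature=0.07):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of paired samples.
    exclusive_mask: (N, N) boolean, True where annotation j is known to be
    mutually exclusive with image i and can safely act as a negative."""
    logits = img_emb @ txt_emb.t() / temperature                  # (N, N) similarities
    pos = torch.eye(len(logits), dtype=torch.bool, device=logits.device)
    keep = pos | exclusive_mask                                   # positives + verified negatives
    logits = logits.masked_fill(~keep, float('-inf'))             # drop ambiguous pairs
    targets = torch.arange(len(logits), device=logits.device)     # matching pair is on the diagonal
    return F.cross_entropy(logits, targets)
```

In this sketch, pairs that are neither the matching annotation nor verified negatives are simply excluded from the softmax, which is one straightforward way to avoid penalizing annotations that might still apply to an image.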