Visual Pre-training with Limited Synthetic Data - Hirokatsu Kataoka, AIST and University of Oxford
posted on 29 April 2025


Abstract: It has long been standard practice to acquire learned representations from “human-annotated labels” on “large amounts of real data”, and then to fine-tune the trained models for each visual task. However, large-scale datasets raise recurring problems: (i) biased datasets can lead to, e.g., gender and racial discrimination; (ii) access to public datasets has been suspended because of offensive labels; and (iii) ethically problematic images are mixed into large-scale collections. As long as large-scale datasets of real images are used, these problems persist. A recent study reveals that a learning strategy with very few real images and supervision derived from a mathematical formula can nevertheless acquire a representation of how to see the real world. Moreover, a model pre-trained on artificially generated data outperformed ImageNet-21k pre-training and was found to be more robust. It is therefore clear that self-supervised learning (SSL), formula-driven supervised learning (FDSL), and synthetic training in very limited data settings can also yield DNN models with high accuracy and safety. In the era of foundation models, while these critical issues remain unresolved, there is growing attention on how pre-training can be achieved with very limited data, whether it is possible with synthetic images or generative models without any real images, and how adaptation can be carried out with zero/one/few-shot or very limited data. Although these topics have not yet attracted much attention in the computer vision field, they deserve focus, since they are expected to provide a means to replace learning with real data and to resolve ethical issues in the future. (Ref.: https://hirokatsukataoka16.github.io/CVPR-2024-LIMIT/)
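
As a concrete illustration of the FDSL idea, the sketch below renders images from a random iterated function system (IFS) and labels each image by the index of the parameter set that generated it, so the entire labeled dataset is created without any real images or human annotation. This is a minimal sketch under stated assumptions: the function names, parameter ranges, and rendering choices are illustrative, not the actual FractalDB pipeline.

```python
# Minimal FDSL sketch (assumed setup, not the authors' FractalDB code):
# a "category" is one set of 2D affine maps, an image is rendered from it
# via the chaos game, and the label is simply the category index.
import numpy as np

def sample_ifs(num_maps=4, rng=None):
    """Sample one category: a set of contractive 2D affine maps (A, b)."""
    rng = rng or np.random.default_rng()
    maps = []
    for _ in range(num_maps):
        A = rng.uniform(-1.0, 1.0, size=(2, 2))
        A *= 0.8 / max(np.linalg.norm(A, 2), 0.8)  # cap spectral norm at 0.8
        b = rng.uniform(-1.0, 1.0, size=2)
        maps.append((A, b))
    return maps

def render_fractal(maps, size=64, num_points=20000, rng=None):
    """Render a binary image by iterating randomly chosen maps (chaos game)."""
    rng = rng or np.random.default_rng()
    img = np.zeros((size, size), dtype=np.uint8)
    x = np.zeros(2)
    for i in range(num_points):
        A, b = maps[rng.integers(len(maps))]
        x = A @ x + b
        if i > 20:  # skip burn-in before the orbit reaches the attractor
            px = np.clip(((x + 4.0) / 8.0 * size).astype(int), 0, size - 1)
            img[px[1], px[0]] = 255
    return img

# A tiny synthetic labeled dataset: class k is defined by IFS parameter set k.
rng = np.random.default_rng(0)
categories = [sample_ifs(rng=rng) for _ in range(10)]
dataset = [(render_fractal(c, rng=rng), k) for k, c in enumerate(categories)]
print(len(dataset), dataset[0][0].shape)  # 10 labeled 64x64 images, no real data
```

In the published FDSL work, categories are additionally filtered for usable patterns and intra-category instances are produced by perturbing the formula parameters; a standard image classifier is then pre-trained on the resulting synthetic labeled images before fine-tuning on real downstream tasks.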