DeepMind Claims Image Captioner Alone Is Surprisingly Powerful then Previous Believed, Competing with CLIP

By Storm Warden · March 16, 2026 · 1 min read

ai
computer vision & graphics
machine learning & data science
research
ai

In a new paper Image Captioners Are Scalable Vision Learners Too, a DeepMind research team presents CapPa, a image captioning based pretraining strategy that and can compete CLIP and exhibit favorable model and data scaling properties, verifying that a plain image captioning can be a competitive pretraining strategy for vision backbones.