
Introducing Turing Bletchley v3: A Multilingual Vision-Language Foundation Model
The Turing team at Microsoft is thrilled to unveil Turing Bletchley v3, the latest iteration of our multilingual vision-language foundation models. This advanced model, which comprehends over 90 languages, is now integrated into multiple Microsoft products.
What’s New in Turing Bletchley v3
Building on the success of Turing Bletchley v1, the v3 model brings significant improvements in datasets, modeling, and pretraining, all aimed at making search across languages more performant and accessible.
“Turing Bletchley v3 enables performant image search across more than 90 languages.”
Understanding Vision-Language Models
Like humans, vision-language models use multiple senses to understand the world. In this case, the senses are vision and language. These models can take in both images and text, making them incredibly versatile in applications such as image search.
How Does It Work?
Turing Bletchley v3 processes either an image or a snippet of text and outputs a vector. The model is trained so that the vectors for an image and a text snippet land “close” together in the embedding space when the two are semantically related. This makes it a powerful tool for image search.
“A vision-language model like Turing Bletchley can take either images or text and embed them as points in a vector space.”
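To make the idea concrete, here is a minimal sketch of how embeddings like these could power image search. The `image_embeddings` and `query_embedding` values are toy stand-ins; in a real system they would come from the model's image and text encoders, which are not shown here.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for model outputs: in practice these vectors would be
# produced by the shared vision-language encoder.
image_embeddings = {
    "dog_photo.jpg": np.array([0.9, 0.1, 0.0]),
    "cat_photo.jpg": np.array([0.1, 0.9, 0.0]),
}
query_embedding = np.array([0.8, 0.2, 0.0])  # e.g. embedding of the text "a dog"

# Rank candidate images by similarity to the query vector; the most
# semantically related image surfaces first.
ranked = sorted(
    image_embeddings.items(),
    key=lambda kv: cosine_similarity(query_embedding, kv[1]),
    reverse=True,
)
print(ranked[0][0])  # the closest image to the query
```

Because both images and text live in the same vector space, a query written in any of the supported languages can retrieve images this way, without language-specific search logic.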
Pretraining: The Key to Improved Efficiency
One significant enhancement in Turing Bletchley v3 is the use of masked multimodal learning. This training paradigm, developed by the Turing team and MSRA (Microsoft Research Asia), enables efficient multimodal learning. It involves masking some of the words in a caption that describes an image and training a neural network to predict the hidden words from both the image and the surrounding text.
Such a strategy, coupled with a large transformer-based model, results in a robust pretrained model that can be fine-tuned for various downstream tasks. This approach has achieved a new state of the art on many English-only vision-language tasks.
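The masking step described above can be sketched as follows. This is an illustrative toy, not the team's actual pipeline: the mask ratio, the `[MASK]` token, and the `mask_caption` helper are assumptions for the example, and the transformer that predicts the hidden words from image plus text is omitted.

```python
import random

MASK = "[MASK]"  # placeholder token, assumed for this sketch

def mask_caption(tokens, mask_ratio=0.3, seed=0):
    """Randomly replace a fraction of caption tokens with [MASK].

    Returns the masked tokens plus the (position, original word) pairs
    that the model would be trained to predict from the image and the
    remaining visible text.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = []
    for pos in positions:
        targets.append((pos, masked[pos]))
        masked[pos] = MASK
    return masked, targets

caption = "a brown dog runs on the beach".split()
masked, targets = mask_caption(caption)
# During pretraining, a transformer would see the image together with the
# masked caption and be optimized to recover each target word.
```

Forcing the network to use the image to fill in missing words is what ties the two modalities together: text alone is often not enough to recover a masked word like "dog", so the visual signal must be learned.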
With Turing Bletchley v3, Microsoft continues to push the boundaries of technology, making search capabilities more efficient, inclusive, and user-friendly.
From the Bing Blogs