Microsoft’s Turing Bletchley v3: Revolutionizing Image Search with Multilingual Understanding Across 90 Languages

Microsoft’s Turing team unveils Turing Bletchley v3, a multilingual vision-language foundation model used across Microsoft products. The latest model in the Turing Bletchley series understands more than 90 languages and enables efficient image search. It takes either an image or a snippet of text as input and outputs a vector, allowing improved search across languages.

Introducing Turing Bletchley v3: A Multilingual Vision-Language Foundation Model

The Turing team at Microsoft is thrilled to unveil Turing Bletchley v3, the latest iteration of our multilingual vision-language foundation models. This advanced model, which comprehends over 90 languages, is now integrated into multiple Microsoft products.

What’s New in Turing Bletchley v3

Building on the success of the earlier Turing Bletchley models, v3 brings significant improvements in datasets, modeling, and pretraining. The goal? To make search across languages more performant and accessible.

“Turing Bletchley v3 enables performant image search across more than 90 languages.”

Understanding Vision-Language Models

Like humans, vision-language models use multiple senses to understand the world. In this case, the senses are vision and language. These models can take in both images and text, making them incredibly versatile in applications such as image search.

How Does It Work?

Turing Bletchley v3 processes either an image or a snippet of text and outputs a vector. The aim is for the vectors of an image and a text snippet to be “close” when they are semantically related, which makes the model a powerful tool for image search.

“A vision-language model like Turing Bletchley can take either images or text and embed them as points in a vector space.”
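Turing Bletchley v3 itself is not available as a public library, so the following is only a minimal sketch of the embed-and-compare pattern described above. The embed_image and embed_text functions are hypothetical stand-ins for the model’s encoders; in practice each would return the model’s output vector for the given input.

```python
import numpy as np

def embed_image(image_path: str) -> np.ndarray:
    """Hypothetical stand-in for the model's image encoder.
    A real encoder would map the image to a fixed-size vector."""
    rng = np.random.default_rng(abs(hash(image_path)) % (2**32))
    return rng.standard_normal(768)

def embed_text(text: str) -> np.ndarray:
    """Hypothetical stand-in for the model's multilingual text encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(768)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Closeness of two vectors; higher means more semantically related."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Index a small image collection: one vector per image.
image_paths = ["beach.jpg", "mountain.jpg", "city.jpg"]
index = {path: embed_image(path) for path in image_paths}

# Embed the query text (here in Spanish: "a beach at sunset") into the same
# vector space and rank images by how close their vectors are to the query.
query_vector = embed_text("una playa al atardecer")
ranked = sorted(index, key=lambda p: cosine_similarity(query_vector, index[p]),
                reverse=True)
print(ranked)
```

In a real deployment, the image vectors would be precomputed and stored in a vector index, and the query text, in any of the supported languages, would be embedded at search time and compared against that index.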

Pretraining: The Key to Improved Efficiency

One significant enhancement in Turing Bletchley v3 is the use of masked multimodal learning. This new training paradigm, developed by the Turing team and Microsoft Research Asia (MSRA), enables efficient multimodal learning. It involves masking some words in a caption describing an image and training a neural network to predict the hidden words based on both the image and the remaining text.

Such a strategy, coupled with a large transformer-based model, results in a robust pretrained model that can be fine-tuned for various downstream tasks. This approach achieved a new state of the art on many English-only vision-language tasks.
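The training code for Turing Bletchley v3 is not public, so the snippet below is only a schematic sketch of the masked multimodal objective described above, using a toy transformer, made-up dimensions, and random tensors in place of real image features and captions: a fraction of the caption tokens are replaced with a mask token, the image features are fed to the network alongside the masked text, and the loss is computed only at the masked positions.

```python
import torch
import torch.nn as nn

# Toy dimensions; these are illustrative assumptions, not the real model's.
vocab_size, d_model, mask_id = 1000, 256, 0
batch, text_len, num_img_patches = 2, 16, 4

token_embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
to_vocab = nn.Linear(d_model, vocab_size)

# Stand-ins for a real batch: image patch features and caption token ids.
image_features = torch.randn(batch, num_img_patches, d_model)
caption_ids = torch.randint(1, vocab_size, (batch, text_len))

# Mask roughly 15% of the caption tokens; the model must recover them.
mask = torch.rand(batch, text_len) < 0.15
mask[:, 0] = True  # ensure at least one masked position per example
masked_ids = caption_ids.clone()
masked_ids[mask] = mask_id

# Feed image features and the masked caption through one transformer so the
# prediction can attend to both the image and the surrounding words.
text_embeddings = token_embedding(masked_ids)
fused = torch.cat([image_features, text_embeddings], dim=1)
hidden = encoder(fused)

# Predict vocabulary logits at the text positions only, and compute the
# cross-entropy loss restricted to the masked positions.
logits = to_vocab(hidden[:, num_img_patches:, :])
loss = nn.functional.cross_entropy(logits[mask], caption_ids[mask])
loss.backward()
print(float(loss))
```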

With Turing Bletchley v3, Microsoft continues to push the boundaries of technology, making search capabilities more efficient, inclusive, and user-friendly.

  • Turing Bletchley v3 is the latest version of Microsoft’s multilingual multimodal foundation models.
  • The model can understand and process information in more than 90 languages.
  • It takes as input either an image or a snippet of text and outputs a vector in a high-dimensional space.
  • The model is trained so that the vectors of semantically related images and text end up “close” to each other in the vector space.
  • Turing Bletchley v3 uses a new training paradigm called masked multimodal learning, improving its performance significantly compared to the previous version.