Microsoft’s Phi Silica has been enhanced with multimodal functionality, allowing it to understand both text and images on Windows devices. Leveraging the Florence image encoder, this innovation improves accessibility by generating detailed image descriptions for users with visual impairments. The integration is efficient, minimizing resource use while expanding the capabilities of local AI models.

Unlocking New Dimensions: Multimodal Functionality for Phi Silica
Microsoft is making waves with its latest update to Phi Silica, its on-device small language model (SLM). The update introduces vision-based multimodal capabilities, enhancing accessibility and productivity for users.
What’s New?
The integration of vision capabilities allows Phi Silica to understand images alongside text. This multimodal functionality is built directly into Windows, utilizing the NPU on Copilot+ PCs. As Vivek Pradeep, VP at Microsoft, states:
“This additional dimension unlocks new possibilities for local SLMs on Windows.”
By adding image understanding, Microsoft is not only enhancing user experience but also ensuring that the model remains resource-efficient.
Major Updates
One of the standout features is the ability to generate high-quality Alt Text for images, which is particularly beneficial for users with visual impairments. The new model can produce descriptions at varying levels of detail, making it easier for screen readers like Microsoft Narrator to convey content accurately.
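As an illustration, a calling pattern for requesting the two detail levels might look like the sketch below. Every name in it (the prompts, generate_alt_text, run_local_model) is a hypothetical placeholder for this sketch, not part of any documented Windows API:

```python
# Illustrative sketch only: none of these names belong to the actual
# Windows API surface; the model call is stubbed so the example runs.

DETAIL_PROMPTS = {
    "short": "Describe this image in one brief sentence.",
    "detailed": "Describe this image thoroughly for a screen reader user.",
}

def run_local_model(prompt: str, image: bytes) -> str:
    # Stub standing in for the NPU-accelerated multimodal model.
    return f"[model output for: {prompt!r}]"

def generate_alt_text(image_bytes: bytes, detail: str = "short") -> str:
    # Select a prompt for the requested level of detail and query the model.
    return run_local_model(DETAIL_PROMPTS[detail], image_bytes)

if __name__ == "__main__":
    fake_png = b"..."  # placeholder image bytes
    print(generate_alt_text(fake_png, detail="short"))
    print(generate_alt_text(fake_png, detail="detailed"))
```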
Moreover, the system leverages existing components, minimizing the need for additional resources. The upgrade adds only a small 80-million-parameter projector model on top of the existing pieces, a modest overhead compared to deploying a separate vision-language model.
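To put that number in perspective, here is a back-of-the-envelope calculation of the projector’s weight footprint under common precisions (the post does not say which precision ships on-device, so these are assumptions):

```python
# Back-of-the-envelope weight footprint of an 80-million-parameter
# projector under common precisions (assumed; the shipped precision
# is not stated in the post).
PARAMS = 80_000_000

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    megabytes = PARAMS * bytes_per_param / (1024 ** 2)
    print(f"{name}: ~{megabytes:.0f} MB")
# fp32: ~305 MB, fp16: ~153 MB, int8: ~76 MB, int4: ~38 MB
```

Even at full fp32 precision, the added weights stay in the low hundreds of megabytes.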
What’s Important to Know?
Understanding the architecture behind this update is crucial. The vision encoder, known as Florence, extracts visual features from images. These features are then fed into a modality projector model that produces embeddings compatible with Phi Silica.
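Conceptually, the flow can be sketched as below. The layer widths, patch count, and two-layer MLP design are illustrative assumptions (a common pattern for modality projectors), not the published Phi Silica architecture:

```python
# Conceptual sketch of the pipeline: image -> Florence-style vision
# encoder -> modality projector -> embeddings in the SLM's input space.
# All shapes and layer sizes here are illustrative assumptions.
import torch
import torch.nn as nn

VISION_DIM = 1024  # assumed width of the vision encoder's features
MODEL_DIM = 3072   # assumed embedding width of the language model

class ModalityProjector(nn.Module):
    """Maps visual features into the language model's embedding space."""
    def __init__(self, vision_dim: int, model_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, model_dim),
            nn.GELU(),
            nn.Linear(model_dim, model_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        return self.net(visual_features)

# One feature vector per image patch, as produced by the vision encoder.
visual_features = torch.randn(1, 576, VISION_DIM)   # (batch, patches, dim)
projector = ModalityProjector(VISION_DIM, MODEL_DIM)
image_embeddings = projector(visual_features)       # (1, 576, MODEL_DIM)
# These embeddings are interleaved with text-token embeddings and fed
# to the language model as if they were ordinary input tokens.
print(image_embeddings.shape)
```

Note that in this design only the projector is new: the vision encoder and language model are reused as-is, which is what keeps the added footprint small.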
Additionally, the training process is optimized for efficiency. Unlike many competing models, which encode multiple crops of each image, the system requires only one crop of the input image. This not only speeds up processing but also reduces memory usage.
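The difference is easy to picture. In the sketch below, the 448×448 target resolution and the 2×2 tiling grid are assumptions chosen only to make the contrast concrete:

```python
# Illustrative contrast between single-crop and multi-crop preprocessing.
# The 448x448 target resolution and 2x2 grid are assumptions.
from PIL import Image

TARGET = (448, 448)

def single_crop(image: Image.Image) -> list[Image.Image]:
    # One resized view -> one encoder forward pass.
    return [image.resize(TARGET)]

def multi_crop(image: Image.Image, grid: int = 2) -> list[Image.Image]:
    # Many models tile the image into a grid plus a global view,
    # multiplying encoder passes and memory use.
    w, h = image.size
    tiles = [
        image.crop((x * w // grid, y * h // grid,
                    (x + 1) * w // grid, (y + 1) * h // grid)).resize(TARGET)
        for y in range(grid) for x in range(grid)
    ]
    return tiles + [image.resize(TARGET)]

img = Image.new("RGB", (1920, 1080))
print(len(single_crop(img)), "encoder pass vs", len(multi_crop(img)), "passes")
```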
As a result, users can expect fast and accurate image descriptions. A short description typically takes about four seconds to generate, while a detailed description takes around seven seconds.
“The multimodal functionality for Phi Silica enhances the description of screen contents for people who are blind or have low vision.”
In conclusion, the multimodal capabilities of Phi Silica represent a significant leap forward in accessibility and user experience. With future updates planned to support additional languages, Microsoft is committed to making technology more inclusive for everyone.
From the Windows Blog