DirectML now supports Phi-3 mini and medium, plus Mistral v0.2, enhancing AI scalability on Windows. With Activation-Aware Quantization, developers can run these models on more devices with minimal accuracy loss.
DirectML’s Leap Forward with Quantization
DirectML has introduced support for Phi-3 mini, marking a significant advancement in the scalability of AI models on Windows. This update not only improves performance but also makes advanced models accessible to a broader range of developers.
What’s New?
The introduction of quantized versions of Phi-3 mini, along with expanded support for Phi-3 medium and Mistral v0.2, represents a major update. In addition, a new Gradio interface built on the ONNX Runtime Generate() API simplifies testing these models.
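As a rough illustration, running one of these models locally with the `onnxruntime-genai` Python package looks something like the sketch below. The model path is a placeholder, and the exact API surface varies between package versions, so treat this as an orientation rather than a definitive recipe.

```python
import onnxruntime_genai as og

# Placeholder path: a folder containing an exported ONNX model plus its
# genai_config.json (e.g. a quantized Phi-3 mini build).
model = og.Model("Phi-3-mini-4k-instruct-onnx")
tokenizer = og.Tokenizer(model)

# Phi-3 uses a chat-style prompt template.
prompt = "<|user|>\nWhat does DirectML do?<|end|>\n<|assistant|>"
tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokens

# Generate and decode the completion.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))
```

A Gradio front end would simply wrap this generate-and-decode call in a `gr.Interface` or chat component so prompts can be tried interactively.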
Major Updates
Developers now have access to pre-quantized models, including 4k and 128k context-length variants, improving both performance and accessibility.
Importance of Quantization
Quantization addresses the challenge of memory bandwidth in running models on entry-level and older hardware. By reducing model size, it significantly widens the range of devices capable of supporting complex language models.
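To make the memory win concrete, here is a minimal NumPy sketch (an illustration, not DirectML's actual implementation) of symmetric 4-bit weight quantization, showing roughly the 4x size reduction relative to fp16 that lets larger models fit on entry-level hardware:

```python
import numpy as np

def quantize_int4(weights):
    # One scale per output row (per-channel), mapping values into [-8, 7].
    scale = np.abs(weights).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256), dtype=np.float32)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)

# fp16 stores 2 bytes per weight; packed int4 stores half a byte,
# plus a small overhead for the per-row scales.
fp16_bytes = w.size * 2
int4_bytes = w.size // 2 + s.size * 2
print(f"~{fp16_bytes / int4_bytes:.1f}x smaller")  # → ~3.9x smaller
print(f"max reconstruction error: {np.abs(w - w_hat).max():.3f}")
```

The reconstruction error is bounded by half the per-row scale, which is why naive round-to-nearest works reasonably well but still benefits from smarter schemes like AWQ on the weights that matter most.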
“Our goal is to ensure scalability, while also maintaining model accuracy.” – Jacques van Rhyn
This approach not only aids in overcoming hardware limitations but also ensures minimal impact on model accuracy through Activation-Aware Quantization (AWQ).
Understanding AWQ
AWQ is a pivotal technique in quantization, preserving model accuracy while achieving memory efficiency. It quantizes 99% of a model's weights while safeguarding the top 1% that are most critical to accuracy.
“Thanks to the significant memory wins from AWQ, Phi-3-mini runs at this speed or faster on older discrete GPUs and even laptop integrated GPUs.” – Patrice Vignola
Perplexity Measurements: A Closer Look
Perplexity scores play a crucial role in evaluating how well a quantized model preserves the original's predictions. A lower score indicates higher certainty in the model's predictions, reflecting closer alignment with the true data distribution.
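Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to the true next tokens. A toy computation makes the "lower is better" intuition clear:

```python
import math

def perplexity(token_probs):
    # token_probs: probability the model assigned to each correct token.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = [0.9, 0.8, 0.95, 0.85]  # model usually assigns the true token high probability
uncertain = [0.2, 0.1, 0.25, 0.15]  # model is frequently surprised
print(perplexity(confident))  # ≈ 1.15
print(perplexity(uncertain))  # ≈ 6.0
```

In quantization work, comparing the perplexity of the quantized model against the full-precision baseline on the same text is a standard way to quantify accuracy loss.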
With these enhancements, DirectML is set to revolutionize the way developers approach model scalability and performance on Windows. The integration of quantization techniques like AWQ not only broadens the accessibility of AI models but also ensures a balance between efficiency and accuracy.
As we continue to witness advancements in AI and machine learning, DirectML’s commitment to innovation remains a beacon for developers looking to push the boundaries of what’s possible on Windows platforms.
- DirectML’s latest update introduces support for Phi-3 mini and medium, alongside Mistral v0.2, broadening model accessibility.
- Activation-Aware Quantization (AWQ) significantly reduces model size while preserving accuracy, enabling performance on entry-level hardware.
- Developers can access pre-quantized models for easier implementation, including different variants for specific needs.
- A new Gradio interface and the ONNX Runtime Generate() API are available for streamlined model testing.
- Quantization efforts aim to democratize AI by ensuring models run effectively on a wider range of devices, including older GPUs.
From the Windows Blog