Microsoft has launched the AI Infrastructure on Azure repository, offering ready-to-use, optimized blueprints for building large-scale AI supercomputers. This resource centralizes best practices, health checks, storage guidance, and benchmarks to help teams deploy high-performance AI clusters efficiently on Azure.

Microsoft Launches AI Infrastructure on Azure Repository
Microsoft just dropped a game-changer for AI teams building massive clusters on Azure. The new AI Infrastructure on Azure repository acts as a one-stop shop for deploying large-scale AI supercomputers. It bundles orchestration, storage, and compute setups into ready-to-run blueprints, which means less guesswork and a faster path to peak performance.
What’s New?
The repository offers Infrastructure as Code (IaC) templates that have been battle-tested for reliability and speed. It includes a canonical Slurm-managed HPC cluster template using Azure NetApp Files and Azure Managed Lustre for high-performance storage. Plus, it comes with node- and cluster-level health checks and end-to-end benchmark results.
“This repo’s purpose is to centralize our battle-tested configurations and optimization guidance.”
These blueprints help teams avoid weeks of trial and error. Instead, they get a solid baseline to build on and compare their setups against Azure’s published benchmarks.
Major Updates & Design Considerations
Building AI supercomputers is complex. The repo covers key design factors like VM family choices (ND GB200 v6 vs ND H200 v5), storage strategies, container runtimes, and orchestration tools like Slurm or Kueue. It even dives into software stack tuning, checkpointing frequency, and network fabric health.
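These design dimensions can be pictured as a single configuration object that is validated before deployment. The sketch below is purely illustrative: the class, field names, and option strings are assumptions for this article, not the repo's actual schema or API.

```python
from dataclasses import dataclass


# Hypothetical sketch of the design space discussed above; field names
# and allowed values are illustrative, not the repository's schema.
@dataclass
class ClusterDesign:
    vm_family: str                  # e.g. "ND_GB200_v6" or "ND_H200_v5"
    orchestrator: str               # "slurm" or "kueue"
    storage_backend: str            # e.g. "amlfs", "anf", or "blobfuse2"
    checkpoint_interval_steps: int  # how often training state is saved

    def validate(self) -> None:
        """Reject obviously inconsistent choices before any deployment."""
        if self.orchestrator not in {"slurm", "kueue"}:
            raise ValueError(f"unknown orchestrator: {self.orchestrator}")
        if self.storage_backend not in {"amlfs", "anf", "blobfuse2"}:
            raise ValueError(f"unknown storage backend: {self.storage_backend}")
        if self.checkpoint_interval_steps <= 0:
            raise ValueError("checkpoint interval must be positive")


design = ClusterDesign("ND_H200_v5", "slurm", "amlfs", 500)
design.validate()
```

Encoding the choices this way makes the trade-offs explicit and reviewable before any expensive cluster is stood up.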
Storage optimization is a big focus. For example, checkpoint speed directly impacts GPU utilization. The repo offers guidance on using Azure Managed Lustre tiers and BlobFuse2 with Azure Blob Storage—capable of 25 Tbps egress bandwidth—to maximize throughput.
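The link between checkpoint speed and GPU utilization is simple arithmetic: if checkpoint writes stall training, the fraction of wall-clock time they consume is time lost. A minimal back-of-envelope model, with hypothetical sizes and bandwidths (not figures from the repo):

```python
def checkpoint_overhead(ckpt_size_gb: float,
                        write_gbps: float,
                        interval_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints, assuming
    synchronous writes that stall training until the write completes."""
    write_time_s = ckpt_size_gb / write_gbps
    return write_time_s / (interval_s + write_time_s)


# Hypothetical example: a 2 TB checkpoint written every 30 minutes.
# At 10 GB/s the write takes 200 s, stalling GPUs ~10% of the time;
# at 100 GB/s it takes 20 s, cutting the loss to ~1%.
slow = checkpoint_overhead(ckpt_size_gb=2000, write_gbps=10, interval_s=1800)
fast = checkpoint_overhead(ckpt_size_gb=2000, write_gbps=100, interval_s=1800)
```

The model ignores asynchronous or sharded checkpointing, which the real guidance may cover, but it shows why storage throughput translates directly into GPU utilization.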
“Validating cluster readiness before large-scale training runs helps catch system issues early.”
Health checks are baked in, using AzureHPC Node Health Checks (AzNHC) and NCCL benchmarks to verify that GPUs and inter-node communication are performing as expected.
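One way such checks pay off is as a pre-flight gate: compare each node's measured NCCL bus bandwidth against a fleet expectation and drain outliers before a long training run. The helper and numbers below are illustrative assumptions, not the repo's tooling or real benchmark output.

```python
def flag_slow_nodes(busbw_gbps: dict[str, float],
                    expected_gbps: float,
                    tolerance: float = 0.10) -> list[str]:
    """Return nodes whose measured NCCL bus bandwidth falls more than
    `tolerance` below the expected value -- candidates to drain and
    investigate before a large-scale training run."""
    floor = expected_gbps * (1 - tolerance)
    return sorted(node for node, bw in busbw_gbps.items() if bw < floor)


# Illustrative per-node measurements (GB/s), not real data.
measured = {"node-001": 470.0, "node-002": 468.5, "node-003": 390.2}
suspect = flag_slow_nodes(measured, expected_gbps=475.0)
# node-003 falls below the 475 * 0.9 = 427.5 GB/s floor.
```

Catching one underperforming link this way is far cheaper than discovering it mid-run as a mysterious slowdown across the whole job.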
Benchmarks and Real-World Examples
Microsoft’s benchmarks show near-linear scaling from 8 to 1,024 ND H100 v5 GPUs, matching NVIDIA DGX systems. The repo also includes real-world training examples like Megatron-LM GPT-175B and LLM Foundry MPT models, focused on Azure CycleCloud with Slurm orchestration.
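"Near-linear scaling" has a precise meaning: achieved throughput at scale divided by the ideal linear extrapolation from a small baseline. A short sketch of that calculation, using made-up throughput numbers purely to illustrate the formula (not the published benchmark results):

```python
def scaling_efficiency(base_gpus: int, base_throughput: float,
                       scaled_gpus: int, scaled_throughput: float) -> float:
    """Scaling efficiency: measured throughput at scale relative to the
    ideal linear extrapolation from the baseline measurement."""
    ideal = base_throughput * (scaled_gpus / base_gpus)
    return scaled_throughput / ideal


# Illustrative numbers only: if 8 GPUs sustain 100 samples/s and
# 1,024 GPUs sustain 12,160 samples/s, efficiency is
# 12160 / (100 * 128) = 0.95, i.e. 95% of perfect linear scaling.
eff = scaling_efficiency(8, 100.0, 1024, 12160.0)
```

Comparing your own clusters against Azure's published curves with a metric like this is exactly the baseline comparison the repo is meant to enable.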
Why It Matters
For AI researchers and engineers, this repo slashes setup time and boosts confidence in cluster performance. It’s a huge step toward democratizing access to supercomputer-grade AI infrastructure on Azure.
What’s Next?
Microsoft plans to expand the repo with more scenarios, orchestration options, and best practices. Contributions and feedback are welcome to keep it evolving.
Check out the AI Infrastructure on Azure repo to supercharge your AI workloads with proven, scalable Azure HPC blueprints.
From the New blog articles in Microsoft Community Hub