How to Build an Automated Recovery Pipeline for Slurm-Managed GPU Clusters on Azure with Real-Time Diagnostics and Microsoft Teams Alerts

Discover how to build an automated recovery pipeline for GPU clusters managed by Slurm on Azure. This guide covers detecting job failures, running health diagnostics, automating node reboots, and integrating Microsoft Teams alerts to ensure high availability and quick response in large-scale HPC environments.

Automating GPU Cluster Recovery with Slurm on Azure: Part 1

Managing large GPU clusters means dealing with inevitable job failures. Microsoft’s latest guide dives into building an automated recovery pipeline using Slurm on Azure. The system not only detects failures but also kicks off health checks and recovery workflows, all without manual intervention.

What’s New: Automated Failure Detection and Recovery

The core innovation here is an automated pipeline that monitors Slurm job failures in real time. When a job fails, the system identifies unhealthy nodes, runs diagnostics, and triggers reboots if needed. It then returns healthy nodes to the cluster to keep downtime to a minimum.

“This system detects job failures, identifies unhealthy nodes, performs reboot-based remediation, and reintegrates healthy nodes back into the cluster.”

Using Azure CycleCloud and Slurm’s Epilog feature, the pipeline hooks into job completion events. It runs scripts that check exit codes and node health, then launches Python orchestrators to handle recovery steps.
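As a rough illustration of the exit-code check, here is a minimal Python sketch. It assumes the Epilog environment exposes SLURM_JOB_EXIT_CODE as a raw wait()-style status, which depends on the Slurm version and on which epilog hook is used, so check that against your own setup. The function names are hypothetical and are not taken from the original guide.

```python
import os
from typing import Mapping, Optional, Tuple


def decode_wait_status(status: int) -> Tuple[int, int]:
    """Split a wait()-style status into (exit_code, signal_number)."""
    exit_code = (status >> 8) & 0xFF  # high byte: exit code of the process
    signal_number = status & 0x7F     # low 7 bits: terminating signal, if any
    return exit_code, signal_number


def job_failed(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the job's exit status indicates a failure."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_JOB_EXIT_CODE")  # assumption: set by Slurm for the epilog
    if raw is None:
        return False  # no status available, so nothing to act on
    try:
        exit_code, signal_number = decode_wait_status(int(raw))
    except ValueError:
        return False  # unexpected format; leave the decision to an operator
    return exit_code != 0 or signal_number != 0


if __name__ == "__main__":
    if job_failed():
        job_id = os.environ.get("SLURM_JOB_ID", "unknown")
        print(f"Job {job_id} failed; starting recovery")
```

A job that exits with code 1 typically shows up as status 256, while a job killed by signal 9 shows up as status 9, which is why both halves of the status are checked.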

Major Updates: Epilog Scripts and Health Checks

Slurm’s Epilog script runs after every job finishes and checks whether the exit code signals failure. If so, it extracts the list of nodes involved and triggers recovery scripts such as cluster_health_orchestrator.py. These scripts perform GPU presence checks, run NCCL bandwidth tests, and decide whether a reboot is necessary.
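The health-check step could look something like the sketch below. It is not the logic of cluster_health_orchestrator.py, which this summary does not show. It assumes GPU presence is checked by counting the lines printed by nvidia-smi, and that the NCCL bandwidth result arrives as a single number in GB/s. The expected GPU count, the bandwidth threshold, and all function names are hypothetical.

```python
import subprocess
from typing import Optional


def count_gpus(nvidia_smi_output: str) -> int:
    """Count GPUs in `nvidia-smi --query-gpu=index --format=csv,noheader` output."""
    return sum(1 for line in nvidia_smi_output.splitlines() if line.strip())


def query_gpu_count() -> int:
    """Run nvidia-smi on the local node; report 0 GPUs if it is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=60, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return 0
    return count_gpus(result.stdout)


def decide_action(gpu_count: int, expected_gpus: int,
                  busbw_gbps: Optional[float], min_busbw_gbps: float) -> str:
    """Return "reboot" when a check fails, otherwise "healthy"."""
    if gpu_count < expected_gpus:
        return "reboot"  # at least one GPU is not visible to the driver
    if busbw_gbps is None or busbw_gbps < min_busbw_gbps:
        return "reboot"  # bandwidth test failed or came in under the threshold
    return "healthy"


if __name__ == "__main__":
    # Hypothetical values: 8 GPUs per node, 300 GB/s minimum bus bandwidth,
    # and a measured bandwidth that would normally come from the NCCL test.
    print(decide_action(query_gpu_count(), 8, 350.0, 300.0))
```

Keeping the parsing and the decision in separate functions makes the thresholds easy to tune per VM size without touching the code that shells out to nvidia-smi.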

Logs are saved in JSON format for traceability, and node states are recorded for further analysis. Restarting the Slurm control daemon (slurmctld) ensures the configuration changes take effect.
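To make the JSON logging concrete, here is one possible shape for a per-job record, written as one JSON object per line. The field names, the file layout, and the function names are assumptions for illustration; the original guide may structure its logs differently.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterable


def build_record(job_id: str, nodes: Iterable[str], exit_code: int,
                 node_states: Dict[str, str]) -> dict:
    """Assemble one log entry describing a failed job and its nodes."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "job_id": job_id,
        "exit_code": exit_code,
        "nodes": list(nodes),
        "node_states": dict(node_states),  # e.g. {"node-1": "healthy"}
    }


def format_record(record: dict) -> str:
    """Serialize a record as a single JSON line."""
    return json.dumps(record, sort_keys=True) + "\n"


def append_record(path: str, record: dict) -> None:
    """Append the record to a log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_record(record))
```

Appending one object per line keeps each write small and lets later tooling read the file back line by line with json.loads.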

Key Quote:

“Slurm’s Job Epilog script allows us to hook into job termination and check the exit status.”

Bonus: Real-Time Microsoft Teams Notifications

One cool addition is integrating Microsoft Teams alerts. When a job fails, an automated notification posts the job ID, user, exit code, and affected nodes directly to your Teams channel. This boosts observability and speeds up troubleshooting.

Setting this up involves creating an incoming webhook in Teams and calling a Python script from the Epilog script. This lightweight integration helps operators stay on top of issues without constantly monitoring logs.
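A minimal sketch of such a notification script, using only the Python standard library, might look like this. It assumes the webhook accepts a simple JSON body with a "text" field; some Teams webhook setups expect an Adaptive Card payload instead, so adjust the body to match yours. The function names are hypothetical and the webhook URL is something you supply.

```python
import json
import urllib.request
from typing import Iterable


def build_message(job_id: str, user: str, exit_code: int,
                  nodes: Iterable[str]) -> dict:
    """Build the JSON body for a job-failure alert."""
    node_list = ", ".join(nodes)
    text = (
        f"Slurm job {job_id} failed "
        f"(user: {user}, exit code: {exit_code}, nodes: {node_list})"
    )
    return {"text": text}  # assumption: the webhook accepts a plain "text" field


def post_to_teams(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

When calling this from the Epilog script, it is worth catching network errors and still exiting with status 0, since a failing Epilog can cause Slurm to drain the node.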

Why It Matters

GPU clusters are expensive and complex. Automating failure detection and recovery saves time, reduces downtime, and maximizes resource utilization. Plus, integrating notifications keeps your team informed instantly.

Stay tuned for Part 2, where the focus shifts to orchestrating health checks and automated node reboots in detail.

For tech teams managing HPC workloads on Azure, this approach offers a practical, scalable way to keep GPU clusters humming smoothly.

  • Leverages Slurm’s Epilog scripts to detect job failures in real time and trigger recovery workflows.
  • Uses Python orchestrator scripts for health checks including GPU presence and NCCL bandwidth diagnostics.
  • Automates node reboot and retry logic to maintain cluster resiliency without manual intervention.
  • Integrates Microsoft Teams notifications via webhook for instant alerts on job failures.
  • Stores detailed logs and node failure data for traceability and troubleshooting in JSON, HTML, or CSV formats.
  • From the New blog articles in Microsoft Community Hub


