High Availability for VMs in Azure

According to Microsoft's official definition, high availability refers to a set of technologies that minimize IT disruptions by providing business continuity of IT services through redundant, fault-tolerant, or failover-protected components inside the same data center. In our case, the data center resides within one Azure region.

In Azure, High Availability can be achieved in two ways:

  1. Availability Sets

  2. Availability Zones

Availability Sets

An Availability Set in Azure is a logical grouping of VMs that tells Azure how your application is built, so that Azure can provide redundancy and availability.

An Availability Set provides high availability for VMs within one datacenter.

Two concepts are crucial to understand when creating Availability Sets:

  1. Fault Domains

  2. Update Domains

By default, Azure provides 2 Fault Domains and 5 Update Domains in an Availability Set.

Consider the following diagram:

The diagram illustrates an Azure data center. Each rack represents a group of servers connected to its own power source. Suppose we have deployed some VMs across the datacenter as shown in the diagram, with a load balancer sending incoming requests to the VMs.

Fault Domain

Suppose one of the power sources fails. The entire rack goes down, and the VMs inside it become unavailable. However, the VMs deployed on other racks across the data center remain operational.

Because each rack has its own single point of failure, it can be considered a Fault Domain.

So, a Fault Domain can be defined as a logical group of hardware that shares a single point of failure.

Example (the numbers are illustrative; an Availability Set supports at most 3 Fault Domains):

  • We have 40 VMs and 4 FDs.

  • Then each FD will have 10 VMs deployed in it.

  • If a Fault Domain fails, only 10 VMs will be affected.

  • We will still have 30 VMs running.
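The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration, assuming the VMs are spread as evenly as possible across the domains; the function name `vms_after_domain_failure` is hypothetical and not part of any Azure SDK.

```python
def vms_after_domain_failure(total_vms: int, domain_count: int) -> tuple[int, int]:
    """Return (affected, still_running) when a single domain fails.

    Assumes VMs are spread as evenly as possible, and reports the
    worst case, i.e. the failure of the most heavily loaded domain.
    """
    if total_vms < 0 or domain_count <= 0:
        raise ValueError("total_vms must be >= 0 and domain_count must be > 0")
    # Ceiling division: the fullest domain holds ceil(total / domains) VMs.
    affected = -(-total_vms // domain_count)
    return affected, total_vms - affected


# The example from the text: 40 VMs across 4 Fault Domains.
print(vms_after_domain_failure(40, 4))  # (10, 30)
```

When the VM count does not divide evenly, the function reports the larger share, so the result is a lower bound on the number of VMs that keep running.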

Update Domain

Suppose scheduled maintenance is planned for a group of hardware in the data center. All the hardware within that Update Domain undergoes maintenance at the same time, as shown in the diagram. While the VMs in that Update Domain are under maintenance, we still have access to the other VMs in the datacenter.

A group of hardware that is scheduled for planned maintenance or a reboot together is known as an Update Domain.

So, an Update Domain is defined as a logical group of hardware that undergoes planned maintenance or reboot at the same time.

Example:

  • We have 40 VMs and 20 UDs.

  • Then, each Update Domain will have 2 VMs deployed in it.

  • When an Update Domain undergoes planned maintenance, only 2 VMs will be affected.

  • We will still have 38 VMs running.
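To make the rolling nature of planned maintenance concrete, the following sketch walks through the Update Domains one at a time and reports how many VMs are offline at each step. It is a simplified model, assuming an even spread of VMs and that only one Update Domain is serviced at a time; `rolling_maintenance` is a hypothetical helper, not an Azure API.

```python
from typing import Iterator


def rolling_maintenance(total_vms: int, update_domains: int) -> Iterator[tuple[int, int, int]]:
    """Yield (ud_index, vms_offline, vms_running) for each maintenance step.

    Assumes one Update Domain is serviced at a time and VMs are spread
    as evenly as possible across the Update Domains.
    """
    if total_vms < 0 or update_domains <= 0:
        raise ValueError("total_vms must be >= 0 and update_domains must be > 0")
    base, extra = divmod(total_vms, update_domains)
    for ud in range(update_domains):
        # The first `extra` domains hold one additional VM each.
        offline = base + (1 if ud < extra else 0)
        yield ud, offline, total_vms - offline


# The example from the text: 40 VMs across 20 Update Domains.
steps = list(rolling_maintenance(40, 20))
print(steps[0])                                  # (0, 2, 38)
print(min(running for _, _, running in steps))   # 38
```

The minimum of the `vms_running` column is the capacity you can count on throughout the maintenance window under these assumptions.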

VM Distribution

The VMs are distributed evenly among the Fault Domains and the Update Domains.

Example:

The VMs in the above diagrams are distributed evenly across the given numbers of FDs and UDs. A VM can be assigned to only one Availability Set, and only at the time of its creation.

This is how High Availability can be achieved for VMs within one datacenter.

But what if the whole datacenter goes down?

This is where the other option, Availability Zones, comes in.
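An even spread across Fault Domains and Update Domains can be modeled with a simple round-robin assignment. The sketch below is an assumption made for illustration only: Azure performs the actual placement itself, and the function `assign_domains` and the VM names are hypothetical. The defaults of 2 Fault Domains and 5 Update Domains are the ones mentioned earlier.

```python
from collections import Counter


def assign_domains(vm_count: int, fault_domains: int = 2, update_domains: int = 5):
    """Assign each VM a (name, fault_domain, update_domain) triple.

    Round-robin placement is an illustrative assumption; the real
    placement is decided by the Azure platform.
    """
    if fault_domains <= 0 or update_domains <= 0:
        raise ValueError("domain counts must be positive")
    return [
        (f"vm{i}", i % fault_domains, i % update_domains)
        for i in range(vm_count)
    ]


placement = assign_domains(10)
fd_counts = Counter(fd for _, fd, _ in placement)
ud_counts = Counter(ud for _, _, ud in placement)

print(dict(fd_counts))  # {0: 5, 1: 5}
print(dict(ud_counts))  # {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}
```

With 10 VMs, losing one Fault Domain affects 5 VMs, while servicing one Update Domain affects only 2.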

Availability Zones

Availability Zones are physically separated groups of datacenters within an Azure region that are tolerant to local failures.

  • These Availability Zones are located close enough to have low-latency connections with other Availability Zones.

  • They are connected by a high-performance network with a round-trip latency of less than approximately 2 ms.

  • However, they are far enough from each other to be isolated from faults or failures in other Availability Zones.

  • Each Availability Zone has independent power, cooling and networking infrastructure.

  • An Availability Zone can be assigned to a VM at the time of creation only.

Adding Availability Zone to a VM

Availability Zones and Availability Sets cannot be used together on a VM.
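The two placement rules (a VM uses either an Availability Set or an Availability Zone, and the choice is fixed at creation) can be expressed as a small validation model. This is a sketch only and does not use the Azure SDK: the class `VmPlacement`, its fields, and the names `web-1` and `as-1` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: placement cannot be changed after creation
class VmPlacement:
    name: str
    availability_set: Optional[str] = None
    availability_zone: Optional[int] = None

    def __post_init__(self) -> None:
        # A VM may use an Availability Set or an Availability Zone, not both.
        if self.availability_set is not None and self.availability_zone is not None:
            raise ValueError(
                f"{self.name}: choose either an Availability Set or an Availability Zone"
            )


vm = VmPlacement("web-1", availability_zone=1)
print(vm.availability_zone)  # 1

try:
    VmPlacement("web-2", availability_set="as-1", availability_zone=1)
except ValueError as err:
    print("rejected:", err)
```

Because the dataclass is frozen, an attempt such as `vm.availability_zone = 2` raises an error, mirroring the rule that the zone is assigned only when the VM is created.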

Factors Affecting Availability

  1. Application failures

  2. Within the datacenter

    • Hardware failure

    • Unplanned hardware maintenance

    • Planned hardware maintenance or reboot

  3. Outside the datacenter

    • Datacenter failure

    • Region failure

Summary

  • Availability Sets

    • Availability Sets are logical groups of VMs within a datacenter.

    • They are used to provide High Availability within a datacenter.

    • By default, Azure provides 2 Fault Domains and 5 Update Domains in an Availability Set.

  • Availability Zones

    • Availability Zones are physically separated groups of datacenters within an Azure region.

    • They are used to provide High Availability across datacenters.

  • Availability Sets and Availability Zones cannot be used together on a VM.

  • An Availability Zone or an Availability Set can be assigned to a VM only at the time of creation.

References

https://learn.microsoft.com/en-us/azure/reliability/availability-zones-overview?tabs=azure-cli

https://learn.microsoft.com/en-us/azure/virtual-machines/availability-set-overview