Azure Availability Sets and Azure Availability Zones explained

In this blog post I aim to simplify and help you understand the difference between Fault and Update Domains within an Azure Availability Set.

Furthermore, I will discuss the Azure SLA (Service Level Agreement) Microsoft offer for a single Virtual Machine, for two or more Virtual Machines configured within an Azure Availability Set, and finally for Virtual Machines deployed across Azure Availability Zones.

What is an Azure Availability Set?

• An availability set is a logical grouping of VMs that allows Azure to understand how your application is built to provide for redundancy and availability.

• Each virtual machine in your availability set is assigned an update domain and a fault domain by the underlying Azure platform. Each availability set can be configured with up to three fault domains and twenty update domains.

• There is no cost for the Availability Set itself; you only pay for each VM instance that you create.

• A VM can only be added to an availability set when it is created. To change the availability set, you need to delete and then recreate the virtual machine.

• Microsoft recommend that two or more VMs are created within an availability set to provide for a highly available application and to meet the 99.95% Azure SLA.

What is an Update Domain?

• Update domains indicate groups of virtual machines and underlying physical hardware that can be rebooted at the same time.

• Each virtual machine in your availability set is assigned an update domain by the underlying Azure platform.

• When more than five virtual machines are configured within a single availability set with five update domains, the sixth virtual machine is placed into the same update domain as the first virtual machine, the seventh into the same update domain as the second virtual machine, and so on.

• Update domains are not necessarily rebooted in sequential order during planned maintenance, but only one update domain is rebooted at a time.

• A rebooted update domain is given 30 minutes to recover before maintenance is initiated on a different update domain.
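The wrap-around placement described above can be sketched in a few lines of Python. This is only an illustrative model: the function name `assign_update_domains` and the VM names are hypothetical, and it assumes VMs are placed in creation order, whereas in practice the Azure platform makes the assignment for you.

```python
def assign_update_domains(vm_names, update_domain_count=5):
    """Map each VM to an update domain index, wrapping round after the last one."""
    # An availability set can be configured with up to twenty update domains.
    if not 1 <= update_domain_count <= 20:
        raise ValueError("update_domain_count must be between 1 and 20")
    # Assumption: VMs are placed in order, so the index simply wraps with modulo.
    return {name: index % update_domain_count
            for index, name in enumerate(vm_names)}


vms = [f"VM{n}" for n in range(1, 8)]  # seven hypothetical VMs
placement = assign_update_domains(vms)
for name, update_domain in placement.items():
    print(f"{name} -> UD{update_domain}")
```

With five update domains, the sixth VM lands back in the first update domain (UD0) alongside the first VM, and the seventh joins the second VM in UD1.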

What are Fault Domains?

• Fault domains define the group of virtual machines that share a common power source and network switch. For example, to be rack fault tolerant, your servers and your data must be distributed across multiple racks.

• By default, the virtual machines configured within your availability set are separated across up to three fault domains. Please note, selecting three fault domains may not always be possible.

What are Azure Availability Zones?

• Azure availability zones are physically separate locations within each Azure region that are tolerant to local failures. Failures can range from software and hardware failures to events such as earthquakes, floods, and fires.

• Azure regions and availability zones are designed to help you achieve resiliency and reliability for your business-critical workloads.

• Each Azure region features datacentres deployed within a latency-defined perimeter.

• Each zone is composed of one or more datacentres equipped with independent power, cooling, and networking infrastructure.

• Availability Zones are not available in all regions. For more information, visit https://docs.microsoft.com/en-us/azure/availability-zones/az-overview

• Microsoft offer a 99.99% SLA when two or more VMs are deployed across two or more Availability Zones in the same region.

What are Azure Service Level Agreements (SLAs)?

Azure periodically updates its platform to improve the reliability, performance, and security of the host infrastructure for virtual machines. The purpose of these updates ranges from patching software components in the hosting environment to upgrading networking components or decommissioning hardware. Service-level agreements (SLAs) describe Microsoft’s commitments for uptime and connectivity.

The SLAs Microsoft offer for Azure services can be found on the following Microsoft website: https://azure.microsoft.com/en-gb/support/legal/sla/

SLA provided on Single Azure VM vs Availability Set vs Availability Zone

To simplify this blog post, below is a table I compiled with the SLAs provided for a single VM in Azure (with Premium SSD or Ultra Disk storage), an availability set and finally an availability zone. We will refer to this table throughout this blog post.

Deployment | SLA | Downtime per month
Single VM (with Premium SSD or Ultra Disk) | 99.9% | 43.83 mins
Availability Set | 99.95% | 21.92 mins
Availability Zones | 99.99% | 4.38 mins
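If you want to check the downtime figures yourself, the arithmetic can be sketched as follows. The helper name `monthly_downtime_minutes` is hypothetical, and the sketch assumes an average month of 365.25 / 12 days (43,830 minutes), which is the month length that matches the values in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumption: an "average" month of 365.25 / 12 days, i.e. 43,830 minutes.
MINUTES_PER_MONTH = Decimal("365.25") / 12 * 24 * 60


def monthly_downtime_minutes(sla_percent):
    """Return the downtime per month (in minutes) that an SLA percentage allows."""
    unavailable_fraction = (Decimal(100) - Decimal(sla_percent)) / 100
    minutes = MINUTES_PER_MONTH * unavailable_fraction
    # Round to two decimal places, rounding halves up.
    return minutes.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for deployment, sla in [("Single VM", "99.9"),
                        ("Availability Set", "99.95"),
                        ("Availability Zones", "99.99")]:
    print(f"{deployment}: {sla}% -> {monthly_downtime_minutes(sla)} mins per month")
```

The SLA values are passed as strings so that `Decimal` works with the exact percentages rather than binary floating-point approximations.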

Now let’s simplify further.

Deploying a single Azure Virtual Machine

In the diagram below, I demonstrate the deployment of a single virtual machine in Azure. As you can see, a physical host failure or a rack failure may cause downtime for your single virtual machine.

Deploy two Azure Virtual Machines

In the diagram below, I demonstrate the deployment of two virtual machines in Azure. As you can see, there is no guarantee that the two VMs will be hosted on separate physical racks or physical hosts. A single physical host failure or rack failure may therefore cause downtime for both virtual machines.

Deploy virtual web servers in an Azure Availability Set

In the diagram below I deploy an availability set with 2 fault domains and 4 update domains. As you can see from the diagram, fault domains relate to physical racks, each with an independent power source and network connectivity. Update domains indicate groups of virtual machines and underlying physical hardware that can be rebooted at the same time. To simplify update domains, imagine the update domain as the underlying physical host that hosts your virtual machine.

So how does Microsoft deal with the scenario below? As you can see, the four virtual web servers have been distributed across two fault domains (physical racks) and four update domains (physical servers). As mentioned earlier in the post, Microsoft only reboot one update domain at a time. So, for example, if Microsoft were to perform maintenance on the underlying physical hardware and there was a requirement to reboot your virtual machines, you would only lose 25% of your workloads at any one time, in this case one server. Microsoft would wait 30 minutes to allow the server to stabilise before continuing to reboot the next update domain, and so on. The benefit is that your solution would continue running without downtime.

If I were to introduce an additional two virtual web servers to my availability set, the below table shows what my scenario would look like. Because I only configured 4 update domains within my Availability Set (UD0, UD1, UD2, UD3), the two additional VMs would wrap round to update domain 0 (WEB5) and update domain 1 (WEB6). Therefore, if Microsoft rebooted update domain 3 (UD3), I would temporarily lose server WEB4. If Microsoft moved on to rebooting UD0, I would temporarily lose WEB1 and WEB5, whilst the remaining servers would remain up and running.
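The six-server scenario can also be walked through in a short Python sketch. It is a simplified model rather than how Azure actually schedules maintenance: the function name `group_by_update_domain` is hypothetical, servers are assumed to be placed in creation order, and the update domains are rebooted in numeric order here purely for readability.

```python
from collections import defaultdict


def group_by_update_domain(vm_names, update_domain_count):
    """Group VM names by update domain, assuming placement in creation order."""
    groups = defaultdict(list)
    for index, name in enumerate(vm_names):
        groups[index % update_domain_count].append(name)
    return dict(groups)


web_servers = [f"WEB{n}" for n in range(1, 7)]  # WEB1 to WEB6
groups = group_by_update_domain(web_servers, update_domain_count=4)

# Only one update domain is rebooted at a time, so list what each reboot affects.
for update_domain, offline in sorted(groups.items()):
    online = [vm for vm in web_servers if vm not in offline]
    print(f"Reboot UD{update_domain}: offline {offline}, still serving {online}")
```

Rebooting UD0 or UD1 takes two of the six servers offline, while rebooting UD2 or UD3 takes only one, so at least four servers keep serving traffic throughout.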

I hope this clarifies the difference between update and fault domains within an Azure Availability set.

Referring back to the SLA table further up in this post, your permitted downtime is halved when using an Availability Set compared to deploying a single Virtual Machine using Premium SSD or Ultra Disk.

Azure Availability Zones

Moving on to Azure Availability Zones. Please note that you cannot select both Azure Availability Zones and Availability Sets when deploying a virtual machine. You can only select one or the other.

Azure availability zones are physically separate locations within each Azure region that are tolerant to local failures.

To simplify the above, Azure Availability Zones are independent physical datacentres located within a region. These datacentres sit close to the primary datacentre (approximately 20 miles away, although they could be closer or further) and are equipped with independent power, cooling, and networking infrastructure. As mentioned further up in this post, not all Azure regions include Azure Availability Zones.

Most importantly, when deploying across Availability Zones within an Azure region, Microsoft offer an SLA of 99.99%, which equates to downtime of 4.38 minutes per month.

If you wish to mitigate the risk of downtime in the event of a region failure, such as a major earthquake that could take down the primary datacentre along with the Availability Zones in your region, you could utilise the partner region as a DR location, for example, UK South and UK West.

I hope this has helped.