Deep Dive: Understanding VMware vSAN Availability and Resilience

Reliable and resilient access to applications and their data is a fundamental requirement for nearly all IT organizations. As virtualization transformed the data center, storage evolved with it. VMware vSAN is at the forefront of this evolution, shifting the paradigm from traditional, siloed storage arrays to a software-defined storage solution integrated directly into the hypervisor.

This article explores the core availability technologies within VMware vSAN, with a special focus on the advancements introduced in the Express Storage Architecture (ESA). We will break down how vSAN ensures data remains available and resilient across a wide range of failure scenarios, from a single drive failure to an entire host outage. The insights shared here are based on VMware's official whitepaper, "vSAN Availability Technologies: Resilience Capabilities for vSAN in VMware Cloud Foundation 9.0", published in December 2025.

The Architecture of Resilience: OSA vs. ESA

VMware vSAN is a distributed storage solution that aggregates local storage devices from a cluster of vSphere hosts to create a shared storage pool. This approach offers incredible scalability and flexibility. At a high level, vSAN comes in two flavors: the Original Storage Architecture (OSA) and the more advanced Express Storage Architecture (ESA), introduced in 2022.

According to the VMware whitepaper, ESA represents a significant leap forward in performance, efficiency, and resilience. Unlike OSA, which utilized disk groups with a mix of caching and capacity devices, ESA employs a single tier of high-performance NVMe devices. This streamlined design minimizes the impact of a hardware failure, dramatically reducing the time it takes to restore data to its prescribed level of resilience.

To illustrate the difference, consider the impact of a single storage device failure as documented in the whitepaper:

Metric	vSAN Original Architecture (OSA)	vSAN Express Storage Architecture (ESA)
Host Capacity Impacted	50% (entire disk group)	8.3% (single device)
Data to Resynchronize	24 TB (example)	4 TB (example)
Time to Regain Resilience	Significantly longer	92% decrease in time
I/O Processing Performance	1x (baseline)	2x (double performance)

The table shows that a device failure under ESA affects only the failed device rather than an entire disk group. Because far less data has to be resynchronized, the cluster regains its prescribed level of resilience sooner and performs better while in a degraded state.
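The percentages in the table can be reproduced with simple arithmetic. The sketch below assumes a host layout that is not stated in this article: twelve 4 TB capacity devices per host, arranged as two disk groups of six under OSA. These numbers were chosen only because they are consistent with the 50%, 8.3%, 24 TB, and 4 TB figures above; the function and parameter names are illustrative.

```python
def failure_impact(devices_per_host, device_tb, devices_lost):
    """Return (fraction of host capacity lost, TB to resynchronize).

    Assumes, pessimistically, that every lost device was full.
    """
    host_tb = devices_per_host * device_tb
    lost_tb = devices_lost * device_tb
    return lost_tb / host_tb, lost_tb


# Assumed layout (hypothetical): 12 capacity devices of 4 TB per host.
DEVICES_PER_HOST = 12
DEVICE_TB = 4

# OSA: devices are grouped into disk groups (assumed: 2 groups of 6).
# In the table's scenario the failure takes a whole disk group offline.
osa_fraction, osa_tb = failure_impact(DEVICES_PER_HOST, DEVICE_TB, devices_lost=6)

# ESA: single tier, so only the failed device is affected.
esa_fraction, esa_tb = failure_impact(DEVICES_PER_HOST, DEVICE_TB, devices_lost=1)

print(f"OSA: {osa_fraction:.1%} of host capacity, {osa_tb} TB to resync")
print(f"ESA: {esa_fraction:.1%} of host capacity, {esa_tb} TB to resync")
```

Under these assumptions the script reports 50.0% and 24 TB for OSA, and 8.3% and 4 TB for ESA.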

Building Blocks of Availability: Objects, Components, and Policies

vSAN manages data in the form of objects, which are the primary units of storage that make up a virtual machine, such as its virtual disks (VMDKs) and configuration files. As described in the VMware documentation, these objects can be up to 62 TB in size and are broken down into smaller components. The placement of these components across the hosts in the cluster is what determines the resilience of the VM's data.

This placement is not random; it is dictated by Storage Policies. These policies are the cornerstone of vSAN's availability model, allowing administrators to define the desired level of resilience for each VM on a granular basis.

Failures to Tolerate (FTT)

The most critical setting in a storage policy is Failures to Tolerate (FTT). This setting defines how many host, site, or fault domain failures an object can withstand before becoming unavailable. For example:

• FTT=1: The VM's data can tolerate the failure of one host.

• FTT=2: The VM's data can tolerate the simultaneous failure of two hosts.

The whitepaper emphasizes an important distinction: the FTT setting counts only failures of the hosts on which an object's components reside, not the total number of failures in the cluster. This object-based approach decouples availability from cluster size, so any given object becomes less likely to be affected by an individual failure as the cluster grows.
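A simplified way to picture this object-level scope is sketched below. It treats an object as available as long as no more than FTT of the hosts holding its components have failed. This is an illustrative model only: it glosses over how vSAN actually decides availability, and the host names and function are hypothetical, not part of any vSAN API.

```python
def object_available(component_hosts, failed_hosts, ftt):
    """Simplified model: an object survives if at most `ftt` of the
    hosts holding its components have failed."""
    affected = set(component_hosts) & set(failed_hosts)
    return len(affected) <= ftt


# Hypothetical 8-host cluster; this object's components sit on three hosts.
object_hosts = {"esx01", "esx02", "esx03"}

# Two hosts fail, but only one of them holds a component of this object.
print(object_available(object_hosts, {"esx02", "esx07"}, ftt=1))  # True

# Two of the object's own hosts fail: FTT=1 is exceeded.
print(object_available(object_hosts, {"esx01", "esx02"}, ftt=1))  # False
```

In the first case the cluster has suffered two failures, yet the FTT=1 object is unaffected because only one of its own hosts is down.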

Data Placement Schemes: RAID-1 vs. Erasure Coding (RAID-5/6)

Once the FTT level is set, the policy defines how that resilience is achieved through a data placement scheme. vSAN primarily uses two methods:

1. RAID-1 (Mirroring): This method creates one or more full, synchronous copies (mirrors) of the object's components on different hosts. It offers the best performance but comes at a higher capacity cost (e.g., FTT=1 requires 2x the capacity).

2. RAID-5/6 (Erasure Coding): This is a more space-efficient method that stripes data and parity information across multiple hosts. RAID-5 can tolerate one failure, while RAID-6 can tolerate two. Historically, erasure coding incurred a performance penalty, but the VMware whitepaper highlights that with the vSAN Express Storage Architecture, RAID-5/6 can be used with performance comparable to RAID-1, making it the preferred choice for most workloads.

Thanks to ESA, RAID-1 mirroring should now primarily be reserved for site-level resilience in Stretched Clusters, as noted in the documentation.
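The capacity trade-off between the two schemes follows from how many copies or parity fragments are written. The sketch below computes the raw-capacity multiplier for a few layouts. The 4+1 and 4+2 stripe widths are assumptions based on commonly documented vSAN layouts rather than figures from this article, the 100 GB disk is hypothetical, and witness and metadata overhead are ignored.

```python
def mirror_multiplier(ftt):
    """RAID-1 keeps ftt + 1 full copies of the data."""
    return ftt + 1


def erasure_multiplier(data_fragments, parity_fragments):
    """Erasure coding stores data plus parity: (d + p) / d."""
    return (data_fragments + parity_fragments) / data_fragments


vmdk_gb = 100  # hypothetical virtual disk size

layouts = {
    "RAID-1, FTT=1": mirror_multiplier(1),
    "RAID-1, FTT=2": mirror_multiplier(2),
    "RAID-5 (4+1), FTT=1": erasure_multiplier(4, 1),
    "RAID-6 (4+2), FTT=2": erasure_multiplier(4, 2),
}

for name, multiplier in layouts.items():
    print(f"{name}: {multiplier:.2f}x -> {vmdk_gb * multiplier:.0f} GB raw")
```

For the same failure tolerance, the erasure-coded layouts consume noticeably less raw capacity than mirroring, which is why comparable performance under ESA makes them attractive.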

Host Requirements and Auto-Policy Management

The chosen resilience policy has implications for the cluster size. More advanced protection schemes require more hosts to distribute the data and parity components. The whitepaper provides clear guidance on minimum and recommended host counts:

Failures to Tolerate (FTT)	Data Placement	Minimum Hosts	Recommended Hosts
1	RAID-1 or RAID-5	3	4
2	RAID-6	6	7

The recommended configuration includes an additional host beyond the minimum, allowing vSAN to automatically repair data and reestablish the prescribed level of resilience in the event of a sustained host failure.
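The host requirements translate directly into a small lookup. The sketch below encodes the minimum host counts from the table and derives the recommended count as the minimum plus one spare host. The dictionary, function name, and status labels are illustrative only.

```python
# Minimum hosts per (FTT, placement scheme), taken from the table above.
MIN_HOSTS = {
    (1, "RAID-1"): 3,
    (1, "RAID-5"): 3,
    (2, "RAID-6"): 6,
}


def check_cluster(host_count, ftt, scheme):
    """Classify a cluster size against a policy's host requirements."""
    minimum = MIN_HOSTS[(ftt, scheme)]
    recommended = minimum + 1  # one spare host for automatic repair
    if host_count < minimum:
        return "below minimum"
    if host_count < recommended:
        return "minimum"
    return "recommended"


print(check_cluster(5, 2, "RAID-6"))  # below minimum
print(check_cluster(6, 2, "RAID-6"))  # minimum
print(check_cluster(7, 2, "RAID-6"))  # recommended
print(check_cluster(3, 1, "RAID-5"))  # minimum
```

A cluster at the "minimum" level can satisfy the policy but has no spare host on which to rebuild components after a sustained host failure.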

vSAN 8 and later simplify this process with Auto-Policy Management. As described in the whitepaper, this feature allows vSAN to automatically determine the most appropriate and efficient storage policy based on the characteristics of the cluster, ensuring optimal resilience without manual intervention.

Adaptive Resilience: Intelligence Built In

One of the most impressive features highlighted in the VMware documentation is the Adaptive RAID-5 Erasure Coding capability in ESA. This feature dynamically adjusts the data structure based on the number of hosts in the cluster. For clusters with fewer than six hosts using FTT=1 with RAID-5, vSAN will stripe the data with parity across three hosts. For clusters with six or more hosts, the same policy rule will stripe the data with parity across five hosts, providing even better distribution and efficiency.

This adaptive approach ensures that vSAN automatically optimizes for the best balance of capacity efficiency and performance as your infrastructure grows.
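The selection rule for adaptive RAID-5 can be sketched as a single function. The 2+1 and 4+1 data-plus-parity splits are an assumption consistent with striping across three and five hosts; the function name and return shape are illustrative, not a vSAN API.

```python
def adaptive_raid5_layout(host_count):
    """Pick the RAID-5 stripe layout for FTT=1 from the cluster size.

    Returns (data_fragments, parity_fragments).
    """
    if host_count < 3:
        raise ValueError("RAID-5 with FTT=1 needs at least 3 hosts")
    if host_count < 6:
        return (2, 1)  # stripe with parity across 3 hosts
    return (4, 1)      # stripe with parity across 5 hosts


for hosts in (3, 5, 6, 12):
    data, parity = adaptive_raid5_layout(hosts)
    overhead = (data + parity) / data
    print(f"{hosts} hosts: {data}+{parity} across {data + parity} hosts, "
          f"{overhead:.2f}x raw capacity")
```

Under these assumptions the wider stripe lowers the raw-capacity overhead from 1.50x to 1.25x once the cluster reaches six hosts, which is the efficiency gain the adaptive behavior is meant to capture.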

Conclusion: A New Era of Resilient Infrastructure

VMware vSAN provides a robust and highly configurable set of availability technologies designed for the modern, dynamic data center. By abstracting resilience into simple storage policies, it empowers administrators to easily protect their virtual machines against a wide range of hardware failures.

The introduction of the Express Storage Architecture (ESA) marks a pivotal moment for hyper-converged infrastructure. By leveraging a single-tier NVMe architecture and making space-efficient RAID-5/6 erasure coding as performant as traditional mirroring, ESA delivers unprecedented levels of resilience, efficiency, and performance. As organizations continue to scale their virtual environments, vSAN's object-based, policy-driven approach ensures that data availability and protection can scale right along with them.

Source

This article is based on VMware's official whitepaper: "vSAN Availability Technologies: Resilience Capabilities for vSAN in VMware Cloud Foundation 9.0" (December 12, 2025).

📄 Download the full whitepaper: https://www.vmware.com/docs/vmw-vSAN-Availability-Technologies

Eric Sloof

Saturday, December 20, 2025
