Data transfer over TCP is very common in vSphere environments. Examples include storage traffic between the VMware ESXi host and an NFS or iSCSI datastore, and various forms of vMotion traffic between vSphere datastores.
VMware has observed that even extremely infrequent TCP issues could have an outsized impact on overall transfer throughput. For example, in VMware's experiments with ESXi NFS read traffic from an NFS datastore, a seemingly minor 0.02% packet loss resulted in an unexpected 35% decrease in NFS read throughput.
In this paper, VMware describes a methodology for identifying TCP issues that are commonly responsible for poor transfer throughput. VMware captures the network traffic of a data transfer into a packet trace file for offline analysis. The packet trace is then analyzed for signatures of common TCP issues that may have a significant impact on transfer throughput.
The TCP issues considered include packet loss and retransmission, long pauses due to TCP timers, and bandwidth delay product (BDP) issues. VMware uses Wireshark to perform the analysis, and a Wireshark profile is provided to simplify the analysis workflow. VMware describes a systematic approach to identify common TCP issues with significant transfer throughput impact and recommends that engineers troubleshooting data transfer throughput performance include this methodology as a standard part of their workflow.
VMware assumes readers are familiar with the relevant TCP concepts described in this paper and have a good working knowledge of Wireshark. For additional information about these topics, refer to the SharkFest Retrospective pages for recent SharkFest conferences.
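As a rough illustration of the bandwidth delay product (BDP) issues mentioned above, the following Python sketch computes the BDP of a link and the throughput ceiling imposed by a fixed TCP window. It is not taken from the paper; the function names and the example link speed, round-trip time, and window size are assumptions chosen only for illustration.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_sec: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep a link of this speed full."""
    return bandwidth_bits_per_sec * rtt_seconds / 8


def window_limited_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    # Hypothetical values: a 10 Gbit/s link with a 1 ms round-trip time
    # and a 64 KiB TCP window.
    link_bps = 10e9
    rtt = 0.001
    window = 64 * 1024

    bdp = bandwidth_delay_product_bytes(link_bps, rtt)
    ceiling = window_limited_throughput_bps(window, rtt)

    print(f"BDP: {bdp:,.0f} bytes")
    print(f"Window-limited ceiling: {ceiling / 1e6:,.0f} Mbit/s")
    if window < bdp:
        print("The window is smaller than the BDP, so the link cannot be filled.")
```

When the window is smaller than the BDP, the sender sits idle for part of every round trip, which is the kind of signature a packet trace analysis would look for.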
The VMware Cloud Foundation Administrator (VCP-VCF Admin) 2024 certification is a key credential for IT professionals looking to validate their expertise in deploying, managing, and supporting private cloud environments using VMware Cloud Foundation (VCF). This certification is ideal for those transitioning from traditional infrastructure roles to cloud administration. It is specifically tailored for professionals tasked with implementing and maintaining VCF infrastructure, ensuring it aligns with organizational goals for availability, performance, and security. Earning this certification showcases a professional’s capability to efficiently operate and manage VCF environments.
In this paper, VMware gives AI/ML training workload performance test results for the VMware vSphere virtualization platform using multiple NVIDIA A100-80GB GPUs with NVIDIA NVLink; the results fall into the “Goldilocks Zone,” which refers to the area of good performance with virtualization benefits.
The results show that training times for several virtualized MLPerf Training v3.0 benchmarks are within 1.06% to 1.08% of those for the same workloads run on a comparable bare metal system. Note that lower is better.
In addition, VMware shows the MLPerf Inference v3.0 test results for the vSphere virtualization platform with NVIDIA H100 and A100 Tensor Core GPUs. The tests show that when NVIDIA vGPUs are used in vSphere, the workload performance measured as queries served per second (qps) is 94% to 105% of the performance on the bare metal system. Note that higher is better.
In today's fast-paced technological landscape, staying updated with the latest trends and advancements is crucial for IT professionals, developers, and tech enthusiasts. One of the key players in the virtualization and cloud computing arena, VMware, has taken a significant step to make this easier by offering free access to its VMware Explore Video Library.
VMware Explore is a premier event that brings together industry leaders, practitioners, and innovators to explore the latest advancements in digital transformation, multi-cloud environments, security, and more. It's a hub for learning, networking, and discovering new technologies that can transform how businesses operate and innovate.
The VMware Explore Video Library is an extensive repository of recorded sessions from past VMware Explore events. This treasure trove of knowledge includes keynote addresses, technical sessions, panel discussions, hands-on labs, and expert interviews. The library covers a wide range of topics, from cloud infrastructure and management to security, networking, and modern applications.
Accessing the VMware Explore Video Library is straightforward. Simply visit the VMware Explore website and navigate to the video library section. You don't need to create an account or log in.
VMware's initiative to provide free access to its Explore Video Library is a significant step towards empowering the tech community. Whether you're looking to enhance your skills, stay updated with the latest industry trends, or gain insights from experts, the video library is an invaluable resource. Take advantage of this opportunity to broaden your knowledge and stay ahead in the ever-evolving tech landscape.
Maintaining the availability of the NSX-T management cluster is crucial for ensuring the stability and performance of your virtualized network environment. This blog post will explore strategies to ensure high availability (HA) of NSX-T managers, outline the recovery process during failures, and discuss best practices for disaster recovery.
NSX-T Management Cluster Overview
The NSX-T management cluster typically consists of three nodes. This configuration ensures redundancy and fault tolerance. If one node fails, the cluster retains quorum, and normal operations continue. However, the failure of two nodes can disrupt management operations, requiring swift recovery actions.
High Availability in NSX-T Management Cluster
Quorum Maintenance:
The management cluster needs at least two out of three nodes operational to maintain quorum. This ensures that the NSX Manager UI and related services remain available.
If a node fails, the remaining two nodes can still communicate and manage the environment, preventing downtime.
Node Failures and Impact:
Single Node Failure: The cluster continues to function normally with two nodes.
Two-Node Failure: The cluster loses quorum, and the NSX Manager UI becomes unavailable. Management operations via CLI and API will also fail.
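The quorum behavior described above can be sketched in a few lines of Python. This is a minimal illustration that assumes a simple majority rule, which matches the two-out-of-three requirement stated in this post; the function names are hypothetical and are not part of any NSX-T tooling.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, healthy_nodes: int) -> bool:
    """True if enough nodes are healthy to keep the cluster operational."""
    return healthy_nodes >= quorum_size(total_nodes)


if __name__ == "__main__":
    total = 3  # standard three-node management cluster
    for failed in range(total + 1):
        healthy = total - failed
        state = "quorum held" if has_quorum(total, healthy) else "quorum lost"
        print(f"{failed} node(s) failed, {healthy} healthy: {state}")
```

For a three-node cluster, this reports that quorum is held with zero or one failed node and lost with two or three, in line with the failure scenarios listed above.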
Recovery Strategies
When a majority of the nodes fail, swift action is required to restore the cluster to a functional state.
Deploying a New Manager Node:
Deploy a new manager node as a fourth member of the existing cluster.
Use the CLI command detach node <node-uuid> or the API endpoint /api/v1/cluster/<node-uuid>?action=remove_node to remove the failed node from the cluster.
This command should be executed from one of the healthy nodes.
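As a hedged sketch of the API route above, the following Python snippet builds the remove_node request without sending it. The endpoint path comes from this post; the use of POST as the HTTP method, the host name, and the node UUID are assumptions for illustration, and authentication and TLS settings are deliberately left out.

```python
import urllib.request
from urllib.parse import quote


def build_remove_node_request(manager_host: str, node_uuid: str) -> urllib.request.Request:
    """Build (but do not send) the request that removes a failed node.

    manager_host should be one of the healthy manager nodes.
    """
    # Percent-encode the UUID so it is safe to place in the URL path.
    path = f"/api/v1/cluster/{quote(node_uuid, safe='')}"
    url = f"https://{manager_host}{path}?action=remove_node"
    # Assumption: the action is invoked with POST; confirm against the
    # API reference for your NSX-T version before use.
    return urllib.request.Request(url, method="POST")


if __name__ == "__main__":
    # Hypothetical host name and UUID, used only to show the resulting URL.
    request = build_remove_node_request(
        "nsx-mgr-01.example.com",
        "11111111-2222-3333-4444-555555555555",
    )
    print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it easy to review the target node before running a destructive operation against the cluster.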
Deactivating the Cluster (Optional):
Run the deactivate cluster command on the active node to form a single-node cluster.
Add new nodes to restore the cluster to its three-node configuration.
Best Practices for Disaster Recovery
Regular Backups:
Schedule regular backups of the NSX Manager configurations to facilitate quick recovery.
Store backups securely and ensure they are easily accessible during a disaster recovery scenario.
Geographical Redundancy:
Deploy NSX Managers across multiple sites to ensure geographical redundancy.
In case of a site failure, the other site can take over management operations with minimal disruption.
Proactive Monitoring:
Use NSX-T's built-in monitoring tools and integrate with third-party solutions to continuously monitor the health of the management cluster.
Early detection of issues can prevent major failures and reduce downtime.
Disaster Recovery Sites:
Prepare a disaster recovery site with standby NSX Managers configured to recover from backups.
This setup allows for quick restoration and ensures continuity of operations in case of a primary site failure.
Conclusion
Ensuring the high availability and disaster recovery of your NSX-T management cluster is essential for maintaining a robust and resilient virtual network environment. By following best practices for node management, deploying a geographically redundant setup, and maintaining regular backups, you can minimize downtime and ensure swift recovery from failures.
For a deeper dive into the technical details, check out these resources:
In this video, I'll demonstrate these concepts in action, explore various failure scenarios, and discuss disaster recovery strategies in detail. You can obtain a copy of the original Excalidraw whiteboard file along with the presentation slides in both PDF and PowerPoint formats from GitHub.
In this book, Luciano Patrão takes a step-by-step approach to ensure that you grasp the fundamental concepts and practical procedures required to build and manage virtual machines in a VMware vSphere environment. Together, you will explore the key components of vSphere, with details and explanations for each feature, including the hypervisor, networking, storage, and High Availability, unraveling their intricacies and highlighting best practices.
This book gives you the VMware knowledge needed to design, set up, and maintain vSphere environments that meet modern computing needs. It also covers advanced topics, such as resource optimization, performance monitoring, advanced settings, and automation, empowering you to take your virtualization skills to the next level.
Practical, step-by-step approach to vSphere deployment and management
Latest content for transitioning to VMware technologies and preparing for VMware certifications