Data transfer over TCP is very common in vSphere environments. Examples include storage traffic between the VMware ESXi host and an NFS or iSCSI datastore, and various forms of vMotion traffic between vSphere hosts.
VMware has observed that even extremely infrequent TCP issues can have an outsized impact on overall transfer throughput. For example, in VMware's experiments with ESXi NFS read traffic from an NFS datastore, a seemingly minor 0.02% packet loss resulted in an unexpected 35% decrease in NFS read throughput.
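The outsized effect of rare loss follows from how TCP congestion control reacts to each loss event. A rough intuition comes from the well-known Mathis model, which bounds steady-state TCP throughput by MSS/RTT scaled by 1/sqrt(p), where p is the loss rate. The sketch below is illustrative only; the MSS and RTT values are hypothetical datacenter numbers, not measurements from the paper's experiments.

```python
import math

def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on steady-state TCP throughput (Mathis model):
    throughput <= (MSS / RTT) * C / sqrt(p), with C = sqrt(3/2)."""
    C = math.sqrt(3.0 / 2.0)
    return (mss_bytes * 8 / rtt_s) * C / math.sqrt(loss_rate)

# Hypothetical values: 1460-byte MSS, 0.5 ms RTT, 0.02% packet loss.
bound = mathis_throughput_bps(1460, 0.0005, 0.0002)
print(f"Throughput bound at 0.02% loss: ~{bound / 1e9:.1f} Gbps")
```

With these assumed numbers, a 0.02% loss rate caps a single TCP connection at roughly 2 Gbps, far below a 10 Gbps link's capacity, which is consistent with the observation that tiny loss rates can dominate transfer performance.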
In this paper, VMware describes a methodology for identifying TCP issues that are commonly responsible for poor transfer throughput. VMware captures the network traffic of a data transfer into a packet trace file for offline analysis. The packet trace is then analyzed for signatures of common TCP issues that may have a significant impact on transfer throughput.
The TCP issues considered include packet loss and retransmission, long pauses due to TCP timers, and bandwidth delay product (BDP) issues. VMware uses Wireshark to perform the analysis, and a Wireshark profile is provided to simplify the analysis workflow. VMware describes a systematic approach to identify common TCP issues with significant transfer throughput impact and recommends that engineers troubleshooting data transfer throughput performance include this methodology as a standard part of their workflow.
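A bandwidth delay product issue arises when the sender's effective window is smaller than the amount of data the network path can hold, so the pipe is never kept full. The minimal sketch below shows the underlying arithmetic; the 10 Gbps link speed and 0.5 ms RTT are assumed example values, not figures from the paper.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the number of bytes that must be in flight
    to keep a path of the given bandwidth and round-trip time fully utilized."""
    return bandwidth_bps / 8 * rtt_s

# Assumed example path: 10 Gbps link with a 0.5 ms round-trip time.
bdp = bdp_bytes(10e9, 0.0005)
print(f"BDP: {bdp / 1024:.0f} KiB")  # the TCP window must be at least this large
```

If the receive window or the sender's congestion window stays below this value, throughput is limited to window/RTT regardless of link speed, which is the signature the analysis looks for in the trace.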
VMware assumes readers are familiar with the relevant TCP concepts described in this paper and have a good working knowledge of Wireshark. For additional information about these topics, refer to the SharkFest Retrospective page of recent SharkFest Conferences.