After delivering this session at both VMUG Connect Amsterdam and VMUG Connect Minneapolis, I can honestly say the response exceeded my expectations. The rooms were packed, the questions were sharp, and the feedback was overwhelmingly positive. It's clear that VCF troubleshooting is a topic that resonates deeply with the community right now.
What the session was all about
The goal was simple: move beyond theory and give attendees practical, real-world troubleshooting techniques they could take back to their environments the very next day. The session was built around four core areas.
Preventive Health Checks
Before you can troubleshoot, you need visibility. I walked through three essential tools for keeping your VCF environment healthy: the VCF Diagnostic Tool, a Python-based script that runs locally on your vCenter or SDDC Manager appliance and generates reports in TXT, LOG and JSON formats; the VMware Health and Security Toolkit (HST), which gives you a comprehensive security assessment across your entire vSphere environment; and the VCF Build and Integrate Lifecycle Toolkit (BILT), which handles both greenfield validation and brownfield readiness checks before expansions or upgrades.
For post-deployment health monitoring, the SoS Utility remains one of the most powerful tools in the VCF admin's arsenal. Running individual checks like certificate health, connectivity health, NTP synchronization and services health gives you a clear picture of your environment's status at any point in time.
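As a rough illustration of running those individual checks, the sketch below builds one SoS command line per check without executing anything. The install path and flag names are assumptions taken from the SoS documentation for earlier VCF releases, and the helper name `build_sos_commands` is hypothetical; confirm the available options with the tool's own help output on your SDDC Manager appliance before relying on them.

```python
from typing import Dict, List

# Assumed install path of the SoS utility on the SDDC Manager appliance.
SOS_PATH = "/opt/vmware/sddc-support/sos"

# Assumed flag names for the four checks mentioned above; verify them
# against the SoS help output for your VCF release.
CHECK_FLAGS: Dict[str, str] = {
    "certificates": "--certificate-health",
    "connectivity": "--connectivity-health",
    "ntp": "--ntp-health",
    "services": "--services-health",
}


def build_sos_commands(checks: List[str]) -> List[List[str]]:
    """Return one argument list per requested check (nothing is executed)."""
    unknown = [name for name in checks if name not in CHECK_FLAGS]
    if unknown:
        raise ValueError(f"Unknown check(s): {', '.join(unknown)}")
    return [[SOS_PATH, CHECK_FLAGS[name]] for name in checks]


if __name__ == "__main__":
    # Print the command lines so they can be reviewed before running them.
    for command in build_sos_commands(["certificates", "ntp"]):
        print(" ".join(command))
```

Keeping each check as its own command makes it easy to review or schedule them separately; each argument list could be handed to `subprocess.run` on the appliance once the flags are confirmed.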
A Real-World Troubleshooting Example
This was the part of the session that generated the most discussion. I walked through an actual customer case — a shoutout to Alexander Bituev from ING Frankfurt for this one — where a workload domain creation failed with a frustratingly vague error in the Fleet Management UI.
The key lesson here was what I call the Golden Rule of VCF troubleshooting: the first error in the log is the root cause. Everything that follows is just noise. Using a simple grep command against the domainmanager.log file, or alternatively drilling into the SDDC Manager API via the built-in Developer Center Swagger UI, reveals the full error chain. In this case the nested error told the complete story: the ESXi host already had vSAN enabled, which blocked the workload domain deployment. The referenceToken shown in the Fleet Management UI is your bridge between the UI and the API — always use it.
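To make the Golden Rule concrete, here is a minimal Python sketch of the same idea as the grep approach: scan the log from the top and stop at the first error line, optionally limited to lines that carry a given referenceToken. The log layout, the assumption that the token appears verbatim in the log lines, the helper name `first_error`, and the sample lines are all illustrative and not taken from the actual customer case.

```python
import re
from typing import Iterable, Optional

# Assumption: error lines contain the literal level "ERROR" as a whole word.
ERROR_PATTERN = re.compile(r"\bERROR\b")


def first_error(lines: Iterable[str], token: Optional[str] = None) -> Optional[str]:
    """Return the first ERROR line, optionally restricted to lines containing token."""
    for line in lines:
        if ERROR_PATTERN.search(line) and (token is None or token in line):
            return line.rstrip("\n")
    return None


# Hypothetical sample lines, invented purely for illustration.
sample = [
    "2025-01-01T10:00:00 INFO  [ABC123] Starting workload domain creation",
    "2025-01-01T10:00:05 ERROR [ABC123] Host validation failed: vSAN already enabled",
    "2025-01-01T10:00:06 ERROR [ABC123] Workflow failed",
    "2025-01-01T10:00:07 ERROR [XYZ999] Unrelated failure",
]

if __name__ == "__main__":
    print(first_error(sample, token="ABC123"))
```

Against a real log, the same function can be fed an open file handle for domainmanager.log, since it only needs an iterable of lines. Everything after the returned line is the follow-on noise the rule tells you to ignore at first.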
New Troubleshooting Features in VCF Operations
VCF Operations has matured significantly and now offers a genuinely powerful set of troubleshooting capabilities. The Diagnostic Findings feature collects data from all VCF platform components every four hours and surfaces both active and historical findings. The Findings Catalog currently contains over 637 checks covering components like vCenter, ESXi, vSAN, NSX and VCF Automation.
The Troubleshooting Workbench is a personal favorite. It correlates events, property changes and anomalous metrics across your entire VCF stack in a single view, making it much faster to identify what changed and when. Combined with the Log Compare feature in VCF Operations for Logs 9.0, which lets you run multiple log queries side by side across different time periods, you have an incredibly powerful toolkit for root cause analysis.
The Storage Operations dashboard provides vSAN cluster health scores from 0 to 100, performance KPIs including IOPS, throughput and latency, and links directly to Broadcom KB articles for remediation. The Network Operations view gives you a holistic topology of your entire NSX environment including the new NSX VPC Dashboard with dropped packet monitoring.
Training and Certification
I also covered the VCF troubleshooting training landscape, which has been significantly expanded. The path runs from the foundational VCF Fundamentals for Technical Support course through the professional-level VCF Troubleshooting course with live lab options, all the way up to the advanced domain-specific playlists that lead to VCAP certifications in Storage, Automation, Operations, VKS and Networking. For those with their sights set even higher, the VCDX Distinguished Expert certification remains the ultimate destination.
The Community Response
In both Amsterdam and Minneapolis, the session sparked great conversations. Whether it was the Golden Rule for log analysis, the referenceToken trick for bridging the Fleet Management UI with the Developer Center API, or the new capabilities in VCF Operations, attendees left with concrete techniques they could apply immediately. That's exactly what VMUG sessions should deliver.
If you missed the session, the slides are available for download and on SlideShare.


