TurboTalk
Management in the Age of Virtualization
A recent article by David Vellante claims:
The fact is, most data center managers wouldn’t trust VMware to manage their Tier 1 applications because if something goes wrong performance-wise, you still need to roll in the VMware PhDs to solve it.
While such a statement can be controversial, it is difficult to ignore its valuable substance:
In what follows we consider the first two claims.
A VM exports to its guest OS and applications the semantics of the underlying physical resources, but not the performance guarantees they provide. Indeed, higher consolidation ratios and higher utilization of the physical resources necessarily mean more competition among workloads for those resources. This competition, in turn, can breed complex interference patterns and performance problems.
Consider a sample problem scenario. An application administrator asks you to increase the CPU budget allocated to their VM so that it can handle growing workloads. You double the VM allocation from 2 vCPUs to 4. Surprisingly, the performance of the application degrades rather than improves.
You face a few challenging questions:
VMware provides helpful documentation for handling these challenges. Its guides to performance monitoring and troubleshooting describe CPU problems and can help you address the first question. There are also articles that discuss performance monitoring counters, esxtop metrics, and their diagnostic meaning. For example, you may see an excessive value of the %RDY counter, which describes the “percent time spent by a VM waiting for CPU(s) to become available”.
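To make the %RDY check concrete, here is a minimal Python sketch that flags VMs whose ready time looks excessive. Everything in it is an assumption for illustration: the `VmSample` record, the sample values, and the 10% per-vCPU threshold are hypothetical and not taken from VMware's documentation. The sketch also assumes the reading you have is the total across the VM's vCPUs, so it normalizes by the vCPU count before comparing.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """One hypothetical reading for a VM (not an esxtop data structure)."""
    name: str
    vcpus: int
    ready_pct: float  # assumed: %RDY summed over all of the VM's vCPUs


def flag_high_ready(samples, per_vcpu_threshold=10.0):
    """Return (name, per-vCPU %RDY) for VMs above the assumed threshold."""
    flagged = []
    for s in samples:
        per_vcpu = s.ready_pct / s.vcpus  # normalize by vCPU count
        if per_vcpu > per_vcpu_threshold:
            flagged.append((s.name, round(per_vcpu, 1)))
    return flagged


# Made-up readings: a 4-vCPU VM with heavy waiting and a 2-vCPU VM without.
samples = [
    VmSample("app-vm", vcpus=4, ready_pct=60.0),
    VmSample("db-vm", vcpus=2, ready_pct=8.0),
]
high_ready = flag_high_ready(samples)
```

A check like this only tells you that a VM is waiting; it says nothing yet about why, which is the harder question.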
Now, why would the VM wait for CPUs for so long? Such waiting indicates competition with other VMs. But shouldn’t a 4-vCPU configuration win a larger competitive share than a 2-vCPU one? The answer is provided in this article about SMP coscheduling mechanisms. These mechanisms seek to give the VM the semantics of 4 vCPUs. However, when CPU resources are under tight competition, the VM waits longer for 4 CPUs to become available than it would for 2, and the longer wait shows up as a higher %RDY. Once this root cause has been determined, resolving the problem is straightforward (e.g., free CPUs by shifting VMs to other resources).
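The intuition that a wider VM waits longer can be sketched with a toy model. This is not VMware's scheduler: it assumes strict coscheduling (the VM runs only when all of its vCPUs can be placed at once), it assumes each physical core is busy independently with the same probability in every scheduling slot, and the function names and numbers are hypothetical.

```python
from math import comb


def prob_at_least_k_free(total_cores, busy_prob, k):
    """P(at least k of total_cores are idle in a slot), cores independent."""
    free_prob = 1.0 - busy_prob
    return sum(
        comb(total_cores, i) * free_prob**i * busy_prob ** (total_cores - i)
        for i in range(k, total_cores + 1)
    )


def expected_wait_slots(total_cores, busy_prob, k):
    """Expected slots spent waiting before k cores are free together.

    With success probability p per slot, the number of failed slots
    before the first success is geometric with mean (1 - p) / p.
    """
    p = prob_at_least_k_free(total_cores, busy_prob, k)
    return float("inf") if p == 0 else (1.0 - p) / p


# Assumed host: 8 cores, each busy 80% of the time.
wait_2_vcpus = expected_wait_slots(8, 0.8, 2)
wait_4_vcpus = expected_wait_slots(8, 0.8, 4)
```

Under these assumed numbers, the expected wait for 4 simultaneously free cores comes out far longer than the wait for 2, which matches the direction of the effect described above. Real hypervisor schedulers are more sophisticated than this model, so treat it as an illustration of the trend, not as a way to predict %RDY.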
This process is perhaps what David Vellante meant by “roll in the VMware PhDs”. Indeed, it requires intimate familiarity with hypervisor mechanisms (e.g., coscheduling); understanding what the performance counters mean and how they relate to underlying performance behaviors; correlating the observed symptoms; analyzing the root cause; and handling it. Furthermore, some of these activities require tight collaboration between the virtualization administrators, the application administrators and, possibly, the storage administrators.
To be fair, similar difficulties confronted the management of other emerging technologies. For example, in the early 1990s vendors of routers and LAN switches equipped them with management information bases (MIBs) involving thousands of cryptic counters, not just a few score. Fortunately, a burgeoning management industry developed tools that quickly relieved network administrators of the need to earn network PhDs. These tools have enabled administrators to monitor behaviors in terms of manageable abstractions rather than cryptic instrumentation; to automate the analysis of instrumentation data through smart algorithms; and to simplify and streamline management actions.
The virtualization industry, likewise, needs to replace VMware PhDs with automation and simplification tools that focus on smart analysis and decisions rather than on tracking counter values. Indeed, such simplification and automation of management is a precondition for virtualization to deliver OPEX scalability, much as it has been delivering CAPEX scalability. I will consider these possibilities in future posts.