2006-12-13 09:55:28
Posted by the Engineering Performance Group
As virtualization becomes commonplace in the industry there is increasing interest in measuring the performance of virtualized platforms. Plenty of benchmarks exist to measure the performance of physical systems, but they fail to capture essential aspects of virtual infrastructure performance. We need a common workload and methodology for virtualized systems so that benchmark results can be compared across different platforms.
There are a number of unique challenges in creating sound and meaningful benchmarks for virtualized systems:
- Capture the key performance characteristics of virtual systems.
- Ensure that the benchmark is representative of end user environments.
- Make the benchmark specification platform neutral.
- Define a single, easy to understand metric.
- Provide a methodical way to measure scalability.
Let's take a look at these in turn.
Capture the key performance characteristics of virtual systems. Users compare platforms based on their specific needs -- for example, a user running a web server will compare the number of web requests that can be served by each platform, while a DBA will be more interested in the number of database transactions or simultaneous database connections. In the non-virtualized world users typically bind a single application to a single machine, and benchmarks have been developed to provide metrics for important application categories. For the web server and database examples above, users could look at results for the industry standard SPECweb2005 and TPC-C benchmarks, respectively. Such benchmarks run the application to a point where some resource (usually CPU) is saturated and record the performance metric while respecting quality of service measures such as response time. Users not only compare platforms using the metrics provided by these benchmarks, but over time they build up expertise allowing them to relate their particular environment to published benchmark scores.
However, in a virtual environment, the typical usage of a machine is different from what is common on physical machines. One of the key benefits of virtualization is the ability to run multiple virtual machines on the same physical machine to increase the utilization of server resources. Multiple virtual machines running different operating systems and different applications with diverse resource requirements can all be running on the same machine. Moreover, these applications are typically not bottlenecked on any one resource, and have different (often conflicting) response time requirements. Running single application benchmarks one at a time and then aggregating their metrics might appear to be an easy solution, but that approach doesn’t work. Overall performance can be negatively impacted by competing resource demands among the workloads or positively impacted by optimizations such as transparent page sharing. A good virtualization benchmark must include multiple virtual machines running simultaneously.
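To make the idea of driving dissimilar workloads at the same time concrete, here is a minimal sketch. The VM names, the workload mix, and the drive_workload placeholder are all hypothetical rather than a description of any actual benchmark harness; the point is simply that the workloads run concurrently within one measurement interval, so they compete for (and share) the host's resources just as they would in production.

```python
# Minimal sketch (hypothetical mix): drive several dissimilar workload VMs at once
# instead of benchmarking each one in isolation.
from concurrent.futures import ThreadPoolExecutor

# Each entry pairs a guest VM with its workload and the resource it stresses most.
WORKLOAD_MIX = [
    {"vm": "web01",  "workload": "web_server",  "hot_resource": "cpu"},
    {"vm": "db01",   "workload": "database",    "hot_resource": "storage_io"},
    {"vm": "mail01", "workload": "mail_server", "hot_resource": "memory"},
    {"vm": "file01", "workload": "file_server", "hot_resource": "network_io"},
]

def drive_workload(spec):
    """Placeholder for a per-VM load driver; a real harness would generate
    requests against the guest and report its application-level metric."""
    return {"vm": spec["vm"], "metric": 0.0}

def run_concurrently(mix):
    # All workloads run in the same interval, so they share (and compete for)
    # the physical host's CPU, memory, storage, and network.
    with ThreadPoolExecutor(max_workers=len(mix)) as pool:
        return list(pool.map(drive_workload, mix))

if __name__ == "__main__":
    print(run_concurrently(WORKLOAD_MIX))
```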
Ensure that the benchmark is representative of end user environments. Which workloads should run within the virtual machines in the benchmark? This is a difficult question as users run a wide range of guest operating systems (e.g. Windows, Linux or Solaris), virtual hardware configurations (32-bit, 64-bit, 1, 2 or 4 virtual CPUs) and applications in virtual machines. Any workload included in the benchmark must be representative of end user applications, especially in terms of their resource usage. For many virtual environments, CPU utilization of the workloads is an important factor in the overall performance of the system, but so are memory, storage and network I/O. Any benchmark that aims to measure the performance of virtual environments is incomplete if it does not address these resources. A virtualization benchmark must take into account existing customer use cases and future trends in hardware and software.
Make the benchmark specification platform neutral. Care must be taken to ensure that the benchmark specification does not depend on any platform specifics. We’d like to be able to use the benchmark to answer common customer questions such as “What’s the benefit of dual-core (or quad-core) over single-core processors?”, “How does my storage hardware affect the performance of my overall system?” or “What’s the performance difference between hosted virtualization products (like VMware Server) and bare-metal virtualization products (like VMware ESX Server)?”. Another aim of this benchmark is to drive improvement in future platforms and we would not be able to accomplish this if the benchmark was tied to any specific platform.
Define a single, easy to understand metric. Any good benchmark will have a single, simple metric so that it's easy to compare different platforms. Secondary metrics can be used to give additional information, but users will base their comparisons on the primary metric. For a virtualization benchmark, should latency or throughput be the primary metric? Or should the load on all the virtual machines be held constant, and CPU utilization used as the metric, so that the systems able to handle the load with the lowest CPU usage are deemed best? Should the CPU utilization be capped at some limit or should the workload be allowed to saturate the server? How are quality of service constraints factored in? Considerable experimentation is required to determine the best design choices and thus ensure the validity of the benchmark metric.
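One way to see how quality of service can be folded into a throughput metric is sketched below. The 95th-percentile rule and the two-second limit are assumptions made purely for illustration, not choices taken from any real benchmark.

```python
# Minimal sketch: a workload's throughput only counts if its 95th-percentile
# response time stays under an (assumed) limit.

def percentile(samples, pct):
    """Nearest-rank percentile over a list of latency samples, in seconds."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

def qos_compliant_throughput(requests_completed, window_seconds, latencies, limit_s=2.0):
    """Return throughput (requests/second) if QoS is met, otherwise None."""
    if percentile(latencies, 95) > limit_s:
        return None  # the platform sustained the load but missed the QoS bar
    return requests_completed / window_seconds

# Example: 12,000 requests completed in a 600-second window.
print(qos_compliant_throughput(12000, 600, [0.3, 0.8, 1.1, 1.9, 0.5]))  # 20.0
```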
Once the metric for each component workload is defined, the next question is how to aggregate them. The aggregation must be done carefully as the units of the underlying workloads can vary widely and we don’t want a single workload unfairly influencing the final metric. The aggregation should also be meaningful with regard to making the benchmark representative of what end users really run. In addition, the metric for a new benchmark must be easy to reason about, make sense to end users, be easy to compute and reflect underlying platform differences.
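As an illustration of one common aggregation approach -- normalize each workload to a reference score in its own units, then combine the ratios with a geometric mean so that no single workload's units dominate -- consider the sketch below. The reference values and the choice of a geometric mean are assumptions for the example, not the aggregation rule of any particular benchmark.

```python
# Minimal sketch: normalize dissimilar workload metrics to reference scores,
# then take the geometric mean of the ratios.
import math

# Hypothetical reference scores, each in its own native unit
# (e.g. pages/s for the web server, transactions/s for the database).
REFERENCE = {"web_server": 100.0, "database": 40.0, "mail_server": 25.0}

def aggregate(measured):
    """Return the geometric mean of measured/reference ratios."""
    ratios = [measured[name] / REFERENCE[name] for name in REFERENCE]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Example: a platform slightly faster than the reference on every workload.
print(aggregate({"web_server": 120.0, "database": 44.0, "mail_server": 30.0}))  # ~1.17
```

Because the geometric mean works on unit-free ratios, a workload whose raw numbers happen to be large (say, web requests per second) cannot swamp one whose raw numbers are small (say, database transactions per second).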
Provide a methodical way to measure scalability. One of the key benefits of virtualization is being able to consolidate workloads onto fewer machines in a scalable manner. It's important that the benchmark be able to run on a small two-CPU system as well as on the large multicore, multisocket system of tomorrow and provide a meaningful measure of the relative work that can be performed on the two systems. Besides CPU, platform differences in storage and networking hardware can also affect the scalability of the system and need to be captured by the benchmark.
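One plausible way to turn this into a measurement procedure is to add identical sets of workload virtual machines until the host can no longer sustain them, and report the total work done. The sketch below assumes such a scheme; the term "tile", the per_tile_score placeholder, and the numbers are all hypothetical.

```python
# Minimal sketch: score a host by summing the scores of all the workload-VM sets
# ("tiles", an assumed term) it can run while still meeting quality of service.

def per_tile_score(tile_index, host):
    """Placeholder: run one full set of workload VMs and return its aggregate score."""
    return host["tile_scores"][tile_index]

def system_score(host):
    # A larger host sustains more tiles, so it accumulates a higher score.
    return sum(per_tile_score(i, host) for i in range(host["tiles"]))

# Example: a small two-CPU host runs one tile; a larger multisocket host runs four.
small = {"tiles": 1, "tile_scores": [1.0]}
large = {"tiles": 4, "tile_scores": [1.0, 0.98, 0.95, 0.90]}
print(system_score(small), system_score(large))  # 1.0 vs 3.83
```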
In this article we’ve discussed some of the main design challenges for a virtualization benchmark. Turning the design into an easy-to-use benchmark kit brings up additional practical considerations. These include issues such as timing in virtual machines, orchestrating the startup of multiple virtual machines running simultaneously, and determining the right measurement window in the face of bursty workloads. Any approach to creating a benchmark for virtualization must address all these challenges. Creating a benchmark is easy, but creating a credible benchmark that provides a meaningful metric, that measures both workload overhead and scalability, that is representative of end user environments, that cannot be easily defeated, and that is broadly applicable -- is a hard problem!
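On the measurement-window point in particular, here is a minimal sketch of one way to pick a window from a bursty throughput trace: slide a fixed-length window over per-interval samples and keep the least variable one. The window length, the sample values, and the use of the coefficient of variation are illustrative assumptions, not a prescribed method.

```python
# Minimal sketch: choose the steadiest fixed-length window from a bursty
# throughput trace (one sample per measurement interval).
from statistics import mean, pstdev

def steadiest_window(samples, window_len):
    """Return (start_index, mean_throughput) of the least-bursty window."""
    best = None
    for start in range(len(samples) - window_len + 1):
        window = samples[start:start + window_len]
        spread = pstdev(window) / mean(window)  # coefficient of variation
        if best is None or spread < best[2]:
            best = (start, mean(window), spread)
    return best[0], best[1]

# Example: throughput per interval, with bursts at the start and the end.
samples = [40, 95, 60, 62, 61, 63, 59, 58, 90, 30]
print(steadiest_window(samples, window_len=4))  # (2, 61.5)
```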
The benefits of solving this hard problem are great. Having an industry standard way of comparing virtualized solutions will allow users to make more informed decisions regarding the entire stack of virtualization technology. Such a standard can also drive improvements in future hardware and software, again benefiting the industry. For these reasons, VMware is committed to solving this problem. For a while now we've been working on just such a benchmark. We've been talking to many of our customers and partners and doing lots of experiments to develop a sound design and methodology. We're referring to this benchmark as VMmark (for Virtual Machine benchMark) and we plan to present it this November. This is part of VMware's larger effort to promote open standards and formats within the industry. We don't intend for this benchmark to become a VMware product. Rather, we'd like to work with others to make this benchmark an industry standard and thus provide users with the standardized method of comparing platforms they have come to expect from enterprise software.