Introduction
By Andrew Binstock
In
late 2001, early 2002, two processor families based on the Intel
NetBurst® microarchitecture were introduced: the Pentium® 4 and the
Intel® Xeon® processor families. At the time of the launch, Intel
undertook an extensive positioning initiative that aimed to distinguish
these processor families much more clearly than ever before. A key goal
of the effort was to communicate that Intel Xeon chips are intended for
high-end workstations and servers, while Pentium 4 chips are primarily
for use in desktops.
To make the distinction clearer, Intel
abandoned its previous nomenclature, in which Xeon processors carried
the name of the sibling Pentium processor, as in the
popular Pentium® III Xeon® chip. Today, there are just Pentium
processors and Xeon processors. The Xeon offerings are further broken
down into two branches: the dual processor Intel Xeon chip and the
Intel Xeon processor MP (Multi Processor).
This article explains
the technology differences between these two Intel Xeon processor
families and how to choose the right one for a specific application. By
extension, it also explains how the developer of a particular software
application or solution can decide which platform to target.¹
¹
To round out the picture, the Celeron® processor was positioned as the
value entry point, with the 64-bit behemoth, the Itanium® processor, aimed
at large enterprise servers.
Dual vs. Multi-Processor
Intel
Xeon processors, serving dual-processor server and workstation
platforms, can be used solely on motherboards that have either one or
two sockets. Intel Xeon Processor MP-based systems typically have room
for four or more processors, although there is no minimum requirement
of four processors on such systems—they are commonly available with
configurations in which only one or two sockets are populated. This
fact might tempt a prospective customer to purchase only MP processors,
so as to gain maximum configuration flexibility. However, the
processors serve different purposes due to other features of their
architecture—a trait reflected in the premium charged for the MP
processors. The most salient of these features is discussed next.
Cache on Hand
At
the time of this article, Intel Xeon processors (dual processors) ship
with 512KB of level 2 (L2) cache, while the multiprocessor models
contain 1MB or 2MB of level 3 (L3) cache in addition to the 512KB of L2
cache. The additional cache means that considerably more data can be
stored on the processor, a particular advantage for
transaction-oriented servers or systems with high data throughput. To
see how and why this is so, it's important to understand how the
three-level cache works.
The Intel NetBurst microarchitecture as
it appears in Pentium 4 and Intel Xeon processors has two standard
levels of cache. Level 1 (L1) cache is located inside the chip's
execution core. Data items and instructions that the execution pipeline
will immediately process or has just processed are stored here. At the
time of this article, this cache can hold the equivalent of 12,000
microinstructions (the sub-instructions that make up a complete
assembly-language instruction). This design enables the entire code of
an executing loop to be stored in L1 cache, meaning that the processor
has the fastest possible access to all loop instructions.
The L1
cache is itself fed by internal buses that obtain their data and
instructions from the L2 cache, which at 512KB is substantially larger
than its L1 counterpart. L2 cache on Pentium 4 and Intel Xeon chips
(dual processor) is fed by the front-side bus, which is the bus that
moves data between the processor and main memory (as well as to and
from the AGP graphics subsystem). Data is moved to cache when it's
needed immediately or when predictive mechanisms in the processor
'anticipate' it will be needed by upcoming instructions. Programmers
can complement these mechanisms by manually preloading the L2 cache
with needed data items.
The front-side bus (FSB) that feeds the L2
cache can run at various speeds (more on this later). At the base rate
of 400MHz, the FSB could deliver 3.2GB/sec to the L2 cache. Even at
this speed, if the processor needs an item that is not in either of its
caches (a situation known as a cache miss) a steep penalty is imposed
on performance: The processor must idle while data is retrieved from
memory. Despite the bus speed, the memory itself takes a while to find
and deliver the needed data.²
While Intel has built-in, advanced
technology to limit the number of cache misses, the optimal solution is
to provide an even larger cache that can contain more data. This is
particularly true for enterprise applications where many data items may
need to be in play simultaneously to complete a transaction. Responding
to this need, Intel added a third level of cache (L3) to its high-end
32-bit server-oriented chips—the Intel Xeon Processor MP.
This
three-tier design means that if an L2-cache miss occurs, the L3 cache
is examined before a fetch to memory is initiated. With up to 2MB cache
at the L3 level, performance is greatly enhanced by the reduction in
memory latencies and by the greater flexibility in cache management
presented by so large a capacity. Later discussion of performance
benchmarks demonstrates this.
² This situation actually presents ideal conditions for use of
Hyper-Threading Technology (HT Technology). Useful work can be performed by a separate thread on a cache miss.
Bus Speeds
The
FSBs of the Intel NetBurst microarchitecture run at effective rates of
400MHz and 533MHz. These are effective rates because of some magic that
Intel performs. The hardware speeds are 100MHz and 133MHz respectively.
However, rather than sending data once per tick, as is traditional on
buses, Intel sends data four times per tick, meaning that the 100MHz bus attains
the data throughput of a 400MHz bus. Because this bus is 64 bits (or 8
bytes) wide, at 400MHz the throughput is 3.2GB/sec. At 533MHz, it's
4.3GB/sec.
Today, the fastest FSB for the dual processor, Intel
Xeon chip is 533MHz, while the multiprocessor model tops out at 400MHz.
It may seem surprising that the higher-end chip would have a slower
bus but, in fact, it's not. The 3.2GB/sec capacity of the Intel Xeon
processor MP is a very difficult pipe to keep filled. Only in
situations where a processor is reading huge blocks of data can this
level be reached. These situations do occur in imaging and advanced
multimedia applications, which, coincidentally, are also
computationally intensive. Hence, the faster bus is a better fit with
the processor that handles such tasks—the dual processor, Intel Xeon
chip.
Enterprise computing, on the other hand, requires
processing of numerous smaller data items that on aggregate never fully
fill a bus at 3.2GB/sec on a sustained basis. In view of this, Intel
appears to have chosen not to raise the cost of the Intel Xeon
processor MP by giving it a feature not likely to be needed.
Clock Speeds
Clock
speeds for the Intel Xeon processor families follow the same model as
the bus speeds. Currently, the dual processor model tops out at
3.06GHz, while the fastest Intel Xeon processor MP runs at 2GHz. The
speed differential is similar in origin to the difference in bus
speeds: the primary target of the dual processor chip is
computationally intensive workloads—those that benefit from high clocks
and fast buses. The Intel Xeon processor MP, by contrast, targets
database- and transaction-oriented systems. It's important when considering
this distinction to remember that multiprocessor models will typically
reside on servers with 4 to 8 chips. (In typical cases, that is. Models
from IBM and Unisys feature as many as 32 processors.) As such, they
are designed to scale performance by addition of processors. Their
focus is handling multiple threads with multiple transactions; hence
they need the ability to manage large caches and multiple threads while
performing work that is generally not computationally demanding.
Performance Comparison
AnandTech,
a hardware analysis group, compared servers running both models of Xeon
processors in November 2002. In the company's benchmarks (available at *)
run against two databases—one small, the other moderate—the following
results were obtained. (All processors were running with
Hyper-Threading Technology enabled):
3GB Web database benchmark

System Configuration | Transactions/sec.
2 x Intel Xeon processor (2.8 GHz) | 1260
2 x Intel Xeon processor MP (2.0 GHz) | 1102
4 x Intel Xeon processor MP (2.0 GHz) | 2070
Intel's
decision to add L3 cache to the MP model is borne out by these results:
even though the dual processor model's clock is 40% faster, test
results are only 14% faster—clearly the cache is optimally used in
database transactions. The cache design is even more compelling in
light of the remarkable scalability of the Intel Xeon processor MP.
Going from 2 to 4 chips resulted in an 88% increase in throughput. This
is a truly superior level of processor scalability and demonstrates an
optimal use of processor resources.
Now, let's examine a similar
benchmark run against a much larger database—the kind for which the
Intel Xeon processor MP is designed.
25.2GB Ad database benchmark

System Configuration | Transactions/sec.
2 x Intel Xeon processor (2.8 GHz) | 1433
2 x Intel Xeon processor MP (2.0 GHz) | 1485
4 x Intel Xeon processor MP (2.0 GHz) | 2497
With
greater volumes of data now involved, the multiprocessor systems really
come into their own. Note the 2.0GHz multiprocessor system is handling
more transactions than its 2.8GHz DP sibling. In addition, the
scalability between the dual and quad-processing MP systems remains
very high.
Results such as these can be further enhanced by the
chipset that supports each processor. Chipsets supporting the dual
processor model currently are limited to accessing a maximum of 16GB of
RAM, whereas the multiprocessor chipsets scale to 64GB of RAM. This
difference emphasizes the distinguishing feature of the Intel Xeon
processor MP: much greater headroom for expansion.
Summary
The
key distinctions between the two families of Intel Xeon processors can
be summarized by their salient features: the dual processor model
scales to a maximum of two processors and specializes in
computationally demanding applications where fast clocks and FSBs
generate a significant benefit. The multiprocessor model scales to 32
processors and by its large cache delivers optimal performance in
transactional and database applications where large numbers of data
items are in play at once.
As such, Intel recommends matching processors to software roughly as follows:
- Application
servers, CRM processing, etc.: Dual processor chips for small
installations, Intel Xeon processor MP for sites with moderate to heavy
loads.
- Database servers, ERP packages, enterprise applications: Intel Xeon processor MP.
With front-end servers the picture is a bit less clear.³
Needless
to say, overlap exists between these categories. When the server you're
considering falls into one of these areas of overlap, Intel recommends
examination of the system's projected expansion requirements. If you
anticipate a need for further capacity or performance, the Intel Xeon
processor MP is the only chip that can provide the headroom. However,
if your needs are modest and you expect that they will not grow
significantly beyond current levels, the dual processor model will
probably suffice.
³ Dual
processor chips appear to be well suited for front-end servers such as
web servers. However, multiprocessor chips are being increasingly used
for this purpose as well.
About the Author
Andrew
Binstock is the principal analyst at Pacific Data Works LLC. He was
previously a senior technology analyst at PricewaterhouseCoopers, and
earlier editor in chief of
UNIX Review and C Gazette. He is the
lead author of "Practical Algorithms for Programmers," from
Addison-Wesley Longman, which is currently in its 12th printing and in
use at more than 30 computer-science departments in the United States.