Category: LINUX
2011-09-17 22:26:21
In Part1, Part2, Part3, and Part4, I reviewed performance issues for a single-thread program executing a long vector sum-reduction — a single-array read-only computational kernel — on a 2-socket system with a pair of AMD Family10h Opteron Revision C2 ("Shanghai") quad-core processors. In today's post, I will present the results for the same set of 15 implementations run on four additional systems.
Test Systems
All systems were running TACC's customized Linux kernel, except for the PhenomII, which was running Fedora 13. The same set of binaries, generated by the Intel version 11.1 C compiler, was used in all cases.
The source code, scripts, and results are all available in a tar file:
Results

Code Version | Notes | Vector SSE | Large Page | SW Prefetch | 4 KiB pages accessed | Ref System (2p Shanghai) | 2-socket Istanbul | 4-socket Istanbul | 4-socket Magny-Cours | 1-socket PhenomII |
---|---|---|---|---|---|---|---|---|---|---|
Version001 | "-O1" | – | – | – | 1 | 3.401 GB/s | 3.167 GB/s | 4.311 GB/s | 3.734 GB/s | 4.586 GB/s |
Version002 | "-O2" | – | – | – | 1 | 4.122 GB/s | 4.035 GB/s | 5.719 GB/s | 5.120 GB/s | 5.688 GB/s |
Version003 | 8 partial sums | – | – | – | 1 | 4.512 GB/s | 4.373 GB/s | 5.946 GB/s | 5.476 GB/s | 6.207 GB/s |
Version004 | add SW prefetch | – | – | Y | 1 | 6.083 GB/s | 5.732 GB/s | 6.489 GB/s | 6.389 GB/s | 7.571 GB/s |
Version005 | add vector SSE | Y | – | Y | 1 | 6.091 GB/s | 5.765 GB/s | 6.600 GB/s | 6.398 GB/s | 7.580 GB/s |
Version006 | remove prefetch | Y | – | – | 1 | 5.247 GB/s | 5.159 GB/s | 6.787 GB/s | 6.403 GB/s | 6.976 GB/s |
Version007 | add large pages | Y | Y | – | 1 | 5.392 GB/s | 5.234 GB/s | 7.149 GB/s | 6.653 GB/s | 7.117 GB/s |
Version008 | split into triply-nested loop | Y | Y | – | 1 | 4.918 GB/s | 4.914 GB/s | 6.661 GB/s | 6.180 GB/s | 6.616 GB/s |
Version009 | add SW prefetch | Y | Y | Y | 1 | 6.173 GB/s | 5.901 GB/s | 6.646 GB/s | 6.568 GB/s | 7.736 GB/s |
Version010 | multiple pages/loop | Y | Y | Y | 2 | 6.417 GB/s | 6.174 GB/s | 7.569 GB/s | 6.895 GB/s | 7.913 GB/s |
Version011 | multiple pages/loop | Y | Y | Y | 4 | 7.063 GB/s | 6.804 GB/s | 8.319 GB/s | 7.245 GB/s | 8.583 GB/s |
Version012 | multiple pages/loop | Y | Y | Y | 8 | 7.260 GB/s | 6.960 GB/s | 8.378 GB/s | 7.205 GB/s | 8.642 GB/s |
Version013 | Version010 minus SW prefetch | Y | Y | – | 2 | 5.864 GB/s | 6.009 GB/s | 7.667 GB/s | 6.676 GB/s | 7.469 GB/s |
Version014 | Version011 minus SW prefetch | Y | Y | – | 4 | 6.743 GB/s | 6.483 GB/s | 8.136 GB/s | 6.946 GB/s | 8.291 GB/s |
Version015 | Version011 minus SW prefetch | Y | Y | – | 8 | 6.978 GB/s | 6.578 GB/s | 8.112 GB/s | 6.937 GB/s | 8.463 GB/s |
There are lots of results in the table above, and I freely admit that I don’t understand all of the details. There are a couple of important patterns in the data that are instructive….
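To make the table entries a little more concrete for readers who do not want to dig through the tar file, the fragment below is a rough sketch (not the actual benchmark source) of what the "8 partial sums" versions do: the accumulation is split across independent variables so the loop is not serialized on a single floating-point add chain. The real versions layer vector SSE, large pages, software prefetch, and the multi-page access patterns on top of this structure.

```c
/* Sketch of an unrolled sum-reduction with 8 independent partial sums.
 * Splitting the accumulation breaks the dependence on a single add chain,
 * allowing more loads (and adds) to be in flight concurrently. */
double sum_unrolled(const double *a, long n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
    long i;

    for (i = 0; i + 8 <= n; i += 8) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
        s4 += a[i + 4]; s5 += a[i + 5];
        s6 += a[i + 6]; s7 += a[i + 7];
    }
    for (; i < n; i++)          /* remainder elements */
        s0 += a[i];

    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```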
Several of the comments above refer to the "Effective Concurrency", which I compute as the product of the measured Bandwidth and the idle memory Latency (see my earlier post for some example data). For the test cases and systems mentioned above, the effective concurrency (measured in cache lines) is presented below:
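As a purely illustrative aside, the arithmetic behind that metric is simply bandwidth times latency expressed in 64-Byte cache lines; the input values in the sketch below are placeholders rather than measurements from the systems above.

```c
#include <stdio.h>

/* Effective concurrency = bandwidth * idle latency, in 64-Byte cache lines.
 * The inputs here are hypothetical examples, not data from this post. */
int main(void)
{
    double bandwidth_GBs = 6.0;   /* hypothetical measured bandwidth, GB/s */
    double latency_ns    = 74.0;  /* hypothetical idle memory latency, ns  */
    double line_bytes    = 64.0;

    /* GB/s * ns = bytes in flight; divide by 64 to get cache lines. */
    double concurrency = bandwidth_GBs * latency_ns / line_bytes;
    printf("Effective concurrency: %.1f cache lines\n", concurrency);
    return 0;
}
```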
June 7th, 2011 at 1:06 am
Hi John,
I would be very interested to see if your NUMA systems see a boost in performance by using the following:
numactl --interleave=all -- cmd
where cmd is your binary.
I have a dual-socket Nehalem with 3 channels per socket, all channels populated with DDR3-1333. I calculate the peak theoretical BW as ~60 GB/sec. Using STREAM, I see about 10 GB/sec, or 1/3 of one socket's capacity. When using numactl --interleave=all, I see about 20 GB/sec (or 1/3 of total capacity). I don't get a full 2x, but it's close. I can e-mail you the detailed results if you are interested. FWIW, I believe Linux will allocate memory on the socket that first touches it, i.e., for the streaming benchmarks, *all* of the memory is allocated on one socket only (unless the interleave policy is set).
Thank you,
Pete Stevenson
June 7th, 2011 at 4:04 pm
A STREAM result of 9-10 GB/s is typical for a single thread on a Nehalem or Westmere system, whether configured with two or three DDR3 DRAM channels.
The standard STREAM implementation initializes the data using the same processor/memory affinity that the benchmark kernels use, so a “first touch” memory allocation policy should result in (almost) all local accesses. Under Linux, I usually run the benchmark using “numactl” to enforce local first touch allocation.
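For readers who have not seen the first-touch pattern spelled out, here is a minimal sketch (not the actual STREAM source): because the initialization loop and the timed kernel use the same static OpenMP schedule, each thread first writes, and therefore physically allocates, the pages it will later access, so the traffic stays node-local.

```c
/* Minimal first-touch sketch; compile with OpenMP enabled (e.g. -fopenmp). */
#define N 20000000
static double a[N], b[N], c[N];

void init_first_touch(void)
{
    /* Pages of a[], b[], c[] are physically allocated on the node of the
     * thread that first writes them. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.0;
    }
}

void triad(double scalar)
{
    /* Same static schedule as the initialization, so each thread's
     * accesses land (almost entirely) in local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
}
```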
A dual-socket Nehalem EP system with three channels of DDR3/1066 per socket has a peak bandwidth of 51.2 GB/s and typically delivers 30-31 GB/s on STREAM when using 4, 6, or 8 threads (evenly distributed across the two chips).
A dual-socket Westmere EP system with three channels of DDR3/1333 per socket has a peak bandwidth of 64.0 GB/s and typically delivers 40-41 GB/s on STREAM when using 6, 8, 10, or 12 threads (evenly distributed across the two chips).
My experience has been that performance with interleaving is highly variable because of the 4 KiB page granularity. Cache-line interleaving can be very effective if sufficient link bandwidth is provided between the chips (as in POWER4/POWER5/POWER7 MCM-based systems), but page-level interleaving usually results in short-lived "hot spots" that limit scalability.
With cache-line interleaving, each thread fetches two or three streams of cache lines from alternating sockets. This fine granularity allows the memory controller's access reordering to work reasonably effectively. In the page-level interleaving case, on the other hand, each thread will issue 64 consecutive cache-line fetches to each of two or three arrays. It is unlikely that these will be evenly distributed across the two chips, and the large number of contiguous fetches makes it impossible for practical memory controllers to reorder around the blocks.
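To make the granularity difference concrete, here is a toy model (real hardware and OS mappings are more involved than a simple modulo) of how consecutive cache lines map to sockets under the two policies:

```c
#include <stdio.h>

/* Toy model of interleaving granularity on a 2-socket system: with
 * cache-line interleave the home socket alternates every 64 Bytes, while
 * with 4 KiB page interleave 64 consecutive cache lines share one socket. */
int main(void)
{
    const unsigned long line_bytes = 64, page_bytes = 4096, sockets = 2;

    for (unsigned long line = 0; line < 8; line++) {
        unsigned long addr = line * line_bytes;
        printf("line %2lu: line-interleave -> socket %lu, "
               "page-interleave -> socket %lu\n",
               line,
               (addr / line_bytes) % sockets,
               (addr / page_bytes) % sockets);
    }
    return 0;
}
```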
June 8th, 2011 at 2:44 pm
John,
Thanks for the note in reply. I still think using a page-interleaved policy could improve your single-thread results (i.e., for situations where your optimizations didn't push you to a concurrency of 8). Realizing that this is somewhat beside the point: I want to get a version of STREAM working on my system that demonstrates its peak sustainable bandwidth.
System specs:
Nehalem/Xeon E5520 @ 2.26 GHz
dual socket
3 channels / socket
12 DIMMs @ 4 GB each for 48 GB
all DIMMs are DDR3/1333
It has come to my attention (feel free to correct me, or comment) that the E5520 part only supports up to DDR3/1066, thus my peak theoretical BW is:
2 sockets * 3 channels * 8 bytes * 1.066 GT/s ≈ 51.2 GB/sec
So far I have gotten STREAM up to 22 GB/sec and I have seen X86membench from BenchIT go to 26 GB/sec. I do like the fact that STREAM is a much simpler test bench. The question becomes: what are the essential tricks I need to use to get to the highest peak sustainable bandwidth (i.e., as demonstrated by STREAM)?
Thank you,
Pete Stevenson
June 9th, 2011 at 11:56 am
If I understand your results correctly, then you are already getting amazingly good bandwidth for a single thread on a two-socket system.
For the four kernels of the STREAM benchmark, there are two cases to look at — 1:1 read/write (COPY and SCALE) and 2:1 read/write (ADD and TRIAD).
Case 1: COPY and SCALE kernels with 1:1 read/write traffic:
The Nehalem E5520 runs its QPI links at up to 5.86 Gtransfers/sec, or a peak of 11.5 GB/s per direction. I don't know the QPI protocol in detail, but for bidirectional traffic generated by one local read, one local non-temporal store, one remote read, and one remote non-temporal store I would expect a protocol overhead (including outbound requests, probe responses, data packet headers, and flow control) of about 45%, so the limiter should be the sustained value of 11.5 GB/s * 55% ≈ 6.3 GB/s of read traffic plus ~6.3 GB/s of write traffic on the QPI link.
The overall peak bandwidth should therefore be about 25-26 GB/s, consisting of ~6.3 GB/s for each of the four streams: local reads, local writes, remote reads, and remote writes.
This is very close to the value that you quoted from X86membench.
Case 2: ADD and TRIAD kernels with 2:1 read/write traffic:
Again, I don't know the QPI protocol in detail, but for bidirectional traffic generated by two local reads, one local non-temporal store, two remote reads, and one remote non-temporal store I would expect a minimum of about 35% protocol overhead (including outbound requests, probe responses, data packet headers, and flow control), so the limiter should be the sustained value of 11.5 GB/s * 65% ≈ 7.5 GB/s of read traffic on the QPI link.
The total bandwidth (for the ADD and TRIAD kernels) would then be
7.5 GB/s reads from the remote chip
7.5 GB/s reads from the local chip
3.75 GB/s writes to the remote chip
3.75 GB/s writes to the local chip
----------
22.4 GB/s <-- very close to what you are observing
Of course these values are just estimates, based on my expectations of the types and sizes of the transactions that have to be included in the QPI cache coherence protocol.
These estimates could be tightened by running carefully controlled microbenchmarks and using the Nehalem performance counters to monitor the specific transactions on the QPI interface, but at first glance I would say that you are pretty close to the limits of what the QPI interface can support for this set of transactions.
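Purely as an arithmetic aid, the sketch below just redoes the two estimates above in code; the 45% and 35% protocol-overhead fractions are the rough assumptions stated in the text, not measured values.

```c
#include <stdio.h>

/* Back-of-the-envelope QPI limit estimates from the discussion above. */
int main(void)
{
    double qpi_per_dir = 11.5;   /* GB/s per direction at 5.86 GT/s */

    /* Case 1: COPY/SCALE, 1:1 read/write, ~45% assumed protocol overhead.
     * The QPI link sustains ~6.3 GB/s of reads plus ~6.3 GB/s of writes;
     * the matching local read and write streams bring the total to four
     * streams of ~6.3 GB/s each. */
    double rw_stream = qpi_per_dir * 0.55;
    printf("COPY/SCALE estimate: %.1f GB/s\n", 4.0 * rw_stream);

    /* Case 2: ADD/TRIAD, 2:1 read/write, ~35% assumed protocol overhead.
     * Remote reads are limited to ~7.5 GB/s, local reads match that, and
     * writes on each side run at half the read rate. */
    double read_stream = qpi_per_dir * 0.65;
    printf("ADD/TRIAD estimate:  %.1f GB/s\n",
           2.0 * read_stream + 2.0 * 0.5 * read_stream);
    return 0;
}
```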