Category: LINUX

2011-09-17 22:26:21

In Part 1, Part 2, Part 3, and Part 4, I reviewed performance issues for a single-thread program executing a long vector sum-reduction — a single-array read-only computational kernel — on a 2-socket system with a pair of AMD Family10h Opteron Revision C2 (“Shanghai”) quad-core processors. In today’s post, I will present the results for the same set of 15 implementations run on four additional systems.
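As a rough illustration of the kind of kernel being timed, here is a sketch (my own, not the code from the tar file mentioned below) in the spirit of Version003/Version004 from the table: eight scalar partial sums, plus a software-prefetch hint a fixed distance ahead.

```c
#include <xmmintrin.h>   /* _mm_prefetch */

/* Sketch of a sum-reduction with 8 partial sums and software prefetch
   (roughly Version003 + Version004); assumes N is a multiple of 8.
   The prefetch distance (8 cache lines ahead) is an arbitrary choice
   for illustration, not a tuned value from the original study. */
double sum_reduce(const double *a, long N)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;

    for (long i = 0; i < N; i += 8) {
        /* hint: fetch the line 64 elements (512 bytes) ahead; prefetching
           slightly past the end of the array is harmless */
        _mm_prefetch((const char *)&a[i + 64], _MM_HINT_T0);
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
        s4 += a[i + 4]; s5 += a[i + 5];
        s6 += a[i + 6]; s7 += a[i + 7];
    }
    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```

The partial sums break the dependence chain on a single accumulator, and the prefetch hint helps keep more cache-line fetches in flight; the table below shows how much each of those changes is worth on each system.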

Test Systems
  1. 2-socket AMD Family10h Opteron Revision C2 (“Shanghai”), 2.9 GHz quad-core, dual-channel DDR2/800 per socket. (This is the reference system.)
  2. 2-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
  3. 4-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
  4. 4-socket AMD Family10h Opteron 6174, Revision E0 (“Magny-Cours”), 2.2 GHz twelve-core, four-channel DDR3/1333 per socket.
  5. 1-socket AMD PhenomII 555, Revision C2, 3.2 GHz dual-core, dual-channel DDR3/1333.

All systems were running TACC’s customized Linux kernel, except for the PhenomII, which was running Fedora 13. The same set of binaries, generated by the Intel version 11.1 C compiler, was used in all cases.

The source code, scripts, and results are all available in a tar file:

Results
(all bandwidths in GB/s)

| Code Version | Notes | Vector SSE | Large Page | SW Prefetch | 4 KiB pages accessed | Ref System (2p Shanghai) | 2-socket Istanbul | 4-socket Istanbul | 4-socket Magny-Cours | 1-socket PhenomII |
|---|---|---|---|---|---|---|---|---|---|---|
| Version001 | "-O1" |   |   |   | 1 | 3.401 | 3.167 | 4.311 | 3.734 | 4.586 |
| Version002 | "-O2" |   |   |   | 1 | 4.122 | 4.035 | 5.719 | 5.120 | 5.688 |
| Version003 | 8 partial sums |   |   |   | 1 | 4.512 | 4.373 | 5.946 | 5.476 | 6.207 |
| Version004 | add SW prefetch |   |   | Y | 1 | 6.083 | 5.732 | 6.489 | 6.389 | 7.571 |
| Version005 | add vector SSE | Y |   | Y | 1 | 6.091 | 5.765 | 6.600 | 6.398 | 7.580 |
| Version006 | remove prefetch | Y |   |   | 1 | 5.247 | 5.159 | 6.787 | 6.403 | 6.976 |
| Version007 | add large pages | Y | Y |   | 1 | 5.392 | 5.234 | 7.149 | 6.653 | 7.117 |
| Version008 | split into triply-nested loop | Y | Y |   | 1 | 4.918 | 4.914 | 6.661 | 6.180 | 6.616 |
| Version009 | add SW prefetch | Y | Y | Y | 1 | 6.173 | 5.901 | 6.646 | 6.568 | 7.736 |
| Version010 | multiple pages/loop | Y | Y | Y | 2 | 6.417 | 6.174 | 7.569 | 6.895 | 7.913 |
| Version011 | multiple pages/loop | Y | Y | Y | 4 | 7.063 | 6.804 | 8.319 | 7.245 | 8.583 |
| Version012 | multiple pages/loop | Y | Y | Y | 8 | 7.260 | 6.960 | 8.378 | 7.205 | 8.642 |
| Version013 | Version010 minus SW prefetch | Y | Y |   | 2 | 5.864 | 6.009 | 7.667 | 6.676 | 7.469 |
| Version014 | Version011 minus SW prefetch | Y | Y |   | 4 | 6.743 | 6.483 | 8.136 | 6.946 | 8.291 |
| Version015 | Version012 minus SW prefetch | Y | Y |   | 8 | 6.978 | 6.578 | 8.112 | 6.937 | 8.463 |
Comments

There are lots of results in the table above, and I freely admit that I don’t understand all of the details. There are a couple of important patterns in the data that are instructive:

  • For the most part, the 2p Istanbul results are slightly slower than the 2p Shanghai results. This is exactly what is expected given the slightly better memory latency of the Shanghai system (74 ns vs 78 ns). The effective concurrency (Measured Bandwidth * Idle Latency) is almost identical across all fifteen implementations.
  • The 4-socket Istanbul system gets a large boost in performance from the activation of the “HT Assist” feature — AMD’s implementation of what are typically referred to as “probe filters”. By tracking potentially modified cache lines, this feature allows a reduction in memory latency for the common case of data that is not modified in other caches. The local memory latency on the 4p Istanbul box is about 54 ns, compared to 78 ns on the 2p Istanbul box (where the “HT Assist” feature is not activated by default). The performance boost seen is not as large as the latency ratio, but the improvements are still large.
  • This is my first set of microbenchmark measurements on a “Magny-Cours” system, so there are probably some details that I need to learn about. Idle memory latency on the system is 56.4 ns — slightly higher than on the 4p Istanbul system, as expected with the slower processor cores (2.2 GHz vs 2.6 GHz) — but the slow-down in sustained bandwidth is larger than the straight latency ratio would predict. Overall, however, the performance profile of the Magny-Cours is similar to that of the 4p Istanbul box, but with slightly lower effective concurrency in most of the code versions tested here. Note that the Magny-Cours system is configured with much faster DRAM: DDR3/1333 compared to DDR2/800. The similarity of the results strongly supports the hypothesis that sustained bandwidth is controlled by concurrency when running a single thread.
  • The best performance is provided by the cheapest box — a single-socket desktop system. This is not surprising given the low memory latency on the single socket system.

Several of the comments above refer to the “Effective Concurrency”, which I compute as the product of the measured Bandwidth and the idle memory Latency (see my earlier post for some example data). For the test cases and systems mentioned above, the effective concurrency (measured in cache lines) is presented below:
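As a concrete example of that calculation (my own illustration, using the 74 ns Shanghai idle latency and the 7.260 GB/s Version012 result quoted above):

```c
#include <stdio.h>

/* Effective concurrency = measured bandwidth * idle latency, expressed in
   64-byte cache lines.  Example inputs: 2p Shanghai idle latency (74 ns)
   and the Version012 bandwidth on that system (7.260 GB/s). */
int main(void)
{
    double bandwidth_GBs   = 7.260;                       /* GB/s */
    double latency_ns      = 74.0;                        /* ns   */
    double bytes_in_flight = bandwidth_GBs * latency_ns;  /* GB/s * ns = bytes */
    double lines_in_flight = bytes_in_flight / 64.0;      /* 64-byte lines */

    printf("%.0f bytes in flight = %.1f cache lines\n",
           bytes_in_flight, lines_in_flight);
    return 0;
}
```

For this case the product is roughly 537 bytes, or about 8.4 cache lines in flight.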

4 Responses to “Optimizing AMD Opteron Memory Bandwidth, Part 5: single-thread, read-only”
  1.   Pete Stevenson Says:
    June 7th, 2011 at 1:06 am

    Hi John,

    I would be very interested to see if your NUMA systems see a boost in performance by using the following:
    numactl --interleave all -- cmd
    where cmd is your binary.

    I have a dual socket Nehalem with 3 channels per socket, all channels populated with DDR3-1333. I calculate the peak theoretical BW as ~60 GB/sec. Using stream, I see about 10 GB/sec, or 1/3 of one socket’s capacity. When using numactl --interleave all, I see about 20 GB/sec (or 1/3 of total capacity). I don’t get a full 2x, but it’s close. I can e-mail you the detailed results if you are interested. FWIW, I believe Linux will allocate the memory on the socket that first touches it, i.e., for the streaming benchmarks, *all* of the memory is allocated on one socket only (unless the interleave policy is set).

    Thank you,
    Pete Stevenson

  2.   John Says:
    June 7th, 2011 at 4:04 pm

    A STREAM result of 9-10 GB/s is typical for a single thread on a Nehalem or Westmere system, whether configured with two or three DDR3 DRAM channels.

    The standard STREAM implementation initializes the data using the same processor/memory affinity that the benchmark kernels use, so a “first touch” memory allocation policy should result in (almost) all local accesses. Under Linux, I usually run the benchmark using “numactl” to enforce local first touch allocation.

    A dual-socket Nehalem EP system with three channels of DDR3/1066 per socket has a peak bandwidth of 51.2 GB/s and typically delivers 30-31 GB/s on STREAM when using 4, 6, or 8 threads (evenly distributed across the two chips).
    A dual-socket Westmere EP system with three channels of DDR3/1333 per socket has a peak bandwidth of 64.0 GB/s and typically delivers 40-41 GB/s on STREAM when using 6, 8, 10, or 12 threads (evenly distributed across the two chips).

    My experience has been that performance with interleaving is highly variable because of the 4kB page granularity. Cache-line interleaving can be very effective if sufficient link bandwidth is provided between the chips (as in POWER4/POWER5/POWER7 MCM-based systems), but page-level interleaving usually results in short-lived “hot spots” limiting the scalability.

    With cache-line interleaving, each thread fetches two or three streams of cache lines from alternating sockets. This fine granularity allows the memory controller’s access reordering to work reasonably effectively. On the other hand, in the page-level interleaving case, each thread will issue 64 consecutive cache line fetches to each of two or three arrays. It is unlikely that these will be evenly distributed across the two chips, and the large number of contiguous fetches makes it impossible for practical memory controllers to reorder around the blocks.
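    To make the granularity difference concrete, here is a small sketch (my own illustration, assuming a 2-socket system, 64-byte cache lines, and 4 KiB pages) of how a streamed address range maps to sockets under the two policies:

```c
#include <stdio.h>

#define LINE_BYTES 64UL
#define PAGE_BYTES 4096UL

/* Illustration only: "home" socket of an address in a 2-socket system
   under the two interleaving granularities discussed above. */
static int home_socket_line_interleave(unsigned long addr)
{
    return (int)((addr / LINE_BYTES) & 1);   /* alternates every cache line */
}

static int home_socket_page_interleave(unsigned long addr)
{
    return (int)((addr / PAGE_BYTES) & 1);   /* alternates every 4 KiB page */
}

int main(void)
{
    /* A streaming read walks addresses 0, 64, 128, ...  With cache-line
       interleaving the home socket flips on every fetch; with page
       interleaving, 4096/64 = 64 consecutive fetches go to one socket
       before switching to the other. */
    for (unsigned long addr = 0; addr < 2 * PAGE_BYTES; addr += LINE_BYTES)
        printf("addr %6lu: line-interleave -> socket %d, page-interleave -> socket %d\n",
               addr, home_socket_line_interleave(addr),
               home_socket_page_interleave(addr));
    return 0;
}
```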

  3.   Pete Stevenson Says:
    June 8th, 2011 at 2:44 pm

    John,

    Thanks for the note in reply. I still think using a page interleaved policy could improve your single thread results (i.e. for situations where your optimizations didn’t push you to a concurrency of 8). Realizing that this is somewhat beside the point — I want to get a version of stream working on my system that demonstrates its peak sustainable bandwidth.

    System specs:
    Nehalem/Xeon E5520 @ 2.26 GHz
    dual socket
    3 channels / socket
    12 DIMMs @ 4 GB each for 48 GB
    all DIMMs are DDR3/1333

    It has come to my attention (feel free to correct me, or comment) that the E5520 part only supports up to DDR3/1066, thus my peak theoretical bw is:
    2*3*8*1.066 = 51.2 GB/sec

    So far I have gotten stream up to 22 GB/sec and I have seen X86membench from BenchIT go to 26 GB/sec. I do like the fact that stream is a much simpler test bench. The question becomes: what are the essential tricks I need to use to get to the highest peak sustainable bandwidth (i.e. as demonstrated by stream)?

    Thank you,
    Pete Stevenson

  4.   John Says:
    June 9th, 2011 at 11:56 am

    If I understand your results correctly, then you are already getting amazingly good bandwidth for a single thread on a two-socket system.

    For the four kernels of the STREAM benchmark, there are two cases to look at — 1:1 read/write (COPY and SCALE) and 2:1 read/write (ADD and TRIAD).

    Case 1: COPY and SCALE kernels with 1:1 read/write traffic:

    The Nehalem E5520 runs its QPI links at up to 5.86 Gtransfers/sec, or a peak of 11.5 GB/s per direction. I don’t know the QPI protocol in detail, but for bidirectional traffic generated by one local read, one local non-temporal store, one remote read, and one remote non-temporal store I would expect a protocol overhead (including outbound requests, probe responses, data packet headers, and flow control) of about 45%, so the limiter should be the sustained value of 11.5*55%= ~6.3 GB/s of read traffic plus ~6.3 GB/s write traffic on the QPI link.
    The overall peak bandwidth should therefore be about 25-26 GB/s, consisting of 6.3 GB/s for each of (local and remote) (reads and writes).
    This is very close to the value that you quoted from X86membench.

    Case 2: ADD and TRIAD kernels with 2:1 read/write traffic:

    Again, I don’t know the QPI protocol in detail, but for bidirectional traffic generated by two local reads, one local non-temporal store, two remote reads and one remote non-temporal store I would expect a minimum of about 35% protocol overhead (including outbound requests, probe responses, data packet headers, and flow control), so the limiter should be the sustained value of 11.5*65%= ~7.5 GB/s of read traffic on the QPI link.

    The total bandwidth (for the ADD and TRIAD kernels) would then be
    7.5 GB/s reads from the remote chip
    7.5 GB/s reads from the local chip
    3.75 GB/s writes to the remote chip
    3.75 GB/s writes to the local chip
    ----------
    22.4 GB/s <-- very close to what you are observing

    Of course these values are just estimates, based on my expectations of the types and sizes of the transactions that have to be included in the QPI cache coherence protocol.
    These estimates could be tightened by running carefully controlled microbenchmarks and using the Nehalem performance counters to monitor the specific transactions on the QPI interface, but at first glance I would say that you are pretty close to the limits of what the QPI interface can support for this set of transactions.
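    Putting the two estimates above into one small calculation (the 11.5 GB/s link figure and the 45% / 35% protocol-overhead numbers are the rough guesses stated above, not measured values):

```c
#include <stdio.h>

int main(void)
{
    double qpi_GBs = 11.5;                        /* per-direction QPI peak, GB/s */

    /* Case 1: COPY/SCALE, 1:1 read/write, ~45% assumed protocol overhead.
       The QPI link sustains ~6.3 GB/s each way; local traffic mirrors it. */
    double rw_limit   = qpi_GBs * (1.0 - 0.45);
    double copy_total = 4.0 * rw_limit;           /* local+remote reads+writes */

    /* Case 2: ADD/TRIAD, 2:1 read/write, ~35% assumed protocol overhead.
       Remote reads are limited to ~7.5 GB/s; writes run at half that rate. */
    double read_limit = qpi_GBs * (1.0 - 0.35);
    double add_total  = 2.0 * read_limit + 2.0 * (read_limit / 2.0);

    printf("COPY/SCALE estimate: %.1f GB/s\n", copy_total);  /* ~25.3 GB/s */
    printf("ADD/TRIAD estimate:  %.1f GB/s\n", add_total);   /* ~22.4 GB/s */
    return 0;
}
```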
