Category: LINUX
2014-04-02 11:35:37
Many applications frequently copy substantial amounts of data from one area of memory to another, typically using the memcpy() C library function. As this can be quite time consuming, it may be worth spending some time optimizing the functions that do this. There is no single ‘best method’ for implementing a copy routine, as the performance will depend upon many factors (see below). This article looks at seven possible schemes, and compares their performance in a particular system.
Target system and conditions
The various schemes were tested using a Beagle Board, built around the Texas Instruments OMAP 3530 SoC, which contains an ARM Cortex-A8 processor. The implementations assume that the source address, destination address and number of bytes to transfer are all multiples of the level 1 cache line size (64 bytes). However, the tests all copied 16 Mbytes of data, so the overhead of checking alignment and ensuring the main loop could assume 64-byte alignment would be insignificant by comparison (this may not be the case for smaller memory copies). The execution time was measured using the performance counters integrated into the Cortex-A8 processor, which provide a highly accurate measure.
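As a rough illustration of how a production routine might guard these assumptions, here is a sketch in C; fast_copy_aligned is a hypothetical stand-in for one of the assembly loops in this article, not code from the original tests:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHE_LINE 64   /* Cortex-A8 level 1 cache line size */

    /* Hypothetical fast path: a 64-byte-per-iteration copy loop. */
    extern void fast_copy_aligned(void *dst, const void *src, size_t n);

    /* Use the fast loop only when all of its alignment preconditions
       hold; otherwise fall back to the C library. */
    static void copy_dispatch(void *dst, const void *src, size_t n)
    {
        if ((((uintptr_t)dst | (uintptr_t)src | n) & (CACHE_LINE - 1)) == 0)
            fast_copy_aligned(dst, src, n);
        else
            memcpy(dst, src, n);
    }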
For all tests, the L1NEON bit was set, meaning that loads using the NEON instruction set can cause a level 1 data cache linefill. Both the level 1 and level 2 caches were enabled, with the code and data memory regions in use marked as cacheable. The MMU and branch prediction were also enabled.
Some of the routines make use of the preload instruction (PLD). This instruction causes the level 2 cache to start loading the data some time before the processor executes the code that accesses it. This effectively starts the memory request early, so the processor may not have to wait as long for the data to be available.
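The same effect can usually be obtained from C with a compiler builtin: GCC and Clang provide __builtin_prefetch(), which typically compiles to a PLD on ARM. A minimal sketch, with an illustrative prefetch distance that would need tuning for a real system:

    #include <stddef.h>
    #include <stdint.h>

    void copy_words_prefetch(uint32_t *dst, const uint32_t *src, size_t words)
    {
        for (size_t i = 0; i < words; i++) {
            /* Once per 16 words, request data 64 words (256 bytes) ahead.
               PLD never faults, so running past the end is harmless. */
            if ((i & 15) == 0)
                __builtin_prefetch(&src[i + 64]);
            dst[i] = src[i];
        }
    }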
Routines:
Word by Word memory copy
This is a very simple loop which reads one word from the source, writes it to the destination and decrements a counter. The performance of this function is taken as the reference for the other tests.
WordCopy
    LDR     r3, [r1], #4        ; load one word from the source, advance pointer
    STR     r3, [r0], #4        ; store it to the destination, advance pointer
    SUBS    r2, r2, #4          ; decrement the byte count
    BGT     WordCopy            ; repeat while bytes remain
Load-Multiple memory copy
The previous example is modified to use LDM and STM instructions, transferring 8 words per iteration. Due to the extra registers used, these must be stored to the stack and later restored.
LDMCopy
    PUSH    {r4-r10}            ; save the callee-saved registers used
LDMloop
    LDMIA   r1!, {r3-r10}       ; load 8 words (32 bytes) from the source
    STMIA   r0!, {r3-r10}       ; store 8 words to the destination
    SUBS    r2, r2, #32         ; decrement the byte count
    BGT     LDMloop             ; repeat while bytes remain
    POP     {r4-r10}            ; restore the saved registers
NEON memory copy
This implementation uses NEON load and store instructions, transferring 64 bytes per iteration.
NEONCopy
    VLDM    r1!, {d0-d7}        ; load 64 bytes from the source into NEON registers
    VSTM    r0!, {d0-d7}        ; store 64 bytes to the destination
    SUBS    r2, r2, #0x40       ; decrement the byte count by 64
    BGT     NEONCopy            ; repeat while bytes remain
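For comparison, a similar 64-bytes-per-iteration copy can be written in C using the NEON intrinsics from arm_neon.h; this is only a sketch, and the compiler's instruction scheduling determines how close it comes to the hand-written loop:

    #include <arm_neon.h>
    #include <stddef.h>

    /* Copies n bytes, 64 per iteration; assumes n is a multiple of 64. */
    void neon_copy(uint8_t *dst, const uint8_t *src, size_t n)
    {
        while (n) {
            uint8x16_t q0 = vld1q_u8(src);
            uint8x16_t q1 = vld1q_u8(src + 16);
            uint8x16_t q2 = vld1q_u8(src + 32);
            uint8x16_t q3 = vld1q_u8(src + 48);
            vst1q_u8(dst,      q0);
            vst1q_u8(dst + 16, q1);
            vst1q_u8(dst + 32, q2);
            vst1q_u8(dst + 48, q3);
            src += 64;
            dst += 64;
            n -= 64;
        }
    }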
Word by Word memory copy with preload
This version restructures the word-by-word loop so that each outer iteration copies one 64-byte cache line, issuing a PLD for data 256 bytes ahead.
WordCopyPLD
    PLD     [r1, #0x100]        ; preload data 256 bytes ahead of the source pointer
    MOV     r12, #16            ; 16 words = one 64-byte cache line
WordCopyPLD1
    LDR     r3, [r1], #4        ; copy the cache line word by word
    STR     r3, [r0], #4
    SUBS    r12, r12, #1
    BNE     WordCopyPLD1
    SUBS    r2, r2, #0x40       ; one more cache line done
    BNE     WordCopyPLD
NEON memory copy with preload
The NEON copy loop is extended with a PLD for data three cache lines ahead.
NEONCopyPLD
    PLD     [r1, #0xC0]         ; preload data 192 bytes ahead of the source pointer
    VLDM    r1!, {d0-d7}        ; load 64 bytes
    VSTM    r0!, {d0-d7}        ; store 64 bytes
    SUBS    r2, r2, #0x40
    BGT     NEONCopyPLD
Mixed ARM and NEON memory copy with preload
This final implementation interleaves NEON and ARM load and store multiple instructions, as well as using the preload instruction; each iteration copies 128 bytes.
ARMNEONPLD
    PUSH    {r4-r11}            ; free up eight ARM registers for LDM/STM
    MOV     r3, r0              ; use r3 as the destination pointer
ARMNEON
    PLD     [r1, #192]          ; preload two cache lines ahead
    PLD     [r1, #256]
    VLD1.64 {d0-d3},   [r1@128]!    ; NEON loads: 96 bytes, 128-bit aligned
    VLD1.64 {d4-d7},   [r1@128]!
    VLD1.64 {d16-d19}, [r1@128]!
    LDM     r1!, {r4-r11}           ; ARM load: the remaining 32 bytes
    SUBS    r2, r2, #128            ; 128 bytes copied per iteration
    VST1.64 {d0-d3},   [r3@128]!    ; matching NEON stores
    VST1.64 {d4-d7},   [r3@128]!
    VST1.64 {d16-d19}, [r3@128]!
    STM     r3!, {r4-r11}           ; matching ARM store
    BGT     ARMNEON
    POP     {r4-r11}            ; restore the saved registers
Results:
| Test | Cycles | Time (s) | Mbytes/s | Relative |
|------|--------|----------|----------|----------|
| Word by Word memory copy | 52401776 | 0.104804 | 152.7 | 100% |
| Load-Multiple memory copy | 47235011 | 0.094470 | 169.4 | 111% |
| NEON memory copy | 52389453 | 0.104779 | 152.7 | 100% |
| Word by Word memory copy with PLD | 68774347 | 0.137549 | 116.3 | 76% |
| Load-Multiple memory copy with PLD | 53277011 | 0.106554 | 150.2 | 98% |
| NEON memory copy with PLD | 35102279 | 0.070205 | 227.9 | 149% |
| Mixed ARM and NEON memory copy with PLD | 46742131 | 0.093484 | 171.2 | 112% |
Some of these results may be surprising.
The Load-Multiple routine offers only an 11% improvement, despite executing far fewer instructions, including fewer branches. The gain is limited because the processor achieves a 100% instruction cache hit ratio, so instructions can be fetched as quickly as they can be executed. Branch prediction also works efficiently in this example, negating the cost of the extra branches in the word-by-word loop. Finally, the merging write buffer means that the memory system sees burst writes even from single-word stores, much as it does for the STM instructions.
The NEON memory copy routine has some benefits that are not visible in the performance of the copy itself. Firstly, the loop executes without corrupting the contents of many ARM (integer) registers. This would matter more for a small memory copy, where the overhead of stacking and restoring those registers would be significant. The Cortex-A8 processor can also be configured (though it was not for these tests) to allocate NEON loads only into the level 2 cache; this would prevent the copy routine from replacing the existing data in the level 1 cache with data that will not be reused. However, the results show that the copy performance itself is the same as with ARM code.
A large gain comes from using the PLD instruction within the NEON copy loop. PLD allows the processor to issue a request to the memory system for data needed by a future iteration before it is actually required, meaning the memory controller can start accessing those locations early. The nature of SDRAM means that (with a suitable controller) having multiple requests to work on can hide the long latency of the first access in a burst. This routine is similar to that used by the ARM compiler when generating code for the Cortex-A8.
Factors affecting memory copy performance
Amount of data to be copied
Some implementations have a setup overhead, but then copy data more quickly. When copying a large block of data, the speed increase in the main loop of the function outweighs the extra time spent in the setup; for a small amount of data this may not be the case. One example is stacking a large number of registers at the beginning of the function, to allow the main loop to use LDM and STM instructions with many registers.
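In C this trade-off might look like the sketch below, where the 128-byte threshold and the big_copy routine are illustrative assumptions rather than measured values:

    #include <stddef.h>

    /* Hypothetical routine with setup cost: stacks registers, checks
       alignment, then runs a fast LDM/STM or NEON main loop. */
    extern void big_copy(void *dst, const void *src, size_t n);

    static void copy_sized(void *dst, const void *src, size_t n)
    {
        if (n < 128) {
            /* Small copy: a plain byte loop avoids all setup overhead. */
            unsigned char *d = dst;
            const unsigned char *s = src;
            while (n--)
                *d++ = *s++;
        } else {
            big_copy(dst, src, n);  /* setup cost amortized over a large copy */
        }
    }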
Alignment
Even with the unaligned access support introduced in ARMv6, the ARM architecture gives better performance when loading quantities that are word aligned. There are also coarser alignment granularities that can assist performance: a load multiple instruction on the Cortex-A8 can load two registers per cycle from the level 1 cache, but only if the address is 64-bit aligned. Cache behaviour (discussed below) can also affect the performance of data accesses depending on their alignment relative to the cache line size; on the Cortex-A8, both the level 1 and level 2 caches use 64-byte lines.
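A common way to exploit this is to copy byte by byte until the destination reaches a cache line boundary and then switch to a fast aligned loop. In the sketch below, aligned_copy is a hypothetical stand-in for such a loop, and handling of the trailing bytes is elided:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical main loop requiring a 64-byte-aligned destination. */
    extern void aligned_copy(void *dst, const void *src, size_t n);

    static void copy_align_head(unsigned char *dst, const unsigned char *src,
                                size_t n)
    {
        /* Peel off single bytes until dst sits on a 64-byte boundary. */
        while (n && ((uintptr_t)dst & 63)) {
            *dst++ = *src++;
            n--;
        }
        aligned_copy(dst, src, n & ~(size_t)63);  /* fast aligned main loop */
        /* The remaining (n & 63) tail bytes would be copied byte by byte. */
    }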
Memory characteristics
The performance of this function depends largely on the accesses to data memory. The inner loop is small, so instruction accesses cache well, and no calculations are performed on the data, so the processor's arithmetic units are not heavily loaded. Performance will therefore vary greatly with the speed of the memory. Certain types of memory system also perform better with some access patterns than others; most notably, SDRAM has a long latency for the initial access in a burst, but can supply the rest of the burst very quickly. With a suitable memory controller, the AXI bus allows multiple outstanding requests to be serviced by the SDRAM controller in parallel, hiding much of the effect of this initial latency. However, some code sequences make better use of this than others.
Cache usage
If a large amount of data is being copied, it is possible that this data will completely replace the existing data in the data caches. While this will have little effect on the performance of the memory copy itself, it may slow down subsequent code, leading to an overall drop in performance.
Code dependencies
A standard memcpy(), particularly with slow memory, leaves the processor unused for much of the time. It may therefore be more efficient to let the processor execute other code in parallel with the memory copy. There are several ways of doing this, but they only help if there is other work the processor can do which does not depend on the memory copy having completed.
Other methods to copy memory
The C function memcpy() is blocking: once it is called, it does not return until the memory copy is complete. This can leave the processor pipeline idle for many cycles while waiting for memory accesses. It is often more efficient to perform the memory copy in the background, allowing the processor pipeline to be used for other tasks. However, this is likely to be more complex to control, needing either to poll for completion or to handle an interrupt at completion. There are a few possible ways to achieve this:
Use a DMA engine
A DMA engine could copy the data either as one block or as a series of sub-blocks, allowing the processor to perform other tasks while the copy completes. Breaking the copy into sub-blocks would also let the processor use some of the copied data before the entire transfer is complete, as the sketch below illustrates.
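Here is an outline of the sub-block approach in C. The dma_start_copy() and dma_copy_done() functions are a hypothetical driver interface, not a real API, and a real system would perform useful work (or sleep until an interrupt) instead of spinning on the completion flag:

    #include <stddef.h>
    #include <stdbool.h>

    /* Hypothetical DMA driver interface. */
    extern void dma_start_copy(void *dst, const void *src, size_t n);
    extern bool dma_copy_done(void);

    #define SUB_BLOCK 4096  /* illustrative sub-block size */

    static void background_copy(unsigned char *dst, const unsigned char *src,
                                size_t n,
                                void (*consume)(unsigned char *, size_t))
    {
        while (n) {
            size_t chunk = n < SUB_BLOCK ? n : SUB_BLOCK;
            dma_start_copy(dst, src, chunk);
            while (!dma_copy_done())
                ;                    /* placeholder: overlap other work here */
            consume(dst, chunk);     /* this sub-block is already usable */
            dst += chunk;
            src += chunk;
            n -= chunk;
        }
    }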
Use the preload engine built into the Cortex-A8
The preload engine can be used to load one way of the level 2 cache with data. The processor can start this process, then perform other tasks until it receives an interrupt indicating that the load is complete. It can then start the preload engine storing the data back to memory, and again run other tasks while this completes.
Use another processor
For example, it is possible to use the DSP present on the OMAP 3530 to complete the memory copy, in a similar way to the DMA engine.