2009-12-08 21:49:40

 
   The DMA module is responsible for direct communication between I/O and the level-2 cache. Because the level-2 cache already maintains consistency, the DMA module only needs to send proper requests through the crossbar to the level-2 cache. The major part of DMA verification lies in I/O consistency verification, which is tightly coupled with the L2 cache operation; the verification needs to cover all possible types of communication between them to ensure correct functionality. As the module communicates with the level-2 cache (through the crossbar, as the figure shows) by the AXI protocol, this test is also a verification of AXI, especially of the cache-coherence enhancement on the original AXI protocol.

DMA read/write operations
  DMA read is the path that allows data to flow from I/O to the crossbar. The DMA queue has 8 entries and can communicate with the crossbar by any protocol such as AXI. When there is a read request, the DMA module makes sure it does not exceed 256 bits and is 256-bit aligned; then it sends the request to the crossbar. A 512-bit access is allowed for an I/O device, but the DMA module splits it into 2 requests.
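Assuming byte addressing (256 bits = 32 bytes), the splitting rule can be sketched in Python; `split_dma_read` is a hypothetical name, not the actual RTL interface:

```python
def split_dma_read(addr, nbytes, beat=32):
    """Split a DMA read into requests that are at most 32 bytes
    (256 bits) long and never cross a 32-byte boundary, as the DMA
    module does before forwarding them to the crossbar (sketch)."""
    reqs = []
    while nbytes > 0:
        # Bytes remaining inside the current 32-byte aligned window.
        chunk = min(nbytes, beat - (addr % beat))
        reqs.append((addr, chunk))
        addr += chunk
        nbytes -= chunk
    return reqs

# A 512-bit (64-byte) aligned request is split into 2 requests.
assert split_dma_read(0x1000, 64) == [(0x1000, 32), (0x1020, 32)]
# An unaligned request crossing a boundary is also split in two.
assert split_dma_read(0x1010, 32) == [(0x1010, 16), (0x1020, 16)]
```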
  DMA write needs to send out a coherence request first, because the write must invalidate the level-1 cache copies. The DMA write first sends this coherence request through the read address channel (AR) and waits for the response on the R channel; after that it sends the request through the normal write channels.
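The two-phase write sequence can be sketched as an ordered list of channel transactions (a model only; the payload fields are illustrative, and the request/reply names follow the tables below):

```python
def dma_write_sequence(addr, data):
    """Yield (channel, payload) pairs in protocol order: the
    coherence pre-request on the read channels, then the normal
    write.  Channel names follow AXI (AR/R/AW/W/B); payload
    fields are hypothetical."""
    yield ("AR", {"type": "Prewrite", "addr": addr})  # invalidate L1 copies
    yield ("R",  {"type": "Preresp",  "addr": addr})  # wait for the response
    yield ("AW", {"type": "Dmawrite", "addr": addr})  # normal write address
    yield ("W",  {"data": data})                      # write data
    yield ("B",  {"type": "Wresp"})                   # write response

channels = [ch for ch, _ in dma_write_sequence(0x2000, b"\x00" * 32)]
assert channels == ["AR", "R", "AW", "W", "B"]
```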

 
Request of AXI Master and Replies of AXI Slave

| Reason                                | Request    | AXI Channel | Reply      | AXI Channel |
|---------------------------------------|------------|-------------|------------|-------------|
| L1 cache miss of load or fetch        | Reqread    | AR          | Repread    | R           |
| L1 cache miss of store                | Reqwrite   | AR          | Repwrite   | R           |
| Replacing L1 cache                    | Reqreplace | AW+W        | Repreplace | B           |
| DMA read request                      | Dmaread    | AR          | Repread    | R           |
| DMA write maintaining cache coherence | Prewrite   | AR          | Preresp    | R           |
| DMA write request                     | Dmawrite   | AW+W        | Wresp      | B           |
Request of AXI Slave and Replies of AXI Master

| Reason                                 | Request    | AXI Channel | Reply      | AXI Channel |
|----------------------------------------|------------|-------------|------------|-------------|
| Invalidating L1 cache                  | Reqinv     |             | Repinv     | AW+W        |
| Writing back L1 cache                  | Reqwtbk    | R           | Repwtbk    | AW+W        |
| Invalidating and writing back L1 cache | Reqinvwtbk | R           | Repinvwtbk | AW+W        |

  Detailed information about DMA coherence is described in the section on the implementation of the consistency model.

Verification method
  In our CPU, the HT module contains the DMA module; in other words, the HT module is the high-speed I/O interface, DMA function included, as shown in the figure below. As described above, the DMA is responsible for communication between I/O and the level-2 cache. Using the HT_BFM (Bus Functional Model) module to generate DMA read/write requests, the verification is divided into 3 types: sequential tests, unaligned tests, and random tests. The sequential test measures the peak read/write bandwidth: I send out alternating read/write requests over a sequential address range, maximizing the transfer rate of the DMA module. The unaligned test exploits the fact that accesses that are not 256-bit aligned are split into 2 accesses, which increases the amount of traffic and tests correctness. The random test imposes further pressure on both correctness and transfer rate, as the BFM module randomly generates requests to different addresses with different types and lengths. For all three types, the DMA sends data to or receives data from the level-2 cache, while the CPU acts as the counterpart. When the DMA finishes a write operation, it notifies the CPU by issuing an interrupt; the CPU then responds to the interrupt and reads in the data. When the CPU finishes a write operation, it writes the GPIO to notify the DMA module to read the data. They operate on the same address range alternately.
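A minimal sketch of the three stimulus kinds, assuming byte addresses and a 32-byte (256-bit) alignment unit; `gen_requests` and its fields are hypothetical, not the HT_BFM interface:

```python
import random

def gen_requests(kind, base=0x0, n=8, seed=0):
    """Generate (op, addr, nbytes) stimuli for the three test
    types: 'sequence' (peak bandwidth, alternating read/write on a
    sequential aligned range), 'unaligned' (accesses that straddle
    32-byte boundaries and so get split), and 'random'."""
    rng = random.Random(seed)  # fixed seed keeps the random test replayable
    reqs = []
    for i in range(n):
        if kind == "sequence":
            reqs.append(("read" if i % 2 == 0 else "write",
                         base + i * 32, 32))
        elif kind == "unaligned":
            # Offset by 16 bytes so every access crosses a boundary.
            reqs.append(("read", base + i * 32 + 16, 32))
        else:  # "random": random type, address, and length
            reqs.append((rng.choice(["read", "write"]),
                         base + rng.randrange(0, 1 << 20),
                         rng.choice([4, 8, 16, 32, 64])))
    return reqs
```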
   However, for the random test, I need to record every randomly generated address and datum for later checking. The HT_BFM does not inherently support this, so I attached a small RAM to the DMA bus and configured the CPU to assign it an address range. Whenever a request is sent out from the DMA module, the information is loaded into the small RAM before it goes to the level-2 cache. In the checking module, both the cache and the small RAM are read out to compare the results, as the figure below shows. Doing this requires adding the small RAM to the system address space. It is easy to find an unused address range; to use it, the system must be reconfigured through the Config Register module shown at the bottom of the figure below. After the address range is opened, it is assigned to the HT module, so when the CPU sends a request in this range, it is directed to the HT port (remember that the DMA is inside the HT controller module). The RAM is attached to the bus like a bus monitor: it intercepts the request and consumes it, preventing the request from being further directed to the HT module. Thus the operations described above can be implemented.
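The shadow-RAM checking scheme can be sketched as follows, with Python dicts standing in for the L2 cache and the small RAM (all names are hypothetical):

```python
class ShadowRam:
    """Stand-in for the small RAM attached to the DMA bus; it
    records every request before it reaches the L2 cache."""
    def __init__(self):
        self.mem = {}
    def record(self, addr, data):
        self.mem[addr] = data

def dma_write(l2_cache, shadow, addr, data):
    shadow.record(addr, data)  # mirrored copy kept for checking
    l2_cache[addr] = data      # the real request goes on to L2

def check(l2_cache, shadow):
    """Read out both the cache and the shadow RAM and compare."""
    return all(l2_cache.get(a) == d for a, d in shadow.mem.items())

l2, sh = {}, ShadowRam()
dma_write(l2, sh, 0x100, 0xAB)
assert check(l2, sh)
```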
  The process above only tests operation in normal cases. To further test the correctness of the coherence protocol, every type of test is combined with CPU disturbance, which puts more pressure on the level-2 cache. There are 4 CPU cores in the system: one is responsible for communication with the DMA, while the others send requests to the same cache index to cause cache replacement. According to the cache configuration, there are 4 level-2 cache blocks of 1MB each; each block is 4-way set-associative with 32-byte cache lines. It follows that addresses with the same bits 5 to 19 are placed in the same block and the same set.
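Given 32-byte lines, addresses that agree in bits 5 to 19 compete for the same 4-way set, so issuing more than 4 of them forces replacements. A sketch of how the disturbance cores' conflicting addresses could be generated (`conflict_addresses` is a hypothetical helper):

```python
def conflict_addresses(base, count):
    """Return `count` addresses that agree with `base` in bits
    5..19 (block select + set index) and differ only in the tag
    bits above bit 19.  The 5 offset bits are zeroed for
    simplicity; all results map to the same 4-way set."""
    index = base & 0x000FFFE0          # keep bits 5..19, drop offset/tag
    return [index | (tag << 20) for tag in range(count)]

addrs = conflict_addresses(0x12345, 5)
# All five share bits 5..19, so they compete for the same 4-way set
# and the fifth access must evict one of the first four.
assert len({a & 0x000FFFE0 for a in addrs}) == 1
```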


Verification result:
  (1) found a DMA bug at a late design stage;
  (2) found that the HT module was not working at full capacity, which led to further optimization of the HT controller.

Some thinking on DMA
  Traditionally, DMA is used to improve the I/O performance of a system; with DMA, I/O processing can be overlapped with CPU computation. The current DMA has sufficient bandwidth and acceptable latency for I/O operation; it is already fast enough. It queues up I/O requests and communicates directly with the level-2 cache, and it interrupts the CPU only when necessary. Coherence maintenance is performed only in the level-2 cache and the crossbar, so the DMA itself is relatively simple. This is sufficient for an I/O accelerator.
  But DMA is a far broader mechanism. For stream programming, the DMA should take on more responsibility. It should organize CPU access requests more effectively: a stream program may request data of arbitrary length, since the computation kernel can be arbitrarily complex. For example, the Cell processor supports data blocks of up to 16KB; the DMA engine should be able to manage these types of access rather than operate only on cache lines. On Cell, the load and store instructions can only operate on local memory, so the DMA should be fully cache coherent. It should also support gather/scatter operations, which can gather randomly distributed accesses into one DMA operation. Thus the scheduling algorithm for the DMA is critical. Note that on such a platform we are more interested in exploiting data-level parallelism, and several stream scheduling algorithms have been proposed; traditional scheduling algorithms that focus on instruction-level parallelism, such as space-time scheduling or list scheduling, help less here. But within the framework of stream scheduling, such ILP scheduling can be incorporated to enhance performance inside a kernel.
