A Task-centric Memory Model for Scalable Accelerator Architectures
John H. Kelm, Daniel R. Johnson, Steven S. Lumetta, Matthew I. Frank, and Sanjay J. Patel
Interesting points from the paper
Compute accelerators have "less stringent requirements of system software, are constrained by the need for low-overhead work dispatch, and are less beholden to legacy code". These applications execute in bulk-synchronous intervals; within an interval, most shared communication is read-only, and only a small fraction of shared accesses are writes. Most writes need to be made globally visible only at the end of the interval.
They use a task-centric software approach to provide the illusion of a single global address space while keeping most communication on chip. They also provide special operations for globally coherent memory. This software management layer interacts with the hardware-managed caches, allowing fine-grained sharing and exploiting the locality the caches provide.
Hardware cache coherence, which supports arbitrary sharing, has marginal utility here: communication is infrequent inside the bulk of an interval and is mostly needed at interval boundaries. Of course, some data-sharing mechanism is still required. So the accelerator architecture can maintain a weak memory consistency model, relying on explicit global memory operations and the task-based programming model to provide the necessary coherence guarantees.
1. Large amounts of immutable, read-shared data are present within an interval.
2. Synchronization is coarse-grained.
3. There exists only small amounts of write-shared data within an interval.
4. Fine-grained synchronization is present but rare, and is used for task management, not for application code.
5. Write sharing is limited to a few sharers.
Memory Model
Rigel uses a memory model different from both the traditional cache-based model and the streaming model [1]. It is incoherent and software-managed. The model defines local loads, local stores, global loads, global stores, writebacks from the cluster cache to the global cache, and invalidations of the cluster-cache copy, similar to Munin.
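To make the operation set concrete, here is a toy simulation of these six operations. This is my own sketch, not code from the paper: the function names (`local_load`, `writeback`, etc.) are illustrative stand-ins for what the real ISA encodes as distinct load/store instruction variants, and the two dictionaries stand in for one cluster's cache and the shared global cache.

```python
global_cache = {}   # stand-in for the shared, globally coherent cache
cluster_cache = {}  # stand-in for one cluster's (incoherent) cache

def local_load(addr):
    """Hits the cluster cache; may return stale data after a remote write."""
    if addr not in cluster_cache:
        cluster_cache[addr] = global_cache.get(addr)  # fill on miss
    return cluster_cache[addr]

def local_store(addr, value):
    """Writes only the cluster cache; not yet globally visible."""
    cluster_cache[addr] = value

def global_load(addr):
    """Bypasses the cluster cache; always sees the coherent global copy."""
    return global_cache.get(addr)

def global_store(addr, value):
    """Writes the global cache directly."""
    global_cache[addr] = value

def writeback(addr):
    """Explicit coherence action: push a dirty cluster copy to the global cache."""
    global_cache[addr] = cluster_cache[addr]

def invalidate(addr):
    """Explicit coherence action: drop the (possibly stale) cluster copy."""
    cluster_cache.pop(addr, None)
```

A producer makes its writes visible with `local_store` + `writeback`; a consumer must `invalidate` before `local_load`, otherwise it may read a stale cluster copy.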
The possible states for a memory block are clean (the initial state), immutable, private, and globally coherent. Clean is the initial state and the intermediary for all other state transitions, except the transition between private dirty and private clean, which is handled implicitly by hardware.
There is no need to guarantee ordering for clean blocks; private blocks need only respect program order; globally coherent blocks conform to processor consistency.
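The state machine above can be sketched as a transition table. This is my reconstruction from the text, not the paper's formal definition: every transition passes through clean, except the private clean/dirty pair that hardware handles implicitly.

```python
CLEAN, IMMUTABLE, PRIVATE_CLEAN, PRIVATE_DIRTY, GLOBAL = (
    "clean", "immutable", "private_clean", "private_dirty", "global")

# Clean is the intermediary for all transitions; the private
# clean <-> dirty pair is the implicit hardware-managed exception.
ALLOWED = {
    (CLEAN, IMMUTABLE), (CLEAN, PRIVATE_CLEAN), (CLEAN, GLOBAL),
    (IMMUTABLE, CLEAN), (PRIVATE_CLEAN, CLEAN),
    (PRIVATE_DIRTY, CLEAN), (GLOBAL, CLEAN),
    (PRIVATE_CLEAN, PRIVATE_DIRTY), (PRIVATE_DIRTY, PRIVATE_CLEAN),
}

def transition(state, target):
    """Check a block-state transition against the reconstructed table."""
    if (state, target) not in ALLOWED:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```

For example, a block cannot go from immutable directly to globally coherent; software must first return it to clean.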
"The memory model defines that reads to private blocks followed by writes to globally coherent blocks from a single core respect program order. Reads to immutable or globally coherent blocks followed by writes to private blocks from a single core respect program order" (do not understand...)
........
"The lack of hardware-managed coherence inhibits nonbinding hardware prefetching at the cluster cache". The reason is that prefetch does not modify the directory, so there is no way for further invalidation.
Hardware prefetching at the global cache provides a large benefit in the best case and rarely hurts performance measurably in the worst case.
The model is under software control, so software can deploy a mix of coherence policies.
"First, eager writebacks overlap write traffic with useful execution and should be used as much as possible to increase memory system concurrency. The coherence actions result in less bursty load on the interconnect, increasing performance." "Second, lazy invalidation allows for shared read-input data to be exploited opportunistically when two tasks share read values and execute on the same core, or in the same cluster on Rigel, during an interval." (data...)
[1] Jacob Leverich et al. "Comparing Memory Systems for Chip Multiprocessors." ISCA '07.