Basic Requirement
The TSO model relaxes the W->R ordering requirement. A read can get the latest value of a local write regardless of whether other CPU cores have observed it; a read can get the latest value of a remote write only once that write is globally visible. To guarantee this, the local load/store queue needs forwarding logic so that a read can pick up the written value, and the level2 cache has a state in which all reads are blocked while waiting for invalidations to return. For the rest, refer to "Normal write operations".
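As one concrete reading of the forwarding requirement, the following is a minimal C sketch of a store-queue lookup: a load first scans local pending stores, youngest first, and takes the newest matching data even before that store is globally visible. The structure and names are assumptions for illustration, not the Godson3 RTL.

```c
#include <stdbool.h>
#include <stdint.h>

#define SQ_DEPTH 8

typedef struct {
    bool     valid;
    uint32_t addr;
    uint32_t data;
} sq_entry_t;

/* Scan the store queue from youngest to oldest; on an address match,
 * forward the store data instead of reading the cache. */
bool forward_from_store_queue(const sq_entry_t sq[SQ_DEPTH],
                              int youngest, uint32_t load_addr,
                              uint32_t *data_out)
{
    for (int i = 0; i < SQ_DEPTH; i++) {
        int idx = (youngest - i + SQ_DEPTH) % SQ_DEPTH;
        if (sq[idx].valid && sq[idx].addr == load_addr) {
            *data_out = sq[idx].data; /* local write visible early: W->R relaxed */
            return true;
        }
    }
    /* no match: fall through to the cache/coherence path, where a remote
     * write is only seen once it is globally visible */
    return false;
}
```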
Coherence process
All miss requests are managed in the miss queue, which contains 8 entries; its state machine is shown in figure 1. All level2 cache requests are managed in the dirqueue, which also contains 8 entries; its state machine is shown in figure 2.
figure 1: miss queue state machine
When a miss request enters the miss queue, the entry moves from the EMPTY state to the MISS state and issues a request to the level2 cache; it then enters the SCREF state and waits there until the data returns. After getting the returned data, the entry enters the RDY state, where it refills the level1 cache; when that finishes, it returns to the EMPTY state. The direct transition between EMPTY and RDY is used by coherence requests.
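The life cycle of a miss queue entry can be written down as a small transition function. Below is a minimal C sketch; the states are the ones named in figure 1, while the event encoding is an assumption for illustration.

```c
typedef enum { EMPTY, MISS, SCREF, RDY } mq_state_t;

typedef enum {
    EV_L1_MISS,        /* new miss request allocated            */
    EV_REQ_TO_L2_SENT, /* read request issued to level2 cache   */
    EV_DATA_RETURN,    /* data back from level2 cache           */
    EV_REFILL_DONE,    /* level1 refill finished (dmemwrite)    */
    EV_COHERENCE_REQ   /* remote coherence request: EMPTY->RDY directly */
} mq_event_t;

mq_state_t mq_next(mq_state_t s, mq_event_t ev)
{
    switch (s) {
    case EMPTY: return ev == EV_L1_MISS        ? MISS
              : ev == EV_COHERENCE_REQ         ? RDY   : EMPTY;
    case MISS:  return ev == EV_REQ_TO_L2_SENT ? SCREF : MISS;
    case SCREF: return ev == EV_DATA_RETURN    ? RDY   : SCREF;
    case RDY:   return ev == EV_REFILL_DONE    ? EMPTY : RDY;
    }
    return s;
}
```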
figure 2: dirqueue state machine
1. Normal read operations
On a level1 cache miss, the CPU core issues a read request through the imemread or dmemread bus to the miss queue, which allocates an entry that moves from the EMPTY state to the MISS state. In the MISS state it issues a read request to the level2 cache through the crossbar; after the AXI response signal returns, it enters the scmiss state, where it waits for the data to come back. In the level2 cache, the scread bus carries level1 cache requests. The valid signal of the scread bus moves the dirqueue head from the empty state to the miss state, where the dirqueue issues a read request to the level2 cache (scache) ram and then proceeds to the scref state.
(1) If the scache hits and the directory state is clean, then no CPU core has ever modified the block, and it is safe to return the data. The dirqueue enters the dcrdy state, reaccesses the scache ram to modify the directory (adding the owner bit), and then sends the data back through the dirqres bus. When the miss queue receives the data, it enters the rdy state and refills the level1 cache. Several cycles later, the load/store queue notifies the miss queue through the dmemwrite bus, returning it to the empty state. The load/store queue then retires the instruction (note that it commits much earlier).
(2) If the scache hits and the directory state is dirty, the only difference is that coherence messages must now be sent to the level1 caches. This is done in the dcmod state: the dirqueue examines the directory read in the previous state and decides which CPU cores to send coherence messages to and which type to send. Typically, a read request sends a reqshare message to the exclusive copy, while a write request sends reqinv to shared copies or reqinvwtbk to the exclusive copy (this decision is sketched in code after case (3) below). The current configuration supports up to 16 cores; the dirqueue sends the coherence messages one by one and waits until all responses return. If any response is incorrect, the message is resent until either no copy exists or a correct response returns. On receiving a coherence request, a miss queue entry moves from the empty state to the rdy state and forwards it to the level1 cache through the refill bus. Several cycles later the response goes back to the miss queue from the load/store queue through the dmemwrite bus; the miss queue then returns to empty and sends a repXXX response to the scache. After receiving all correct responses, the dirqueue enters dcrdy and proceeds as in the clean case above.
(3) If the scache misses, the dirqueue enters the scref state. In this case, the dirqueue reads the set number, state, and directory of the block to be replaced out of the scache ram, and enters the rep_wtbk state to prepare the replacement (figure 3). At the same time, the dirqueue issues read requests to memory through the second-level crossbar to the DDR3 controller.
figure 3: replaced-block state machine (rep_wtbk / rep_rdy path)
In the rep_wtbk state, the dirqueue does the same thing as in dcmod: it sends invalidate or invalidate-writeback requests to the level1 caches in order to maintain the inclusion relation; the miss queue traverses the empty-rdy path to handle them. After that, the dirqueue gets the data from memory and enters the scrdy state. Here the dirqueue follows the three-step replace-refill-refill procedure to replace the original data, write the memory data into the scache, and modify the directory bits accordingly; it then enters the dcrdy state, where this path merges with the hit path and the data is sent back to the level1 cache.
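The branching in cases (1)-(3) and the dcmod message choice can be condensed into a small C sketch. The encodings and function names below are assumptions distilled from the text, not the actual Godson3 logic.

```c
#include <stdbool.h>

typedef enum { DQ_SCREF, DQ_DCMOD, DQ_DCRDY } dq_next_t;
typedef enum { REQSHARE, REQINV, REQINVWTBK } coh_msg_t;

/* (a) which state the dirqueue enters after the scache ram result */
dq_next_t dq_after_lookup(bool hit, bool dir_dirty)
{
    if (!hit)       return DQ_SCREF;  /* miss: fetch from memory, start
                                         the figure 3 replace path      */
    return dir_dirty ? DQ_DCMOD       /* hit dirty: coherence first     */
                     : DQ_DCRDY;      /* hit clean: safe to return data */
}

/* (b) message choice in dcmod: a read downgrades the exclusive owner;
 * a write invalidates shared copies and invalidate-writebacks the
 * exclusive one */
coh_msg_t dcmod_message(bool requester_is_write, bool copy_is_exclusive)
{
    if (!requester_is_write) return REQSHARE;
    return copy_is_exclusive ? REQINVWTBK : REQINV;
}
```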
2. Normal write operations
A store miss goes through a path similar to a load miss. Godson2 had a non-write-allocate optimization in which the miss queue assembled store misses into fully modified blocks and wrote them back to the level1 cache. This mechanism reduced unnecessary scache reads, but it can cause write-to-write order violations; to maintain the processor consistency model in Godson3, it is abandoned. Store-to-store order is therefore maintained by in-order retirement in the load/store queue. Store-to-load order is relaxed: even if a write is committed but not yet actually written back and a subsequent load has finished, the load need not be cancelled when the CPU core receives an inv or invwtbk to the load's address. Load-to-load order, however, is still maintained: if a wtbk, inv, or invwtbk arrives for the same address, the load and all subsequent instructions are cancelled.
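A minimal sketch of these ordering rules, assuming they reduce to one predicate in the load/store queue that decides whether a finished load must be cancelled when a coherence request to its address arrives; the parameter names are illustrative.

```c
#include <stdbool.h>

bool must_cancel_finished_load(bool older_load_still_pending,
                               bool older_store_committed_not_written)
{
    /* store-to-load order is relaxed: a committed-but-unwritten older
       store never forces the finished load to be cancelled */
    (void)older_store_committed_not_written;

    /* load-to-load order is kept: if an older load is still pending,
       the finished load and all younger instructions are cancelled */
    return older_load_still_pending;
}
```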
(1) If it hits, no matter whether the directory is clean or dirty, the dirqueue enters the dcmod state and makes sure no other level1 copies exist. This is similar to the load process. After all miss queues send back correct responses, the dirqueue enters the dcrdy state, where it modifies the scache directory through the scref bus and sends the data back. The only difference between clean and dirty is whether the response is share or exclusive.
(2) If it misses, the dirqueue traverses the scmiss path and returns an exclusive copy to the level1 cache.
DMA read/write operations
I/O consistency is maintained within the same framework. The DMA module simply sends requests through the crossbar to the level2 cache; the crossbar maintains ordering and the level2 cache maintains coherence.
A DMA read traverses the normal read path; in the scrdy state there is nothing to do, it is just an empty step. A DMA write that hits traverses the normal miss path, while one that misses traverses the normal hit path: a DMA write hit requires extra coherence processing, since the block read out is the block about to be modified. Here we treat it as a replaced block and traverse the state machine as figure 3 shows: the dirqueue sends reqinv or reqinvwtbk to all level1 cache copies, and after all responses return, the DMA engine communicates directly with the scache through the AXI protocol. It issues a reqinv request through the ar channel, and the dirqueue state machine selects the next state according to hit/miss. If it misses, nothing can be inconsistent: just write back the original block and take the data from the DMA queue. If it hits, reqinv or reqinvwtbk must be issued to the level1 caches before refilling. After the coherence messages are handled, the DMA refills the data in two aw-channel transactions; the CPU cores are not involved in these two phases.
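A sketch of this inverted path selection (which, per the cache instruction section below, is reused there as well): a hit needs the figure 3 replace-style coherence handling, a miss does not. The names are assumptions for illustration.

```c
#include <stdbool.h>

typedef enum { PATH_PLAIN_REFILL, PATH_REPLACE_COHERENCE } dma_path_t;

dma_path_t dma_write_path(bool scache_hit)
{
    if (scache_hit)
        /* level1 copies may exist: treat the block as a replaced one,
           send reqinv/reqinvwtbk, then do the two aw-channel refills */
        return PATH_REPLACE_COHERENCE;
    /* miss: nothing can be inconsistent; write back the victim block
       and take the new data from the DMA queue */
    return PATH_PLAIN_REFILL;
}
```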
For the last two types (normal/DMA), the block replaced in the level1 cache enters the miss queue through the dmemwrite bus and traverses the left part of figure 1. The dmemwrite bus is also responsible for notifying the miss queue that the instruction has finished, so that the entry in the miss queue can retire.
3. Cache instructions
Cache instructions fall into 3 types: inv/wtbkinv; load data/tag; and store data/tag. A cache instruction traverses the same path as a DMA write: a hit goes through the miss path while a miss goes through the hit path.
For the first type, the miss queue issues the request to the scache and enters the scref state. If it misses in the scache, there is nothing more to do beyond some cleanup. If it hits, it goes through the DMA hit path, invalidating or writing back and invalidating the block in all caches: figure 3 deals with the level1 cache copies, while the scrdy state deals with the scache copy. There is one difference here. When sending a coherence message, the miss queue entry enters the rdy state from the scref state, sends the request to the level1 cache through the refill bus, receives the response back from the load/store queue, and enters the empty state after responding to the level2 cache. If the response does not match, this process may repeat until it matches. After that, the block (treated as a replaced block) enters the rep_rdy state and sends the request (inv or wtbk_inv) to the scache ram. When this finishes, the dirqueue notifies the level1 cache and enters the empty state. The miss queue receives the notification and enters rdy from the empty state; here it notifies the load/store queue to retire the instruction and returns to empty after receiving the response.
For the second type, the miss queue traverses the normal read path. For the third type, the write goes to the level2 cache through the dmemwrite bus and the write buffer. Both are similar to normal load/store instructions.
Conflict processing
Because coherence handling is spread across several distributed queues, it can momentarily become inconsistent. The general solution is to resend the coherence request to the level1 cache until it returns a response with the correct state.
1. Replaced block has not yet arrived at the level2 cache
In this case, a level2 cache coherence request will receive a response with an incorrect state. For example, if the replaced block is on its way to the scache and another core requests this block, a reqread is sent to the original core. The response will not be the expected one, since the original core no longer holds the block but the directory has not yet been modified. In this case, the request is resent until the replaced block arrives.
This case covers situations in which the replaced cache block is requested by another CPU core or by a cache coherence request such as inv, invwtbk, or wtbk, or by an scache replacement (inclusion); all of them are handled by resending the request until the replaced block arrives at the level2 cache and the directory is modified.
2. Refilled block has not yet arrived at the requesting core
If there are outside requests for this block, the crossbar ensures that they do not overtake the refill, so the data can be refilled into the level1 cache and then processed like any other block. If the requesting core asks for it again, the later request is merged with the original one.
If the above two cases combine, that is, an scache replacement causes a level1 cache replacement and the replacement overtakes a refill of the same block, the scache resends the request until the refill arrives and a correct response is returned.
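The resend rule shared by both cases can be sketched as a self-contained loop: the dirqueue keeps reissuing the coherence request until the response carries the expected state, i.e. until the in-flight replaced or refilled block has landed and the directory view is consistent again. The simulated three-try latency below is an assumption for illustration.

```c
#include <stdint.h>

typedef enum { RESP_STALE, RESP_MATCH } resp_t;

/* Stand-in for the level1 cache: replies with a stale state while the
 * block is in flight, then matches the directory's expectation. */
static int in_flight = 3;
static resp_t send_coherence_and_wait(uint32_t addr)
{
    (void)addr;
    return (in_flight-- > 0) ? RESP_STALE : RESP_MATCH;
}

void resolve_conflict(uint32_t addr)
{
    /* resend until the correct-state response returns */
    while (send_coherence_and_wait(addr) != RESP_MATCH)
        ;
}
```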
The other two rules:
Accesses to the same scache index are serialized. Only when one of them exits the dirqueue can another be processed; the others are all blocked in the miss state.
Accesses to the same address are merged in the miss queue: a write consumes a read but keeps the SHARE-state request; when refilling, if the refilled data is SHD but a store operation is pending, the request is reissued. Other mismatches are dealt with by resending coherence messages. Conflicts come in many varieties, but the 4 rules discussed above cover all of them.
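The two rules just stated can be sketched as a pair of checks; the types and field names below are assumptions, not the RTL.

```c
#include <stdbool.h>
#include <stdint.h>

#define DQ_DEPTH 8

typedef struct { bool active; uint32_t set_index; } dq_slot_t;

/* rule: block a new request while another with the same scache index is
 * still in the dirqueue (it waits in the miss state until that exits) */
bool same_index_blocked(const dq_slot_t dq[DQ_DEPTH], uint32_t set_index)
{
    for (int i = 0; i < DQ_DEPTH; i++)
        if (dq[i].active && dq[i].set_index == set_index)
            return true;
    return false;
}

/* rule: a store merged into a SHARE request cannot be satisfied by a
 * SHD refill, so the request must be reissued for an exclusive copy */
typedef enum { REFILL_SHD, REFILL_EXC } refill_kind_t;

bool must_reissue(bool store_merged, refill_kind_t refill)
{
    return store_merged && refill == REFILL_SHD;
}
```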
Major cache coherence maintenance is discussed in the "Normal write operations" section.
Implementation issues
The miss queue and the dirqueue are the two major parts of the whole framework; they are responsible for all coherence communication between the CPU cores and the scache.
The dirqueue communicates with the scache ram through the scacheref and scacheres buses, which send scache requests and return scache results respectively; it also communicates with the miss queue (through the crossbar) via the dirref and dirres buses, which carry coherence requests and send requested data back to the miss queue; smemread and scache_refill send memory read requests and receive memory data respectively.
The dirqueue may send out reqread and reqwrite for normal read/write operations, op_replace and op_refill operations, and op_lookup(i) and op_moddir operations for coherence handling. The dirqueue needs 4 write ports and 4 read ports:
1w: requests from the CPU cores, such as level1 cache misses and level2 cache instructions; when these requests enter the dirqueue, the entry enters the miss state.
2w: memory return data; this only occurs after replaced blocks are in the rep_rdy state.
3w: information returned by the scache ram. On an scache miss, the scache ram sends the replaced block's set number and directory to the dirqueue; on a hit, when the directory is modified, the scache ram sends the data to the dirqueue and the entry enters the dcrdy state.
4w: all types of level1 cache coherence responses. In the hit case (dcmod state), once all responses return, the dirqueue enters the dcrdy state; in the miss case, the replaced block collects all responses and enters the rep_rdy state; cache instructions also need to send coherence requests to the level1 cache (rep_cnt counts the returned responses).
The first read port selects a dirqueue entry in the miss state and issues requests to the scache ram, after which the entry enters the scref state and waits for the scres bus to become valid; or it selects an entry in the scrdy state together with a replaced block in rep_rdy, and issues a replace request to the scache ram to replace and refill. The second read port issues memory read requests to the memory controller. The third read port issues cache coherence requests to the level1 caches; it selects entries in the dcmod or scmiss state and replaced blocks in the rep_wtbk state, and waits for all correct responses. The fourth read port selects dirqueue entries in the dcrdy state to send data back to the level1 cache.
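As a rough summary of the four read ports, a condensed C sketch of the state-based selection; the state encoding and the port-to-state mapping are assumptions distilled from the text above, not the RTL.

```c
#include <stdbool.h>

typedef enum { S_MISS, S_SCREF, S_SCMISS, S_DCMOD, S_DCRDY, S_SCRDY,
               S_REP_WTBK, S_REP_RDY } dq_state_t;

/* port 1: scache ram access (lookup, or replace+refill paired with a
   rep_rdy victim block) */
bool port1_selects(dq_state_t s) { return s == S_MISS || s == S_SCRDY; }
/* port 2: memory read request after an scache miss */
bool port2_selects(dq_state_t s) { return s == S_SCREF; }
/* port 3: coherence requests to the level1 caches */
bool port3_selects(dq_state_t s) { return s == S_DCMOD || s == S_SCMISS
                                       || s == S_REP_WTBK; }
/* port 4: send data back to the level1 cache */
bool port4_selects(dq_state_t s) { return s == S_DCRDY; }
```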
The miss queue communicates with the CPU core through: imemread and dmemread for icache and dcache misses; refill for level1 cache refill and replace; and dmemwrite for data write-back or for notification that a coherence request has finished. The scacheread bus reads the level2 cache and scachewrite replaces or writes data back to the level2 cache; dirres and dirref receive data and coherence requests from the level2 cache.
The first write port of the miss queue takes level1 cache misses from imemread or dmemread. Only when more than 1 entry is empty can a new miss request enter the miss queue; this design avoids deadlock. Accesses to the same address are merged and issue only one request (for write operations merging with read operations, see the discussion above). In this case the miss queue enters the miss state. The second write port handles remote coherence requests: whenever there is an empty entry, such a request can enter and turn the entry from empty to rdy; the entry then sends the request through the refill bus to the level1 cache and waits for the response from the load/store queue through the dmemwrite bus. The third write port receives data from outside and starts the replace and refill of the level1 cache. The fourth write port receives notifications from the dmemwrite and imemwrite buses that a cache coherence operation or data refill has finished, after which the miss queue entry returns to the empty state. The first read port issues read requests to the level2 cache and enters the scref state. The second read port selects an entry in the rdy state to refill the level1 cache; this is a three-step procedure, issuing an OP_REPLACE followed by OP_REFILL operations, and when dmemwrite receives the last refill-finish signal, the miss queue entry returns to the empty state. To avoid deadlock, the last refill must be separated from the replace operations, and refills must be serialized. For outside requests, it issues a lookup(i) request to the level1 cache and several cycles later issues wtbk(inv) to the set read out above.
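The deadlock-avoidance rule for allocation can be stated as two tiny predicates: a new level1 miss needs more than one free entry, so a slot is always left for remote coherence requests. A minimal sketch, assuming only the free-entry count matters; names are illustrative.

```c
#include <stdbool.h>

#define MQ_DEPTH 8  /* the miss queue contains 8 entries */

bool accept_new_l1_miss(int free_entries)
{
    return free_entries > 1;  /* keep one entry free for coherence traffic */
}

bool accept_remote_coherence(int free_entries)
{
    return free_entries > 0;  /* any empty entry may be used */
}
```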