2009-12-12 10:32:27

 
Power reduction
1. Clock lines and registers take up no more than 10% of the logic, but can consume up to 50% of the power. The clock network switches at a fixed rate of twice the clock frequency, while other logic switches far less often; clock buffers also consume a substantial amount of power. Clock gating is therefore a very effective way to reduce power.
    On one hand, many registers have a very low activity rate, so gating them with their enable input is worthwhile; on the other hand, in a circuit as complex as a CPU, it is impossible to use the same control logic for every kind of register.
    The solution is to customize the cell: the gating logic becomes part of the register's internal circuit, so the benefit of gating is preserved while remaining transparent to standard-cell-based physical design. (20%~60%)
    Statistics show that the average enable activation rate is very low, so this method is worthwhile.
    In real applications, multiple registers can be coupled together to share the same enable input; a typical group length is 8, i.e. one byte.
    The area increases by the extra gating logic but shrinks by the eliminated recirculation multiplexers, so the net overhead is small. There is also a compatibility issue with this method.
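As a rough back-of-envelope sketch (all numbers below are illustrative assumptions, not measurements from the text), the payoff of sharing one gating cell among a group of registers can be modeled like this:

```python
# Rough model of clock-network power savings from clock gating.
# Group size, enable rate, and per-cell costs are illustrative assumptions.

def gated_clock_power(n_regs, group_size, enable_rate, p_clk_per_reg=1.0,
                      p_gate_cell=0.1):
    """Relative clock power when registers share one gating cell per group.

    enable_rate:   fraction of cycles a group's enable is active
    p_clk_per_reg: clock-pin power of one ungated register (normalized)
    p_gate_cell:   overhead of one integrated clock-gating cell
    """
    n_groups = n_regs // group_size
    # Gated registers toggle their clock pins only when enabled.
    reg_power = n_regs * p_clk_per_reg * enable_rate
    # Each gating cell still sees the free-running clock every cycle.
    gate_power = n_groups * p_gate_cell
    return reg_power + gate_power

ungated = 1024 * 1.0   # baseline: every register's clock pin toggles
gated = gated_clock_power(1024, group_size=8, enable_rate=0.2)
print(f"savings: {1 - gated / ungated:.1%}")
```

With a low enable activation rate the savings dominate the gating-cell overhead; with an always-on enable, gating only adds cost.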

2. An issue queue adopting the CAM+RAM architecture does not need to maintain and compact the queue frequently, so it is more likely to save power: an instruction simply finds an empty entry to wait in and issues when ready. Many studies [1,2,3,4] based on SimpleScalar+Wattch are not applicable to CPUs built from static circuits.
    The main power consumer is the comparison logic (40%): both the number of comparators and their switching frequency are high. [6] isolated the non-switching logic, but that requires large isolation circuitry.
    The solution is to statically reduce the number of comparators by characterizing instructions by their operand readiness. A two-operand instruction has three possible situations: one operand ready, both ready, or none. Statistics can give the average number of each type present in the issue queue per cycle (e.g. 4-6-6 or 4-4-8). Different kinds of issue queue entries are then allocated to the different types: instructions with both operands ready need no comparators, while instructions with one operand ready need only one comparator in their entry.
    Other characterization methods are also available.
    The problem with this method is determining when the queue is full: that signal is on the critical path and must be handled properly, and the scheme above adds new complexity. Instructions needing fewer comparators can fall back to entries with more comparators, but not vice versa.
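The readiness-based entry allocation above can be sketched as follows; the 4-6-6 split and the fall-back-to-more-comparators rule follow the text, everything else is an illustrative assumption:

```python
# Sketch of readiness-based entry allocation in a CAM+RAM issue queue.
# Entry classes are keyed by comparator count; an instruction may fall
# back to an entry with MORE comparators, never fewer. The 4-6-6 split
# is the illustrative distribution mentioned in the text.

FREE = {0: 4, 1: 6, 2: 6}   # comparators-per-entry -> free entry count

def allocate(unready_operands):
    """Return the entry class used, or None if no usable entry is free."""
    for cmps in range(unready_operands, 3):   # own class first, then larger
        if FREE[cmps] > 0:
            FREE[cmps] -= 1
            return cmps
    return None

# Both operands ready -> a 0-comparator entry suffices.
print(allocate(0))   # -> 0
# One operand outstanding -> needs at least one comparator.
print(allocate(1))   # -> 1
```

The "queue full" condition this creates is per-class: an instruction stalls when its own class and every larger class are exhausted, even if smaller-class entries remain free.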

3. The RAM in the issue queue contains all the fields that are not under comparison. The immediate field is the main contributor because of its large bit width. Statistics show it is safe to reduce the number of entries carrying immediate fields without harming performance (10).

4. Many studies exploit the internal circuit level of the register file; there are also architecture-level characteristics to exploit.
    Register read/write power consists of static leakage and dynamic switching. The former is relatively constant, determined by the technology and unrelated to switching activity. Dynamic switching power divides into read and write, and each contains word-line and bit-line components. Word-line decode power is fixed: every cycle the selected word line is charged to 1 and then discharged to 0 in the later half cycle. Bit lines are more complicated. For a read, the bit line must be precharged and is then discharged when a 1 is read out, so it has to be precharged again for the next access; read bit-line power therefore depends on the number of 1s read. For a write there is no precharge: switching power is consumed when the data being written differs from what that port wrote last time, and again when the new data differs from the data originally stored in the cell. Write bit-line power is the sum of these two parts.
    According to these characteristics, there are several ways to reduce power. First, we can reduce the number of enabled ports. For example, if an instruction takes an immediate operand, there is no need to open two register read ports as for other instructions; a simple check of the instruction reveals this and lowers the number of asserted enable signals (45.3%). For memory address generation, if the address is the same as last time, there is no need to open the enable at all and the output simply stays unchanged, so very little power is consumed (65.3%). Second, to reduce the number of 1s, we can minimize the 1s actually stored in the register: if the value about to be stored has more 1s than 0s, invert it and record that fact in a flag bit (20.4%, 21.9%; 5%).
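The second idea, storing the complement whenever a value has more 1s than 0s, can be sketched like this (the flag-bit encoding and the 64-bit width are assumptions for illustration):

```python
# Sketch of inverted storage to minimize the number of stored 1s.
# One extra flag bit per register records whether the value is inverted.

WIDTH = 64
MASK = (1 << WIDTH) - 1

def encode(value):
    """Store the complement, plus a set flag, if 1s outnumber 0s."""
    if bin(value).count("1") > WIDTH // 2:
        return (~value) & MASK, 1   # inverted payload, flag set
    return value, 0

def decode(stored, flag):
    """Recover the architectural value from the stored payload and flag."""
    return (~stored) & MASK if flag else stored

v = 0xFFFF_FFFF_FFFF_FF0F            # mostly 1s
stored, flag = encode(v)
assert decode(stored, flag) == v     # round-trips exactly
print(bin(stored).count("1"))        # -> 4 : few 1s actually hit the bit lines
```

The flag bit itself costs one extra cell per register, so the break-even point depends on how skewed typical values are toward 1s.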

5. Function units usually implement multiple functions, but each instruction needs only one, so power can be reduced by isolating the unused parts. Old methods... In addition, through searching we can find the input vector that minimizes the static power of a specific function unit, and park the idle unit on it. Combining these two methods lowers both dynamic and static power consumption.
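One common form of such isolation is operand isolation: latch each sub-unit's inputs and update only the latch of the sub-unit the instruction actually selects, so unused logic sees no input toggling. A toy behavioral model (the class, names, and toggle counting are my own illustration, not from the text):

```python
# Toy model of operand isolation in a multi-function unit: only the
# selected sub-unit's input latch is updated, so the other sub-unit's
# combinational logic never switches.

class IsolatedFU:
    def __init__(self):
        self.latched = {"add": (0, 0), "mul": (0, 0)}  # per-sub-unit input latches
        self.toggles = {"add": 0, "mul": 0}            # proxy for dynamic power

    def execute(self, op, a, b):
        if self.latched[op] != (a, b):
            self.latched[op] = (a, b)
            self.toggles[op] += 1       # only the used sub-unit switches
        return a + b if op == "add" else a * b

fu = IsolatedFU()
fu.execute("add", 3, 4)
fu.execute("add", 5, 6)
print(fu.toggles)   # -> {'add': 2, 'mul': 0}: the multiplier never toggled
```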

Performance improvement
1. The current CPU core has a 9-stage pipeline for a load instruction. From the reservation station to the tag compare stage, where the cache access result comes out, there are 4 stages, so the minimum load-to-use latency is 4 cycles. If there is an address dependence, the latency is even worse.
    Load speculation and speculative forwarding can reduce this latency. Very few load instructions actually depend on preceding store instructions, so when a load becomes ready to issue before an earlier store, we can speculatively allow the load to proceed without waiting for the store. (11%, 7%)
    Suppose there are 4 cycles from the reservation station to the tag compare unit; a dependent instruction in the reservation station would then have to wait 3 cycles (readreg, exe, readcache). If we speculatively notify it of a load hit, the instruction can proceed to the readreg stage and snoop the result there, removing one delay slot.
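A minimal timeline model of the speculative wake-up (stage names taken from the text; the cycle accounting is an assumed simplification):

```python
# Toy model: cycle at which a load's dependent may leave the reservation
# station, with and without speculative hit notification.

LOAD_STAGES = ["issue", "readreg", "exe", "readcache", "tagcmp"]

def dependent_issue_cycle(speculative):
    """Without speculation the dependent waits for the actual hit signal
    from tagcmp; with speculation it is woken one cycle early, so its
    readreg stage lines up with the load's result on the bypass network."""
    hit_known = LOAD_STAGES.index("tagcmp")          # cycle 4
    return hit_known - 1 if speculative else hit_known

print(dependent_issue_cycle(False))  # -> 4
print(dependent_issue_cycle(True))   # -> 3 : one delay slot removed
```

On a mispredicted hit the early wake-up must be squashed and the dependent replayed, which is the usual cost of this scheme.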

2. On a write miss, the CPU core issues a read request to the level-2 cache. But if the cache block is about to be modified in its entirety, there is no need to fetch the original block (about 50% of stores (SW) do not need the original data). So on a write miss the request is issued to the miss queue, the write operation is performed and finished there, and the entry waits to be assembled into a fully modified cache block, at which point it can be written back to the lower-level memory (non-write-allocate). When a read or a synchronization arrives, or the miss queue is full, a not-fully-modified entry is picked at random to fill the cache (write-allocate). Reads can be forwarded from entries in the miss queue without issuing a lower-level memory request. (5.9%)
    But this extra copy of modified data puts extra pressure on coherence maintenance: a write can now finish out of order with respect to other operations, and it becomes very hard to guarantee the original consistency model.
    These fully modified blocks are less likely to be reused, so it is tempting to collect them in the miss queue but less tempting to feed them back into the cache.
    On the other hand, write allocation has higher bandwidth utilization and is more locality-friendly.
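A behavioral sketch of the miss-queue write combining described above (the per-word valid bits, block size, and method names are assumptions for illustration):

```python
# Sketch of write combining in a miss queue: a write miss allocates an
# entry, later writes merge into it, a fully modified block can be
# written back without ever fetching the original line, and later reads
# can be forwarded from the entry.

BLOCK_WORDS = 8

class MissQueueEntry:
    def __init__(self, addr):
        self.addr = addr
        self.data = [None] * BLOCK_WORDS
        self.valid = [False] * BLOCK_WORDS   # per-word modified bits

    def write(self, word, value):
        """Merge a store into the entry; no refill request is issued."""
        self.data[word] = value
        self.valid[word] = True

    def fully_modified(self):
        """True when every word is written: write back, skip the refill."""
        return all(self.valid)

    def forward(self, word):
        """A later read hits the miss queue: forward the word if present."""
        return self.data[word] if self.valid[word] else None

e = MissQueueEntry(0x1000)
for w in range(BLOCK_WORDS):
    e.write(w, w * 10)
print(e.fully_modified())   # -> True
print(e.forward(3))         # -> 30
```

The coherence complication from the text shows up here too: while the entry sits in the queue, it is a second copy of the line that every other access must check.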

3. A large register file needs many ports, which significantly increases its area and latency. The decode circuit accounts for 50% of the total register file access latency, and the sense amplifiers for up to 30%. If the number of ports is reduced, the latency decreases roughly linearly. Splitting the register file and placing the copies close to the function units therefore lowers the latency, at some extra area cost, since two exact copies now have to be kept consistent.

4. Split the cache read stage from the tag comparison stage to shorten the cycle latency. Customize the RAM macro design to eliminate the decode step of the RAM access, removing the encode stage before reaching the RAM. This is easier for physical design, but not compatible with standard macros.


[1] B. Yu, R.I. Bahar. "A dynamically reconfigurable mixed in-order/out-of-order issue queue for power-aware microprocessors", IEEE Computer Society Annual Symposium on VLSI, pp. 139-146, 2003.
[2] J. Abella, A. González. "Low-complexity distributed issue queue", HPCA'2004.
[3] A. Buyuktosunoglu, S. Schuster, D. Brooks. "An adaptive issue queue for reduced power and high performance", Workshop on Power-Aware Computer Systems, November 2000.
[4] D. Folegnani, A. González. "Energy-effective issue logic", ISCA'2001.
[5] D. Ponomarev, G. Kucuk, K. Ghose. "Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources", ISCA'2001.
[6] D. Folegnani, A. González. "Energy-effective issue logic", ISCA'2001.
[7] J.L. Cruz, A. González, M. Valero, N.P. Topham. "Multiple-banked register file architectures", ISCA'2000.
[8] R. Balasubramonian, S. Dwarkadas, D.H. Albonesi. "Reducing the complexity of the register file in dynamic superscalar processors", MICRO'2001.
[9] A. González, J. González, M. Valero. "Virtual-physical registers", HPCA'1998.
[10] T. Monreal, A. González, M. Valero, J. González, V. Viñals. "Delaying physical register allocation through virtual-physical registers", MICRO'1999.
[11] S. Balakrishnan, G.S. Sohi. "Exploiting value locality in physical register files", MICRO'2003.
[12] M. Kondo, H. Nakamura. "A small, fast and low-power register file by bit-partitioning", HPCA'2005.
[13] A. Correale. "Overview of the power minimization techniques employed in the IBM PowerPC 4xx embedded controllers", International Symposium on Low Power Design, 1995.
