1, A brief overview of the cache mechanism
1.1 what is direct mapped / fully associative cache / N-way set associative?
The cache is subdivided into subsets of lines.
A cache line is the smallest unit of data transferred between the slow
off-chip DRAM and the fast on-chip CPU cache; the transfer is usually
done in burst mode.
1), At one extreme, the cache can be direct mapped, in which case a line
in main memory is always stored at the exact same location in the cache.
2), At the other extreme, the cache is fully associative, meaning that
any line in memory can be stored at any location in the cache.
3), Most caches are to some degree N-way set associative, where any line
of main memory can be stored in any one of N lines of the cache. For
instance, a line of memory can be stored in two different lines of a
two-way set associative cache.
(The whole cache consists of multiple sets, and each set contains N cache
lines; those N lines are the N "ways".)
A direct mapped cache is prone to cache-line replacement (conflict misses),
because a given memory line can live in only one place in the cache. A
fully associative cache is theoretically optimal, but hard to implement:
every cache line needs its own tag comparator. Practical caches are
therefore mostly N-way set associative, which needs only N parallel tag
comparators for the single set selected by the index.
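The relationship between the three organizations shows up in the set count. A minimal sketch, assuming illustrative figures (32 KiB cache, 64-byte lines; neither number comes from the text):

```c
#include <assert.h>

/* sets = cache_size / (ways * line_size); direct mapped is the
 * ways == 1 case, fully associative is ways == total-number-of-lines. */
static unsigned num_sets(unsigned cache_bytes, unsigned ways, unsigned line_bytes)
{
    return cache_bytes / (ways * line_bytes);
}
```

For a 32 KiB cache with 64-byte lines: 4 ways gives 128 sets, 1 way (direct mapped) gives 512 single-line sets, and 512 ways (fully associative) collapses everything into one set.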
1.2 How is a memory address mapped to the cache?
1.2.1 The basic mapping mechanism
A memory address is divided into the fields: tag + index + offset_in_line.
The index selects the set within the cache, normally by a modulo operation:
set_no = index MOD (number of sets in the cache)
Once the set is known, the address's tag field is compared against the tag
of every way in that set (this comparison is implemented in parallel in
hardware). If a match is found, it is a cache hit and the matching cache
line holds the data for that memory address; otherwise it is a cache miss.
On a hit, the offset_in_line field locates the data within the cache line;
on a miss, the data must first be fetched from DRAM into the cache.
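The field split above can be sketched in C. The bit widths here (64-byte lines, 128 sets, 32-bit addresses) are assumptions chosen for illustration, not values from the text:

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 6   /* 2^6 = 64-byte line (assumed)  */
#define INDEX_BITS  7   /* 2^7 = 128 sets     (assumed)  */

static uint32_t line_offset(uint32_t addr)  /* offset_in_line field */
{
    return addr & ((1u << OFFSET_BITS) - 1);
}

static uint32_t set_index(uint32_t addr)    /* index field -> set_no */
{
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

static uint32_t addr_tag(uint32_t addr)     /* tag field, compared per way */
{
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

Because the set count is a power of two, the MOD in the formula above reduces to masking out INDEX_BITS bits, which is how real hardware does it.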
1.2.2 Implementation issues
1) Physical address vs. virtual address
Should the cache be looked up with the physical address or the virtual
address of the memory access?
virtual address --- not unique:
. Multiple processes can use the same address space, so we'll need to
include a field identifying the address space in the cache tag to make
sure we don't mix them up.
. The same physical location may be described by different addresses in
different tasks. In turn, that might lead to the same memory location
being cached in two different cache entries (cache aliases).
physical address ---
. A cache that works purely on physical addresses is easier to manage,
but raw program (virtual) addresses are available to start the cache
lookup earlier, letting the system run that little bit faster.
(The physical address only exists after the MMU has translated the
virtual address, so a physically indexed lookup starts a little later.)
2), Choice of line size:
When a cache miss occurs, the whole line must be filled from memory.
The larger the line size, the larger the refill (and write-back) latency
each miss incurs.
3), Split/unified:
the I-cache / D-cache question. In a split cache,
the selection is done purely by function, in that instruction
fetches look in the I-cache and data loads/stores in the D-cache. (This
means, by the way, that if you try to execute code which the CPU just
copied into memory you must both flush those instructions out of the
D-cache and ensure they get loaded into the I-cache.)
1.3 Multilevel caches
Many CPUs already use L1/L2/... caches.
The main purpose of a multilevel cache is to reduce the penalty caused by
a cache miss: a miss that hits in L2 is much cheaper than a trip to DRAM.
2, Cache considerations in programming
2.1 DMA operations
2.1.1 Before DMA out of memory
If a device is taking data out of memory, it’s
vital that it gets the right data. If the data cache is write back and a
program has recently written some data, some of the correct data may
still be held in the D-cache but not yet be written back to main memory.
The CPU can’t see this problem, of course; if it looks at the memory
locations it will get the correct data back from its cache.
So before the DMA device starts reading data from memory, any data
for that range of locations that is currently held in the D-cache must be
written back to memory if necessary.
2.1.2 DMA into memory
If a device is loading data into memory, it’s important
to invalidate any cache entries purporting to hold copies of the memory
locations concerned; otherwise, the CPU reading these locations will obtain
stale cached data. The cache entries should be invalidated before
the CPU uses any data from the DMA input stream.
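Both DMA rules can be illustrated with a toy single-line write-back cache model. This is purely illustrative C invented for this note, not kernel code:

```c
#include <assert.h>

/* Toy model: one write-back cache line shadowing one memory word.
 * The DMA engine sees 'memory'; the CPU goes through 'cache'. */
struct toy_line {
    int memory;          /* main memory, visible to the DMA device */
    int cache;           /* the CPU's cached copy */
    int valid, dirty;
};

static void cpu_write(struct toy_line *t, int v)
{
    t->cache = v; t->valid = 1; t->dirty = 1;    /* write back: DRAM untouched */
}

static int cpu_read(struct toy_line *t)
{
    if (!t->valid) { t->cache = t->memory; t->valid = 1; }
    return t->cache;                             /* a hit returns the cached copy */
}

static void writeback(struct toy_line *t)        /* required before DMA *out* */
{
    if (t->valid && t->dirty) { t->memory = t->cache; t->dirty = 0; }
}

static void invalidate(struct toy_line *t)       /* required around DMA *in* */
{
    t->valid = 0; t->dirty = 0;
}
```

Skipping writeback() before a device reads memory hands it stale DRAM contents; skipping invalidate() after a device writes memory leaves the CPU reading stale cached data.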
2.2 Writing instructions
When the CPU itself is storing instructions into
memory for subsequent execution, you must first ensure
that the instructions are written back to memory and
then make sure that the corresponding I-cache locations
are invalidated: The MIPS CPU has no connection between
the D-cache and the I-cache.
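On GCC and Clang, the portable way to request this write-back-plus-invalidate pair from user code is the `__builtin___clear_cache` builtin (on MIPS Linux it is backed by the cacheflush(2) system call). A minimal wrapper:

```c
#include <assert.h>
#include <stddef.h>

/* Write freshly stored instructions back from the D-cache and invalidate
 * the corresponding I-cache range, so later fetches see the new code. */
static void sync_icache(void *start, size_t len)
{
    __builtin___clear_cache((char *)start, (char *)start + len);
}
```

On architectures with coherent I/D caches (e.g. x86) the builtin compiles to nothing, which is exactly the desired behaviour.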
2.3 The linux slab allocator
A linux slab cache contains multiple slabs, all of which are used to
allocate and free objects of the same type (typically objects defined by
the same data structure).
linux backs each slab with one or more physically contiguous page frames.
Objects located at the same offset within different slabs of the same
slab cache are therefore very likely to map to the same cache line, or at
least to the same set of the CPU cache.
The linux slab allocator uses a so-called colour offset to avoid this
problem: as far as possible, each slab of a slab cache is assigned a
different colour offset, and this offset determines where the slab's
first object is placed. This greatly reduces the conflicts described
above.
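A toy version of the colouring arithmetic (this is not the actual kernel code; the line size and the way the colour wraps are assumptions made for illustration):

```c
#include <assert.h>

#define CACHE_LINE 64   /* assumed L1 line size */

/* Give the n-th slab of a slab cache a starting offset for its first
 * object. 'leftover' is the slack space left in the slab after fitting
 * the objects; it bounds how many distinct colours exist before the
 * sequence wraps around. */
static unsigned slab_colour_offset(unsigned slab_seq, unsigned leftover)
{
    unsigned colours = leftover / CACHE_LINE + 1;
    return (slab_seq % colours) * CACHE_LINE;
}
```

Consecutive slabs thus place their first object CACHE_LINE bytes apart, so equal-offset objects in different slabs land in different cache sets instead of all competing for the same one.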
2.4 other
Many data structure definitions in linux carry comments like this one:
/*
* Keep related fields in common cachelines. The most commonly accessed
* field (b_state) goes at the start so the compiler does not generate
* indexed addressing for it.
*/
struct buffer_head {
	/* First cache line: */
	unsigned long b_state;		/* buffer state bitmap (see above) */
	struct buffer_head *b_this_page;/* circular list of page's buffers */
	struct page *b_page;		/* the page this bh is mapped to */
	atomic_t b_count;		/* users using this block */
	u32 b_size;			/* block size */
	sector_t b_blocknr;		/* block number */
	char *b_data;			/* pointer to data block */
	struct block_device *b_bdev;
	bh_end_io_t *b_end_io;		/* I/O completion */
	void *b_private;		/* reserved for b_end_io */
	struct list_head b_assoc_buffers; /* associated with another mapping */
};
Laying out related fields together like this lets them share cache lines,
making access to related data more efficient.
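The "most accessed field at the start" point is checkable with offsetof. A cut-down stand-in struct (fields assumed, mirroring the opening of buffer_head):

```c
#include <assert.h>
#include <stddef.h>

/* When the hottest field is first, its offset is 0, so the compiler
 * addresses it straight through the struct pointer, with no indexed
 * addressing / displacement needed. */
struct bh_like {
    unsigned long b_state;      /* hottest field, offset 0 */
    void *b_this_page;
    void *b_page;
};
```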