1, Overview of the cache mechanism
1.1 What is a direct-mapped / fully associative / N-way set-associative cache?
The cache is subdivided into sets of lines. A cache line is the smallest unit of data transferred between the slow off-chip DRAM and the fast on-chip CPU cache, usually moved in a single burst-mode transaction.
1), At one extreme, the cache can be direct mapped, in which case a line of main memory is always stored at exactly one location in the cache.
2), At the other extreme, the cache is fully associative, meaning that any line of memory can be stored at any location in the cache.
3), Most caches are to some degree N-way set associative, where any line of main memory can be stored in any one of N lines of the cache. For instance, a line of memory can be stored in two different lines of a two-way set-associative cache. (The whole cache consists of multiple sets, and each set contains N cache lines: the N "ways".)
Practical cache implementations are mostly N-way set associative, since they only need N parallel tag comparators for the cache lines in the set selected by the index.
1.2 How is a memory address mapped to the cache?
1.2.1 Basic mapping
A memory address is split into the fields tag + index + offset_in_line. The index selects which set of the cache to look in, typically by a modulo computation: set_no = index MOD (number of sets in the cache)
1), Virtual or physical address for the lookup?
Virtual address: not unique. Multiple processes use the same address ranges, so we'll need to include a field identifying the address space in the cache tag to make sure we don't mix them up. Moreover, the same physical location may be described by different addresses in different tasks; in turn, that might lead to the same memory location being cached in two different cache entries (cache aliases).
Physical address: a cache that works purely on physical addresses is easier to manage (we'll explain why below), but raw program (virtual) addresses are available to start the cache lookup earlier, letting the system run that little bit faster. (A physical address is only available after the MMU has translated the virtual address, so a physical lookup starts slightly later.)
2), Choice of line size: when a cache miss occurs, the whole line must be filled from memory, so the larger the line size, the longer each refill (and write-back) stalls the CPU.
3), Split/unified: the I-cache / D-cache question. The selection is done purely by function, in that instruction fetches look in the I-cache and data loads/stores in the D-cache. (This means, by the way, that if you try to execute code which the CPU just copied into memory, you must both flush those instructions out of the D-cache and ensure they get loaded into the I-cache.)
1.3 Multi-level caches
Many CPUs already employ L1/L2/... caches. The main purpose of multi-level caching is to reduce the penalty incurred by a cache miss.
2, Cache considerations in programming
2.1 DMA operations
2.1.1 Before DMA out of memory
If a device is taking data out of memory, it's vital that it gets the right data. If the data cache is write back and a program has recently written some data, some of the correct data may still be held in the D-cache and not yet be written back to main memory. The CPU can't see this problem, of course; if it looks at the memory locations it will get the correct data back from its cache. So before the DMA device starts reading data from memory, any data for that range of locations that is currently held in the D-cache must be written back to memory if necessary.
2.1.2 DMA into memory
If a device is loading data into memory, it's important to invalidate any cache entries purporting to hold copies of the memory locations concerned; otherwise, the CPU reading these locations will obtain stale cached data. The cache entries should be invalidated before the CPU uses any data from the DMA input stream.
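In Linux, both rules are encapsulated by the streaming DMA-mapping API: on a non-cache-coherent CPU (such as many MIPS parts), dma_map_single() performs the write-back or invalidation described above. A kernel-side sketch, not a complete driver; the dev, buf, and len names are assumed to be set up elsewhere:

```c
/* Kernel-side sketch (assumes dev, buf, len already initialized). */
#include <linux/dma-mapping.h>

/* 2.1.1 DMA out of memory: map with DMA_TO_DEVICE, which writes back
 * any dirty D-cache lines covering buf before the device reads it. */
dma_addr_t bus = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
/* ... program the device to read len bytes from bus ... */
dma_unmap_single(dev, bus, len, DMA_TO_DEVICE);

/* 2.1.2 DMA into memory: map with DMA_FROM_DEVICE, which invalidates
 * stale cache entries for buf before the CPU reads the new data. */
bus = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
/* ... device writes into memory; wait for completion ... */
dma_unmap_single(dev, bus, len, DMA_FROM_DEVICE);
/* Only now may the CPU read buf and see the device's data. */
```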
2.2 Writing instructions
When the CPU itself is storing instructions into memory for subsequent execution, you must first ensure that the instructions are written back to memory and then make sure that the corresponding I-cache locations are invalidated: the MIPS CPU has no connection between the D-cache and the I-cache.
2.3 The Linux slab allocator
A Linux slab cache contains multiple slabs, all used to allocate and free objects of one type (typically objects defined by the same data structure). Linux gives each slab one or more contiguous physical page frames. Objects located at the same offset within different slabs of the same slab cache are very likely to map to the same cache line, or at least to the same set of the CPU cache. The Linux slab allocator uses a "colour offset" technique to mitigate this: as far as possible, each slab of a slab cache is assigned a different colour offset, which determines where the first object in that slab is placed. This greatly reduces the conflicts described above.
2.4 Other
Many data structure definitions in Linux carry comments like this one:

/*
 * Keep related fields in common cachelines. The most commonly accessed
 * field (b_state) goes at the start so the compiler does not generate
 * indexed addressing for it.
 */
struct buffer_head {
	/* First cache line: */
	unsigned long b_state;		/* buffer state bitmap (see above) */
	struct buffer_head *b_this_page;/* circular list of page's buffers */
	struct page *b_page;		/* the page this bh is mapped to */
	atomic_t b_count;		/* users using this block */
	u32 b_size;			/* block size */

	sector_t b_blocknr;		/* block number */
	char *b_data;			/* pointer to data block */

	struct block_device *b_bdev;
	bh_end_io_t *b_end_io;		/* I/O completion */
	void *b_private;		/* reserved for b_end_io */
	struct list_head b_assoc_buffers; /* associated with another mapping */
};

This layout keeps related fields in the same cache lines, making access to them efficient.