
Category: LINUX

2012-09-25 14:05:40

Core i7 Xeon 5500 Series
Data Source Latency (approximate)

L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared, ~40 cycles
L3 CACHE hit, shared line in another core, ~65 cycles
L3 CACHE hit, modified in another core, ~75 cycles
Remote L3 CACHE, ~100-300 cycles
Local DRAM, ~60 ns
Remote DRAM, ~100 ns
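These numbers can be reproduced with a pointer-chasing microbenchmark: each load depends on the previous one, so the time per step approximates the load-to-use latency of whichever level the working set fits in. Below is a minimal C++ sketch; the buffer size, step count, and RNG seed are illustrative choices, not from the original post. Vary n to land the working set in L1, L2, L3, or DRAM.

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    // Working-set size in 8-byte entries; (1<<24) * 8 B = 128 MiB, i.e. DRAM-resident.
    // Shrink n until the set fits in L1/L2/L3 and the per-load time should approach
    // the cycle counts in the table above.
    const size_t n = 1 << 24;
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), 0);

    // Sattolo's algorithm: a single-cycle random permutation, so the chase
    // visits every entry and the hardware prefetcher cannot guess the path.
    std::mt19937_64 rng{42};
    for (size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }

    // Each load depends on the previous one, so time/step ~ load-to-use latency.
    size_t idx = 0;
    const size_t steps = 100000000;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < steps; ++i) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    std::printf("~%.2f ns per dependent load (idx=%zu)\n", ns, idx);  // print idx so the loop isn't optimized away
    return 0;
}

Compile with optimization (e.g. g++ -O2). For scale: on a ~3 GHz part, 4 cycles is about 1.3 ns.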

There are additional issues to concern yourself with, in particular TLB misses, listed in the above guide under DTLB, where TLB stands for Translation Lookaside Buffer. The TLB is a separate, very small cache of virtual-address-to-physical-address mappings. On the listed Core i7 Xeon series this is 64 entries for the primary and 512 entries for the secondary DTLB cache. I did not see any mention of the clock-cycle impact when a reference misses the 64-entry primary DTLB. If a virtual memory reference is not mapped within the cached DTLBs, you might suffer an additional one or two DRAM latencies for the page-table walk.
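One way to make DTLB pressure visible is to touch the same number of cache lines either packed into a few pages or spread one per 4 KiB page: with more pages than the 64 primary DTLB entries, the sparse walk pays a translation penalty on top of the cache behavior. A rough sketch under those assumptions (4 KiB pages and 64-byte lines are typical values, not from the post; the effect is only approximate since out-of-order execution overlaps some of the cost):

#include <chrono>
#include <cstdio>
#include <vector>

// Touch `count` bytes spaced `stride` apart, `reps` times; return ns per load.
static double walk_ns(volatile char* buf, size_t count, size_t stride, size_t reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (size_t r = 0; r < reps; ++r)
        for (size_t i = 0; i < count; ++i)
            (void)buf[i * stride];                 // volatile read forces the load
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / double(reps * count);
}

int main() {
    const size_t kLine = 64, kPage = 4096, kCount = 512;   // 512 pages > 64 primary DTLB entries
    std::vector<char> dense(kCount * kLine);               // 32 KiB: a handful of pages
    std::vector<char> sparse(kCount * kPage);              // 2 MiB: one line per page
    std::printf("dense  (64 B stride):  %.2f ns/load\n", walk_ns(dense.data(),  kCount, kLine, 100000));
    std::printf("sparse (4 KiB stride): %.2f ns/load\n", walk_ns(sparse.data(), kCount, kPage, 100000));
    return 0;
}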

Also, the above DRAM latencies are for the memory access alone, so the latency to determine a cache miss may need to be added to the DRAM latency. DRAM access tends to be pipelined, so the cache-miss latency might get hidden. However... when a specific thread experiences a cache miss, the DRAM request goes into a queue (16/12 deep, depending on other threads' memory requests). Therefore, a specific (worst-case) request could have a latency of up to (3+16) × DRAM latency.
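As a rough worked example, using the ~60 ns local-DRAM figure from the table above: a fully queued worst-case request would cost about (3+16) × 60 ns ≈ 1.14 µs, i.e. nearly twenty times the nominal DRAM latency.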

>> And the last thing: what methods are used to improve the hit ratio of caches?

a) Structure data such that computationally related information resides within the same cache line (see the sketch after this list).
b) Write your algorithms such that the higher-frequency accesses occur on (nearly) adjacent memory. This reduces TLB pressure.
c) Reduce the number of writes to RAM through the use of temporary variables that can be registerized (optimized into registers).
d) Structure data such that you can manipulate it using SSE (when possible).
e) For parallel programming, coordinate activities amongst threads sharing cache levels (HT siblings for L1 and L2, same die or half-die for L3, same NUMA node for multiple nodes).
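As a small illustration of points (a)-(c), here is a hedged sketch; the Particle types and field names are invented for illustration, not from the post. The hot fields the loop actually touches are kept together so they share a cache line, the array is walked sequentially, and updates go through locals the compiler can keep in registers.

#include <cstddef>
#include <cstdio>

// (a) Hot/cold split: the six floats (24 bytes) the inner loop reads fit well
//     within one 64-byte cache line, instead of being interleaved with
//     rarely used "cold" data that would waste line and TLB capacity.
struct ParticleHot  { float x, y, z, vx, vy, vz; };              // touched every step
struct ParticleCold { char name[48]; double mass_history[8]; };  // touched rarely

void integrate(ParticleHot* p, size_t n, float dt) {
    for (size_t i = 0; i < n; ++i) {   // (b) sequential walk over adjacent memory
        // (c) Local temporaries: the compiler keeps these in registers and
        //     writes each particle back once, not once per sub-expression.
        float x = p[i].x + p[i].vx * dt;
        float y = p[i].y + p[i].vy * dt;
        float z = p[i].z + p[i].vz * dt;
        p[i].x = x; p[i].y = y; p[i].z = z;
    }
}

int main() {
    ParticleHot p[4] = {{0, 0, 0, 1, 2, 3}};
    integrate(p, 4, 0.5f);
    std::printf("%.1f %.1f %.1f\n", p[0].x, p[0].y, p[0].z);
    return 0;
}

For point (d), a structure-of-arrays layout (separate x[], y[], z[] arrays) is usually what lets the compiler vectorize such a loop with SSE, since each vector lane then comes from contiguous memory.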
