Category: LINUX

2010-07-16 15:07:52

Paper 1:
autopin: Automated Optimization of Thread-to-Core Pinning on Multicore Systems

[Abstract]
This paper proposes the idea of automatically detecting, with the help of hardware performance counters, an optimal thread-to-core pinning on a shared-memory system. It also demonstrates that the run-time behaviour of many benchmarks is quite sensitive to the thread/core pinning used.

[Details]
How autopin works: a thread-to-core mapping is supplied through the environment variable SCHEDULE. For example, SCHEDULE=2367 means thread 1 is pinned to core 2, thread 2 to core 3, thread 3 to core 6, and thread 4 to core 7. The user thus controls the scheduling mappings through SCHEDULE, and autopin applies them via the sched_setaffinity() system call.
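
A minimal sketch of how such a SCHEDULE string could be turned into affinity calls on Linux; this is my own illustration, not the autopin source, and the helper name pin_thread_from_schedule is hypothetical:

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>

/* Pin the calling thread (1-based index) to the core named by the
 * corresponding digit of SCHEDULE, e.g. SCHEDULE=2367 pins thread 1
 * to core 2, thread 2 to core 3, and so on. */
static int pin_thread_from_schedule(int thread_idx)
{
    const char *sched = getenv("SCHEDULE");
    if (!sched || thread_idx < 1 || thread_idx > (int)strlen(sched))
        return -1;                       /* no mapping for this thread */

    int core = sched[thread_idx - 1] - '0';
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}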

If the user supplies several mappings, autopin probes each of them with the following algorithm:
1. Let the program initialize for i seconds (so that measurement does not start in its serial startup phase).
2. Read the current timestamp ts1 and each thread's performance-counter value pc1.
3. Let the program run for t seconds.
4. Read the current timestamp ts2 and each thread's performance-counter value pc2.
5. Compute each thread's performance rate rj = (pc2 - pc1) / (ts2 - ts1) and the average rate over all threads.
6. If there are mappings left to probe, switch to the next mapping, let the program warm up for w seconds (so that the caches on the newly assigned cores are refilled), and go back to step 2.

Once all mappings have been probed, the mapping with the best average performance rate is selected and used until the program terminates.
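
A rough sketch of the rate computation in step 5, assuming pc1/pc2 hold the per-thread counter readings and ts1/ts2 are timestamps in seconds (the names are mine, not autopin's):

#include <stddef.h>

/* Per-thread performance rates for one mapping and their average;
 * the mapping with the highest average rate wins. */
static double average_rate(const double *pc1, const double *pc2,
                           double ts1, double ts2, size_t nthreads)
{
    double sum = 0.0;
    for (size_t j = 0; j < nthreads; j++)
        sum += (pc2[j] - pc1[j]) / (ts2 - ts1);   /* rj, in events per second */
    return sum / nthreads;
}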

[Contributions]
The gap between the best mapping and the one chosen by autopin lies in [-0.04%, 2.71%]; the gap between autopin and the worst mapping lies in [1.48%, 66.36%].



Paper 2:
Performance-Driven Processor Allocation

[Abstract]
This paper targets shared-memory multiprocessor systems and addresses the problem of allocating processors to each application without any involvement of the user program. While an application runs, certain of its characteristics are measured and then used to drive the processor allocation. Two components carry out this work:
  SelfAnalyzer: dynamically analyzes the application's performance (speedup, efficiency and execution time).
  PDPA, the Performance-Driven Processor Allocation policy: allocates CPUs based on global system conditions and the characteristics of each application. Concretely, SelfAnalyzer measures two quantities: the speedup achieved with p threads and the execution time.

[Details]
Experimental platform: an Origin2000 with 64 processors. All the benchmarks used in the evaluation are parallelized with OpenMP [OpenMP2000] directives.

Definitions: speedup s(P) = t(1)/t(P), where t(1) is the execution time on one processor and t(P) the execution time on P processors.
     Efficiency E(P) = s(P)/P, i.e. the average speedup contributed by each processor.

How SelfAnalyzer works:
SelfAnalyzer is a run-time library that dynamically computes the speedup of a parallel region and estimates the total execution time of the application.
Earlier studies have shown that such programs spend most of their time in a few parallel loop regions, so measuring one loop region allows the behaviour of the others to be extrapolated.
Speedup and execution time are measured as follows:
First, the initial iterations of a loop region are executed to obtain t(baseline), t(p) and the time consumed so far, ConsumedTime, from which the speedup s(p) is computed.
Then the total execution time is estimated as ExTime(p) = ConsumedTime + t(baseline)/s(p) * ItersRemaining.
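
A small illustration of that estimate, with hypothetical variable names of my own:

/* Estimate of the total execution time on p processors after measuring the
 * first iterations of a parallel loop (illustrative only). */
static double estimate_extime(double consumed_time,    /* time spent so far       */
                              double t_baseline,       /* baseline iteration time */
                              double speedup_p,        /* measured s(p)           */
                              double iters_remaining)  /* loop iterations left    */
{
    return consumed_time + t_baseline / speedup_p * iters_remaining;
}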

To take these measurements, SelfAnalyzer has to hook the following points in the program: the start of the program, the start of each loop, and the start and end of every parallel loop. If the benchmark's source code is available, the instrumentation is inserted directly into the code; otherwise it is added through dynamic interception.

How PDPA works:
PDPA assigns each application one of four states: NO_REF (the initial state), DEC, INC, and STABLE.
The detailed scheduling policy is given in Figure 4 of the paper.

At regular intervals the functions MoreProcessors() and LessProcessors() are evaluated to decide whether to change the application's state.

/* Decide whether the application should be given more processors. */
MoreProcessors()
{
    RelativeSpeedup = ExTime(LastAllocation) / ExTime(current)
    IncrementProcessors = current / LastAllocation
    if (Efficiency(current) >= high_eff &&
        Speedup(current) > Speedup(LastAllocation) &&
        RelativeSpeedup >= IncrementProcessors * high_eff) return TRUE
    else return FALSE
}

/* Decide whether processors should be taken away from the application. */
LessProcessors()
{
    if (Efficiency(current) < low_eff) return TRUE
    else return FALSE
}

[Contributions]
(1) a new approach to measuring the speedup and the execution time of parallel applications, the SelfAnalyzer;
(2) a new scheduling policy that uses the speedup and the execution time to distribute processors, the Performance-Driven Processor Allocation (PDPA) policy;
(3) a new approach to coordinating the (medium-term) scheduler with the queueing system (long-term scheduler).

The authors report that their proposal can achieve a speedup of 3.9 with respect to an equipartition policy and of 11.8 with respect to the native IRIX scheduling.



Paper 3:
Thread Clustering: Sharing-Aware Scheduling on SMP-CMP-SMT Multiprocessors

[Abstract]
The problem this paper addresses: place threads that share data in the cache on the same chip, and avoid co-locating threads that do not share. How is cache sharing between threads detected? The paper uses the PMU (performance monitoring unit), which records the cache misses a thread generates while it executes a parallel region. For each thread a table called a shMap is built, recording how often cache misses occurred and at which logical addresses. Finally, by comparing the shMaps of the threads, a pairwise similarity is obtained; if the similarity exceeds a certain threshold, the two threads are considered to belong in the same cluster.

[Details]
Benefits of thread clustering:
A benefit of locating sharing threads on the same chip is that they incidentally perform prefetching of shared regions for each other; that is, they help to obtain and maintain frequently used shared regions in the local cache.

Why does this benefit exist?
  For processing units that reside on the same CPU core, communication typically occurs through a shared L1 cache, with a latency of 1 to 2 cycles.
  For processing units that do not reside on the same CPU core but reside on the same chip, communication typically occurs through a shared L2 cache, with a latency of 10 to 20 cycles.
  Processing units that reside on separate chips communicate either by sharing memory or through a cache-coherence protocol, both with an average latency of hundreds of cycles.

When is thread clustering activated?
Thread clustering is activated only if the share of remote cache accesses in the stall breakdown exceeds a certain threshold; for example, if 20% of the cycles in a window of one billion cycles are spent accessing remote caches, the sharing-detection phase is entered.
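
A minimal sketch of that activation check, using the 20%-per-billion-cycles example above; the counter and constant names are my own, not the paper's implementation:

/* Enter the sharing-detection phase when remote cache accesses account for
 * more than 20% of the cycles in the last one-billion-cycle window. */
#define WINDOW_CYCLES    1000000000ULL
#define REMOTE_THRESHOLD 0.20

static int should_enter_sharing_detection(unsigned long long remote_stall_cycles)
{
    return (double)remote_stall_cycles / (double)WINDOW_CYCLES > REMOTE_THRESHOLD;
}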

Thread clustering proceeds in four phases:
1. Monitoring Stall Breakdown: Using HPCs, CPU stall cycles are broken down and charged to different microprocessor components to determine whether cross-chip communication is performance limiting. If this is the case, then the second phase is entered.

2. Detecting Sharing Patterns: The sharing pattern between threads is tracked by using the data sampling features of the hardware PMU. For each thread, a summary vector, called shMap, is created that provides a signature of the data regions accessed by the thread that resulted in cross-chip communication.

3. Thread Clustering: Once sufficient data samples are
collected, the shMaps are analyzed. If threads have a
high degree of data sharing then they will have similar
shMaps and as a result, they will be placed into the
same cluster.

4. Thread Migration: The OS scheduler attempts to
migrate threads so that threads of the same cluster are
as close together as possible.

How is sharing between threads detected (Detecting Sharing Patterns)?
The authors monitor the addresses of the cache lines that are invalidated due to remote cache-coherence activity and construct a summary data structure, called a shMap, for each thread. Each shMap shows which data items the thread is fetching from caches on remote chips. The shMaps are later compared with each other to identify threads that are actively sharing data, and those threads are clustered accordingly.

What exactly is a shMap?
Each shMap is essentially a vector of 8-bit saturating counters. Each vector is given only 256 of these counters so as to limit the overall space overhead, and each counter corresponds to a region of the virtual address space.

Two sampling strategies are used when building the shMaps: temporal sampling and spatial sampling.
Temporal sampling: only every N-th remote cache access is recorded; N is chosen according to the frequency of remote accesses and the acceptable run-time overhead.
Spatial sampling: rather than monitoring the entire address space, only selected regions are monitored.

How is the similarity between threads modelled?
The similarity of threads T1 and T2 is obtained by combining the corresponding shMap entries, essentially as a dot product: similarity(T1, T2) = sum of T1[i] * T2[i] over 0 <= i < n, where i indexes the entries of the shMap vector Tx[ ].
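
A compact sketch of the shMap idea described above; the region hash, the sampling hook and the dot-product similarity follow the text, but the exact details are my assumptions rather than the paper's implementation:

#include <stdint.h>
#include <stddef.h>

#define SHMAP_ENTRIES 256              /* 256 counters of 8 bits each per thread */

typedef struct {
    uint8_t counter[SHMAP_ENTRIES];
} shmap_t;

/* Record one sampled remote-cache access: hash the virtual address to one of
 * the 256 regions and bump the corresponding saturating counter. */
static void shmap_record(shmap_t *m, uintptr_t vaddr)
{
    size_t idx = (vaddr >> 6) % SHMAP_ENTRIES;      /* cache-line granularity */
    if (m->counter[idx] < UINT8_MAX)
        m->counter[idx]++;
}

/* Similarity of two threads' shMaps, taken here as the dot product of the
 * counter vectors: large values mean the threads hit the same shared regions. */
static unsigned long shmap_similarity(const shmap_t *a, const shmap_t *b)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < SHMAP_ENTRIES; i++)
        sum += (unsigned long)a->counter[i] * b->counter[i];
    return sum;
}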

[Contributions]
The authors' scheme was able to reduce remote cache access stalls by up to 70% and improve application performance by up to 7%.



Paper 4:
Computation Spreading: Employing Hardware
Migration to Specialize CMP Cores On-the-fly


[Abstract]
The problem this paper addresses: the many threads of a program tend to perform very similar computations, so there is a great deal of code reuse across processors, which produces redundancy; at the same time, many independent computation fragments compete for the cache and other resources, which causes interference.

The proposed solution, CSP (computation spreading), distributes the dissimilar computation fragments of different threads onto different cores and groups the similar fragments together on the same cores; thread migration is carried out in hardware.

[Details]
Characteristics observed from an analysis of four benchmarks:
1. The code executed on different cores shows a high degree of similarity.
2. Very little of the code executed on any single core is exclusive to that core.
3. OS code shows more of this similarity than user code.

Thread Assignment Policy (TAP):
Only the user-mode part and the kernel-mode part of a thread's execution are separated onto different cores.

Syscall Assignment Policy (SAP):
OS system calls are assigned to specific, dedicated cores.
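
A conceptual illustration of how the two policies choose a target core; the paper implements this through hardware migration, so this sketch, including the core counts and the user/OS core split, is purely my own reading of the policies:

/* Illustrative core-selection rules; NUM_USER_CORES/NUM_OS_CORES are assumptions. */
#define NUM_USER_CORES 4
#define NUM_OS_CORES   4

/* TAP: a thread's user-mode work runs on a user core and its kernel-mode work
 * on an OS core, chosen per thread. */
static int tap_target_core(int thread_id, int in_kernel)
{
    return in_kernel ? NUM_USER_CORES + thread_id % NUM_OS_CORES
                     : thread_id % NUM_USER_CORES;
}

/* SAP: kernel-mode work is placed by system-call number instead, so the same
 * syscall from any thread lands on the same OS core and reuses its footprint. */
static int sap_target_core(int thread_id, int in_kernel, int syscall_nr)
{
    return in_kernel ? NUM_USER_CORES + syscall_nr % NUM_OS_CORES
                     : thread_id % NUM_USER_CORES;
}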

[Contributions]
1. Characterized the code-reuse behaviour of four server applications and the way similar computation fragments cooperate.
2. Proposed core-assignment policies for CMPs.
3. Demonstrated that TAP and SAP can reduce L2 instruction misses by 27-58%, while L2 load misses see a 0-19% decrease; branch mispredictions are reduced by 9-25%, and on-chip interconnect bandwidth is reduced by 0-12%. Overall, these policies result in a performance improvement of 1-20%.



Paper 5:
Interference Aware Cache Designs for Operating System Execution

[Abstract]
What this paper does: keep the benefits of off-loading OS execution while reducing its drawbacks. In essence, the benefit of off-loading is the reduced cache interference, while the drawback is the high latency cost of the off-load itself and of cache coherence.

The paper proposes an OS cache per core:
The proposed solution employs a special OS cache per core and improves performance by up to 18% for OS-intensive workloads without any significant addition of transistors.

[Details]
In OS-intensive workloads, the cache interference caused by the OS is severe. To address this, a few groups have argued that off-loading OS syscalls to a customized OS core can yield performance and power efficiency. The authors therefore design a dynamic off-load policy that is cognizant of syscall length and cache-interference effects; in other words, a syscall is off-loaded only when its length exceeds a certain threshold.

To design a robust off-load mechanism,
we need an estimation of that program’s optimal switch trigger and a hardware predictor
that estimates the length of an OS system call.
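
A minimal sketch of that decision, assuming the predictor and the per-program switch trigger are given in cycles (the names are my own, not the paper's):

/* Off-load a system call to the dedicated OS core only when its predicted
 * length exceeds the program's switch trigger; short syscalls stay on the
 * user core because the migration cost quoted below would dominate them. */
static int should_offload(unsigned long predicted_syscall_cycles,
                          unsigned long switch_trigger_cycles)
{
    return predicted_syscall_cycles > switch_trigger_cycles;
}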

Migration cost:
On modern hardware, the minimal process migration cost is approximately 5,000-10,000 cycles.

What does the OS cache design look like?
The paper's uni-processor baseline implements traditional split L1 instruction and data caches that are shared by both OS and user code; the proposed design adds the per-core OS cache on top of this.

                
[Contributions]
This option yields significant improvements: up to 53% if one is willing to increase the transistor budget (similar to the off-loading approach, which adds a separate core), and up to 18% while preserving the same transistor budget as the baseline.

