Category: LINUX

2010-07-16 15:07:52

Paper 1:
autopin: Automated Optimization of Thread-to-Core Pinning on Multicore Systems

[Abstract]
This paper proposes the following idea: on a shared-memory system, use performance counters to automatically detect and produce an optimal binding of threads to cores. It also demonstrates that the runtime of many benchmarks is sensitive to the thread/core binding.

[Details]
How autopin works: an environment variable SCHEDULE pairs threads with cores. For example, SCHEDULE=2367 means thread 1 is bound to core 2, thread 2 to core 3, thread 3 to core 6, and thread 4 to core 7. Through SCHEDULE, the user controls the scheduling mappings, and autopin applies them via the sched_setaffinity() system call.
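
A minimal sketch, assuming a Linux system, of how one digit of such a SCHEDULE string can be turned into a sched_setaffinity() call; this is only an illustration, not the actual autopin code, and the helper name pin_to_core is an assumption:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling thread (pid 0 = current) to a single core. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    const char *sched = getenv("SCHEDULE");          /* e.g. "2367" */
    if (!sched || sched[0] == '\0')
        return 1;

    /* autopin would apply one digit per worker thread; here we only
     * pin the main thread to the core named by the first digit. */
    int core = sched[0] - '0';
    if (pin_to_core(core) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("thread 1 pinned to core %d\n", core);
    return 0;
}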

If the user provides several mappings, autopin probes each of them with the following algorithm:
1. Let the program initialize for i seconds (to avoid measuring while the program is still in a serial region).
2. Read the current timestamp ts1 and each thread's performance counter value pc1.
3. Let the program run for t seconds.
4. Read the current timestamp ts2 and each thread's performance counter value pc2.
5. Compute each thread's performance rate rj = (pc2 - pc1) / (ts2 - ts1) and the average performance rate over all threads.
6. If there are more mappings to probe, let the program warm up for w seconds and go back to step 2 (the warm-up lets each core refill its cache).

Once every mapping has been probed, the mapping with the best average performance rate is selected and used until the program terminates (a small sketch of this probing logic follows below).
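
A minimal sketch of the rate computation in steps 2-5 above; the function name probe_mapping is illustrative, not from the autopin source:

/* Average performance rate of one candidate mapping:
 * r_j = (pc2 - pc1) / (ts2 - ts1) per thread, then the mean over threads. */
double probe_mapping(int nthreads,
                     const unsigned long long pc1[], double ts1,
                     const unsigned long long pc2[], double ts2)
{
    double sum = 0.0;
    for (int j = 0; j < nthreads; j++)
        sum += (double)(pc2[j] - pc1[j]) / (ts2 - ts1);
    return sum / nthreads;   /* the mapping with the highest value wins */
}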

[Contribution]
The gap between the best mapping and autopin is within [-0.04%, 2.71%]; the gap between autopin and the worst mapping is within [1.48%, 66.36%].



Paper 2:
Performance-Driven Processor Allocation

[Abstract]
This paper targets shared-memory multiprocessor systems and solves the per-application processor allocation problem without involving the user program. While an application runs, certain of its characteristics are measured and then used to drive processor allocation. Two components do the work:
  SelfAnalyzer: dynamically analyzes application performance (speedup, efficiency, and execution time).
  PDPA: the Performance-Driven Processor Allocation policy, which considers global system conditions and per-application characteristics when allocating CPUs. Concretely, SelfAnalyzer measures two parameters: the speedup with P threads and the execution time.

[Details]
Experimental environment: Origin2000 with 64 processors. All benchmarks used in the evaluation are parallelized with OpenMP [OpenMP2000] directives.

Definitions: speedup s(P) = t(1)/t(P), where t(1) is the execution time on one processor and t(P) the execution time on P processors.
Efficiency E(P) = s(P)/P, i.e., the average speedup contributed by each processor.

How SelfAnalyzer works:
Dynamic performance analysis: SelfAnalyzer is a runtime library that dynamically computes the speedup of a parallel region and estimates the total execution time of the program.
Prior work has shown that most of a program's time is spent in a few parallel loop regions, so one loop region can be used to estimate the others.
Speedup and execution time are measured as follows.
First, execute a few initial iterations of the loop region to obtain t(baseline), t(P), and ConsumedTime, from which the speedup s(P) is computed.
Then, estimate the total execution time as ExTime(P) = ConsumedTime + t(baseline)/s(P) * ItersRemaining.

To take these measurements, SelfAnalyzer must locate the following points in the program: the start of the program, the start of each loop, and the start and end of each parallel loop. If the benchmark source code is available these points can be instrumented directly; otherwise they are detected dynamically.
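
A small sketch of the estimate above; the variable names are illustrative:

/* ExTime(P) = ConsumedTime + t(baseline)/s(P) * ItersRemaining,
 * with s(P) = t(baseline) / t(P) measured on the first iterations. */
double estimate_extime(double t_baseline,      /* reference time of the instrumented loop */
                       double t_p,             /* time of the same loop on P processors   */
                       double consumed_time,   /* time already spent by the application   */
                       long iters_remaining)   /* loop iterations still to execute        */
{
    double speedup = t_baseline / t_p;          /* s(P) */
    return consumed_time + (t_baseline / speedup) * iters_remaining;
}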

How PDPA works:
PDPA assigns each application one of four states: NO_REF (the initial state), DEC, INC, and STABLE.
The detailed scheduling transitions are given in Figure 4 of the paper.

At regular intervals the functions MoreProcessors() or LessProcessors() are evaluated to change the application's state:

MoreProcessors()
{
    RelativeSpeedup = ExTime(LastAllocation) / ExTime(current)
    IncrementProcessors = current / LastAllocation
    if (Efficiency(current) >= high_eff &&
        Speedup(current) > Speedup(LastAllocation) &&
        RelativeSpeedup >= IncrementProcessors * high_eff)
        return TRUE
    else
        return FALSE
}

LessProcessors()
{
    if (Efficiency(current) < low_eff)
        return TRUE
    else
        return FALSE
}
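
The exact state machine is given in Figure 4 of the paper; as a purely illustrative sketch (all names and the step logic below are assumptions, not the paper's algorithm), the periodic evaluation can be thought of as adjusting the allocation like this:

typedef enum { NO_REF, DEC, INC, STABLE } pdpa_state_t;   /* the four PDPA states */

/* One scheduling quantum: grow, shrink, or freeze the allocation.
 * 'more'/'less' are the results of MoreProcessors()/LessProcessors();
 * 'step' is the allocation increment used by the allocator. */
int next_allocation(int current, int step, int more, int less)
{
    if (more)
        return current + step;   /* keep searching upward (toward INC)   */
    if (less)
        return current - step;   /* give processors back (toward DEC)    */
    return current;              /* performance acceptable (STABLE)      */
}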

[Contribution]
(1) A new approach to measure the speedup and the execution time of parallel applications, the SelfAnalyzer;
(2) a new scheduling policy that uses the speedup and the execution time to distribute processors, the Performance-Driven Processor Allocation (PDPA) policy;
(3) a new approach to coordinating the (medium-term) scheduler with the queueing system (long-term scheduler).

The proposal can achieve a speedup of 3.9 with respect to an equipartition and 11.8 with respect to the native IRIX scheduling.



Paper 3:
Thread Clustering: Sharing-Aware Scheduling on SMP-CMP-SMT Multiprocessors

[Abstract]
The problem this paper addresses: place threads that share cached data on the same chip, and avoid placing threads that do not share data on the same chip. How is sharing between threads detected? The paper uses the PMU (performance monitoring unit), which records the cache misses a thread incurs while executing a parallel region. In addition, each thread builds a table called shMap that records the number of cache misses and the logical addresses at which they occurred. Finally, by comparing the threads' shMaps, the similarity between threads is obtained; if the similarity between two threads exceeds a threshold, they are considered candidates for the same cluster.

[Detail]
Benefits of thread clustering:
A benefit of locating sharing threads on the same chip is that they incidentally perform prefetching of shared regions for each other. That is, they help to obtain and maintain frequently used shared regions in the local cache.

Why do these benefits exist?
  For processing units that reside on the same CPU core, communication typically occurs through a shared L1 cache, with a latency of 1 to 2 cycles.
  For processing units that do not reside on the same CPU core but reside on the same chip, communication typically occurs through a shared L2 cache, with a latency of 10 to 20 cycles.
  Processing units that reside on separate chips communicate either by sharing memory or through a cache-coherence protocol, both with an average latency of hundreds of cycles.

Condition for activating thread clustering:
Thread clustering is activated only if the share of remote cache accesses in the stall breakdown is higher than a certain threshold. For example, for every one billion cycles, if 20% of the cycles are spent accessing remote caches, the sharing-detection phase is entered (a small sketch of this check follows below).
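
A minimal sketch of this activation check; the function name is an assumption:

#include <stdbool.h>

/* Enter the sharing-detection phase if remote-cache-access stalls exceed
 * the threshold (20% in the example above) over the last monitoring
 * window (e.g. one billion cycles). */
static bool should_detect_sharing(unsigned long long remote_stall_cycles,
                                  unsigned long long window_cycles)
{
    return (double)remote_stall_cycles / (double)window_cycles > 0.20;
}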

Thread clustering has four phases:
1. Monitoring Stall Breakdown: Using HPCs, CPU stall cycles are broken down and charged to different microprocessor components to determine whether cross-chip communication is performance limiting. If this is the case, then the second phase is entered.

2. Detecting Sharing Patterns: The sharing pattern between threads is tracked by using the data sampling features of the hardware PMU. For each thread, a summary vector, called shMap, is created that provides a signature of the data regions accessed by the thread that resulted in cross-chip communication.

3. Thread Clustering: Once sufficient data samples are collected, the shMaps are analyzed. If threads have a high degree of data sharing then they will have similar shMaps and, as a result, they will be placed into the same cluster.

4. Thread Migration: The OS scheduler attempts to migrate threads so that threads of the same cluster are as close together as possible.

How is sharing between threads detected (Detecting Sharing Patterns)?
The addresses of cache lines that are invalidated due to remote cache-coherence activity are monitored, and a summary data structure, called an shMap, is constructed for each thread. Each shMap shows which data items the thread is fetching from caches on remote chips. The shMaps are later compared with each other to identify threads that are actively sharing data and to cluster them accordingly.

What is an shMap, essentially?
Each shMap is essentially a vector of 8-bit saturating counters. Each vector is given only 256 of these counters so as to limit the overall space overhead. Each counter corresponds to a region in the virtual address space.
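
A minimal sketch of such a vector of saturating counters; the region-to-counter mapping below is only a placeholder and the paper's actual hashing may differ:

#include <stdint.h>

#define SHMAP_ENTRIES 256             /* 256 counters per thread */

typedef struct {
    uint8_t counter[SHMAP_ENTRIES];   /* 8-bit saturating counters */
} shmap_t;

/* Record one sampled remote-miss address: map it to a region and
 * increment that region's counter, saturating at 255. */
static void shmap_record(shmap_t *m, uintptr_t addr)
{
    unsigned idx = (unsigned)((addr >> 6) % SHMAP_ENTRIES);   /* cache-line granularity */
    if (m->counter[idx] < UINT8_MAX)
        m->counter[idx]++;
}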

Two sampling strategies are used when building the shMaps: temporal sampling and spatial sampling.
Temporal sampling: only every Nth remote cache access is recorded; N can be tuned according to the frequency of remote accesses and the acceptable runtime overhead.
Spatial sampling: rather than monitoring the entire address space, only selected regions are monitored.

How is similarity between threads modeled?
Similarity between threads T1 and T2: similarity(T1, T2) = sum over i of T1[i] * T2[i], for 0 <= i < n, where i indexes the entries of the shMap vector Tx[].
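
A small sketch of this similarity computation, treating it as a dot product of the two 256-entry shMap counter vectors:

#include <stdint.h>

#define SHMAP_ENTRIES 256

/* similarity(T1, T2) = sum over i of T1[i] * T2[i] */
static unsigned long shmap_similarity(const uint8_t t1[SHMAP_ENTRIES],
                                      const uint8_t t2[SHMAP_ENTRIES])
{
    unsigned long sum = 0;
    for (int i = 0; i < SHMAP_ENTRIES; i++)
        sum += (unsigned long)t1[i] * t2[i];
    return sum;   /* a larger value means more sharing between the two threads */
}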

[Contribution]
The scheme was able to reduce remote cache access stalls by up to 70% and improve application performance by up to 7%.



Paper 4:
Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly


[Abstract]
The problem this paper addresses: the many threads of a program often perform very similar computation, so there is substantial code reuse across processors, which creates redundancy; on the other hand, many independent computation fragments compete for resources such as caches, which causes interference.

The proposed solution, CSP (computation spreading), distributes the dissimilar computation fragments of different threads onto different cores and groups similar fragments together on the same cores. Thread migration is implemented in hardware.

[Detail]
Characteristics observed from analyzing four benchmarks:
1. The code executed on different cores shows a high degree of similarity;
2. Very little code is exclusive to a single core;
3. OS code shows more similarity across cores than user code.

Thread Assignment Policy (TAP):
Only separates each thread's user-mode and kernel-mode execution onto different cores.

Syscall Assignment Policy (SAP):
Assigns OS system calls to specific dedicated cores.

[Contribution]
1. Characterized the code reuse of four server applications and the cooperation among their similar computation fragments;
2. Proposed core assignment policies for CMPs;
3. Demonstrated that TAP and SAP can reduce L2 instruction misses by 27-58%, while L2 load misses see a 0-19% decrease. Branch mispredictions are reduced by 9-25%, and on-chip interconnect bandwidth is reduced by 0-12%. Overall, these policies result in a performance improvement of 1-20%.



Paper 5:
Interference Aware Cache Designs for Operating System Execution

[Abstract]
What this paper does: exploit the benefits of off-loading while reducing its costs. Essentially, the benefit of off-loading is reduced cache interference, and the drawback is the high latency cost of off-loading and cache coherence.

The paper proposes an OS cache per core:
Our proposed solution employs a special OS cache per core and improves performance by
up to 18% for OS-intensive workloads without any significant addition of transistors.

[Detail]
In OS-intensive workloads, cache interference from OS code is severe. To address this, a few groups have argued that off-loading OS syscalls to a customized OS core can yield performance and power efficiency. The paper therefore designs a dynamic off-load policy that is cognizant of syscall length and cache-interference effects: a syscall is off-loaded only when its length exceeds a certain threshold.

To design a robust off-load mechanism,
we need an estimation of that program’s optimal switch trigger and a hardware predictor
that estimates the length of an OS system call.
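
A minimal sketch of such a decision; the names are assumptions for illustration, not from the paper:

#include <stdbool.h>

/* Off-load a syscall to the OS core only when its predicted length
 * exceeds the per-program switch trigger; otherwise the migration cost
 * (roughly 5,000-10,000 cycles, see below) would outweigh the reduced
 * cache interference. */
static bool should_offload(unsigned long predicted_syscall_cycles,
                           unsigned long switch_trigger_cycles)
{
    return predicted_syscall_cycles > switch_trigger_cycles;
}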

Migration overhead:
On modern hardware, the minimal process migration cost is approximately 5,000-10,000 cycles.

What does the OS cache look like?
The baseline uni-processor design implements traditional split L1 instruction and data caches that are shared by both OS and user code.

                
[Contribution]
This option yields significant improvements: up to 53% if one is willing to increase the transistor budget (similar to the off-loading approach, which adds a separate core), and up to 18% while preserving the same transistor budget as the baseline.

