[Abstract]
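The paper addresses one problem: place threads that share cached data on the same chip whenever
possible, and avoid co-locating threads that do not share. How is sharing between threads
detected? The paper uses the PMU (performance monitoring unit), which records the cache misses
a thread incurs while it executes a parallel region. In addition, a table called shMap is built
for each thread; it records the number of cache misses and the logical addresses involved.
Finally, comparing the shMaps of the threads yields a similarity score for each pair of threads.
If the similarity reaches a certain threshold, the two threads are placed in the same cluster.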
[Detail]
Benefits of thread clustering:
A benefit of locating sharing threads onto the same chip is
that they incidentally perform prefetching of shared regions for each other. That is,
they help to obtain and maintain frequently used shared regions in the local cache.
Why does thread clustering help?
For the processing units that reside on the same CPU core, communication typically occurs
through a shared L1 cache, with a latency of 1 to 2 cycles.
For processing units that do not reside on the same CPU core but reside on the same chip,
communication typically occurs through a shared L2 cache, with a latency of 10 to 20 cycles.
Processing units that reside on separate chips communicate either by sharing memory or through
a cache-coherence protocol, both with an average latency of hundreds of cycles.
When thread clustering is activated:
Thread clustering will be activated only if the share of remote
cache accesses in the stall breakdown is higher than a
certain threshold. For every one billion cycles,
if 20% of the cycles are spent accessing remote caches, then
the sharing detection phase is entered.
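
The activation rule can be written as a one-line check. This is a minimal sketch, not the paper's implementation: the function name and the idea of passing in a pre-aggregated stall count are assumptions made for illustration; only the one-billion-cycle window and the 20% threshold come from the notes above.

```python
WINDOW_CYCLES = 1_000_000_000   # monitoring window taken from the notes
REMOTE_STALL_THRESHOLD = 0.20   # 20% of the cycles in the window


def should_detect_sharing(remote_stall_cycles: int,
                          window_cycles: int = WINDOW_CYCLES,
                          threshold: float = REMOTE_STALL_THRESHOLD) -> bool:
    """Return True when remote-cache stalls reach the threshold share.

    `remote_stall_cycles` is assumed to be the number of cycles in the
    window that the stall breakdown charged to remote cache accesses.
    """
    if window_cycles <= 0:
        raise ValueError("window_cycles must be positive")
    return remote_stall_cycles / window_cycles >= threshold
```

With these defaults, 250 million remote-stall cycles in a window would trigger the sharing detection phase, while 100 million would not.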
Thread clustering has four phases:
1. Monitoring Stall Breakdown: Using HPCs, CPU
stall cycles are broken down and charged to different
microprocessor components to determine whether
cross-chip communication is performance limiting. If
this is the case, then the second phase is entered.
2. Detecting Sharing Patterns: The sharing pattern
between threads is tracked by using the data sampling
features of the hardware PMU. For each thread, a summary
vector, called shMap, is created that provides a
signature of data regions accessed by the thread that
resulted in cross-chip communication.
3. Thread Clustering: Once sufficient data samples are
collected, the shMaps are analyzed. If threads have a
high degree of data sharing then they will have similar
shMaps and as a result, they will be placed into the
same cluster.
4. Thread Migration: The OS scheduler attempts to
migrate threads so that threads of the same cluster are
as close together as possible.
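
Phase 3 can be illustrated with a simple grouping pass. The notes do not say which clustering algorithm is used, so the greedy strategy below (compare each thread against the first member of every existing cluster) is an assumption, as are the names `cluster_threads` and `similar`. The `similar` argument is a hypothetical predicate supplied by the caller that decides whether two shMaps are close enough.

```python
from typing import Callable, Dict, List, Sequence


def cluster_threads(
    shmaps: Dict[str, Sequence[int]],
    similar: Callable[[Sequence[int], Sequence[int]], bool],
) -> List[List[str]]:
    """Greedily group thread ids whose shMaps are judged similar.

    Each thread joins the first cluster whose representative (its first
    member) is similar to it; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for tid, vec in shmaps.items():
        for cluster in clusters:
            representative = shmaps[cluster[0]]
            if similar(vec, representative):
                cluster.append(tid)
                break
        else:
            # No existing cluster matched, so this thread seeds a new one.
            clusters.append([tid])
    return clusters
```

A scheduler could then try to keep the members of each returned list on the same chip, which is what phase 4 describes.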
How is sharing between threads discovered (Detecting Sharing Patterns)?
We monitor the addresses of the cache lines
that are invalidated due to remote cache-coherence activities
and construct a summary data structure for each thread,
called shMap. Each shMap shows which data items each
thread is fetching from caches on remote chips. We later
compare the shMaps with each other to identify threads that
are actively sharing data and cluster them accordingly.
What is a shMap, essentially?
Each shMap is essentially a vector of 8-bit wide saturating
counters. Each vector is given only 256 of these counters
so as to limit overall space overhead. Each counter corresponds
to a region in the virtual address space.
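
The shMap layout described above can be sketched as a small class. The 256 counters and the 8-bit saturation limit come from the notes; the region size (`REGION_BITS`) and the modulo mapping from region number to counter index are assumptions chosen only to make the example concrete.

```python
NUM_COUNTERS = 256   # vector length from the notes
COUNTER_MAX = 255    # largest value an 8-bit counter can hold
REGION_BITS = 7      # assumption: 128-byte regions (hypothetical choice)


class ShMap:
    """Per-thread vector of 8-bit saturating counters."""

    def __init__(self) -> None:
        # bytearray stores exactly one byte per counter, initialised to 0.
        self.counters = bytearray(NUM_COUNTERS)

    def index_for(self, vaddr: int) -> int:
        # Assumed mapping: region number folded into the vector by modulo.
        return (vaddr >> REGION_BITS) % NUM_COUNTERS

    def record(self, vaddr: int) -> None:
        """Count one sampled remote access, saturating at COUNTER_MAX."""
        i = self.index_for(vaddr)
        if self.counters[i] < COUNTER_MAX:
            self.counters[i] += 1
```

Saturation means a very hot region stops at 255 instead of wrapping back to 0, so heavy sharing is never mistaken for no sharing.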
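Two strategies are used when building a shMap:
temporal sampling and spatial sampling.
Temporal sampling: record one out of every N remote cache accesses. The value of N can be
chosen according to the frequency of remote accesses and the runtime overhead.
Spatial sampling: monitor only part of the address space rather than all of it.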
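
The two sampling strategies can be combined in one filter. This is a rough sketch under stated assumptions: the class name `SampledRecorder`, the fixed period, and the use of a single contiguous address window for spatial sampling are all hypothetical; the notes only say that every Nth remote access is recorded and that only part of the address space is monitored.

```python
from typing import List


class SampledRecorder:
    """Keep every Nth remote access that falls inside a monitored window."""

    def __init__(self, period: int, lo: int, hi: int) -> None:
        if period <= 0:
            raise ValueError("period must be positive")
        self.period = period        # N for temporal sampling
        self.lo, self.hi = lo, hi   # assumed monitored range [lo, hi)
        self.seen = 0               # remote accesses observed so far
        self.samples: List[int] = []

    def on_remote_access(self, vaddr: int) -> bool:
        """Return True if this access was recorded as a sample."""
        self.seen += 1
        if self.seen % self.period != 0:
            return False            # temporal sampling: skip N-1 of every N
        if not (self.lo <= vaddr < self.hi):
            return False            # spatial sampling: outside the window
        self.samples.append(vaddr)
        return True
```

A larger `period` lowers the runtime overhead at the cost of needing more time to collect enough samples.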
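How is the similarity between threads modeled?
The similarity of threads T1 and T2: similarity(T1, T2) = Sum(T1[i] * T2[i]), 0 <= i < n,
where T1[i] is the ith entry of the shMap vector of thread T1 and n is the vector length.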
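
The similarity metric is a dot product of two shMap vectors, which can be checked with a short worked example. The four-entry vectors below are made-up values for illustration, not data from the paper.

```python
from typing import Sequence


def similarity(t1: Sequence[int], t2: Sequence[int]) -> int:
    """Dot product of two shMap vectors of equal length."""
    if len(t1) != len(t2):
        raise ValueError("shMap vectors must have the same length")
    return sum(a * b for a, b in zip(t1, t2))


# Hypothetical 4-entry shMaps: T1 and T2 touch the same regions, T3 does not.
T1 = [3, 0, 5, 0]
T2 = [2, 0, 4, 1]
T3 = [0, 7, 0, 0]

print(similarity(T1, T2))  # 3*2 + 5*4 = 26
print(similarity(T1, T3))  # 0, no region in common
```

An entry contributes to the sum only when both threads have a non-zero counter for the same region, so threads that touch disjoint regions score 0.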
[Contribution]
Our scheme was able to reduce remote cache access stalls by
up to 70% and improve application performance by up to 7%.