BIS in Multithreaded Applications (Consolidated Notes) - xuyuanchao

XUYUANCHAO teaching blog: xuyuanchao.blog.chinaunix.net

Category: LINUX

2013-01-29 14:30:44

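1. Introduction

Our key idea is thus simple: measure the number of cycles spent by threads waiting for each bottleneck and accelerate the bottlenecks responsible for the highest thread waiting cycles.

In other words: measure how many cycles threads spend waiting on each bottleneck, and accelerate the bottlenecks that account for the most thread waiting cycles.

This solution is too costly because (a) writing correct parallel programs is already a daunting task, and (b) serializing bottlenecks change with machine configuration, program input set, and program phase (as we show in Section 2.2), thus, what may seem like a bottleneck to the programmer may not be a bottleneck in the field and vice versa.

This approach is costly because (a) writing a correct parallel program is already a daunting task for the programmer, and (b) the serializing bottlenecks change with the machine configuration, the program input set, and the program phase, so whether a given piece of code is actually a bottleneck is not fixed.

The programmer, compiler or library delimits potential bottlenecks using BottleneckCall and BottleneckReturn instructions, and replaces the code that waits for bottlenecks with a BottleneckWait instruction.

That is, the programmer uses the BottleneckCall and BottleneckReturn instructions to mark off potential bottlenecks, and a BottleneckWait instruction to replace the code that waits for a bottleneck.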

The bottlenecks with the highest number of thread waiting cycles are selected for acceleration on one or more large cores. On executing a BottleneckCall instruction, the small core checks if the bottleneck has been selected for acceleration.

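That is, the bottlenecks with the most thread waiting cycles are selected to be accelerated on one or more large cores. When it executes a BottleneckCall (BC) instruction, the small core checks whether the bottleneck has been selected for acceleration.

However, it only applies to barriers in statically scheduled workloads, where the work to be performed by each thread is known before runtime.

That is, it applies only to barriers under static scheduling.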

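3.3 Accelerating Bottlenecks

BIS consists of two parts: identification of critical bottlenecks and acceleration of those bottlenecks.

In short, BIS has two parts: identifying the critical bottlenecks, and accelerating them.

Identification of critical bottlenecks is done in hardware based on information provided by the software.

That is, the critical bottlenecks are identified in hardware, using information that the software provides.

There are multiple ways to accelerate a bottleneck, e.g. increasing core frequency, giving a thread higher priority in shared hardware resources, or migrating the bottleneck to a faster core with a more aggressive microarchitecture or higher frequency.

That is, a bottleneck can be accelerated in several ways, for example by raising the core frequency, by giving a thread higher priority in shared hardware resources, or by moving the bottleneck to a core with a more aggressive microarchitecture or a higher frequency.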

Questions:

1. In "However, these proposals lack generality and fine-grained adaptivity.", how should "fine-grained" be understood? (Literally, fine granularity.)

The benefit of parallel computing comes from concurrent execution of work; the higher the concurrency, the higher the performance. Every time threads have to wait for each other, less work gets done in parallel, which reduces parallel speedup and wastes opportunity. To maximize performance and increase concurrency, it is pivotal to minimize thread waiting as much as possible. (That is: the benefit of parallel computing comes from executing work concurrently, and more concurrency means better performance. When threads wait for each other, parallel speedup drops, so maximizing performance and concurrency means minimizing thread waiting.)

Parallel program design should therefore work hard to reduce the bottlenecks that make threads wait.

Proposal: identification of critical bottlenecks and acceleration of those bottlenecks

Identifying bottlenecks and accelerating bottlenecks

Identifying bottlenecks

Identification of critical bottlenecks is done in hardware based on information provided by the software. (The critical bottlenecks are identified in hardware, based on information provided by the software.)

The hardware keeps track of the bottlenecks and which hardware threads they execute on.

Code modifications around bottlenecks:

Three kinds of bottleneck, and what watch-addr is for each:

Critical section: watch-addr can be the lock address

Barrier: it can be the address of the counter of threads that have reached the barrier

Pipeline: it can be the address of the size of the queue

I only partly understand the following three instructions.

BottleneckCall bid, targetPC:

Marks the beginning of the bottleneck identified by bid and calls the bottleneck subroutine starting at targetPC.

BottleneckReturn bid:

Marks the end of the bottleneck identified by bid and returns from the bottleneck subroutine.

BottleneckWait bid:

Waits for a maximum of timeout cycles for the content of memory address watch-addr associated with bottleneck bid to change, while keeping track of the number of waiting cycles.
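To make the BottleneckWait description more concrete for myself, here is a minimal software sketch in Python. It is not the hardware mechanism: the `memory` dict, the `writes` schedule that stands in for stores by other threads, and the function name `bottleneck_wait` are all assumptions made for illustration.

```python
def bottleneck_wait(memory, watch_addr, timeout, writes=None):
    """Toy model of BottleneckWait.

    Spins for at most `timeout` cycles until memory[watch_addr] changes
    and returns the number of cycles spent waiting.

    `writes` (hypothetical) maps a cycle number to an (addr, value) store
    performed by some other thread during that cycle.
    """
    writes = writes or {}
    initial = memory[watch_addr]
    waited = 0
    while waited < timeout:
        # Apply the store, if any, that another thread performs this cycle.
        if waited in writes:
            addr, value = writes[waited]
            memory[addr] = value
        # Stop as soon as the watched location has changed.
        if memory[watch_addr] != initial:
            break
        waited += 1
    return waited


# A lock word at a made-up address; another thread releases it at cycle 5.
lock_memory = {0x100: 1}
cycles = bottleneck_wait(lock_memory, 0x100, timeout=100, writes={5: (0x100, 0)})
print(cycles)  # 5
```

The returned count is what would be added to the thread waiting cycles of bottleneck bid. For a critical section the watched address would be the lock, for a barrier the arrival counter, and for a pipeline the queue size, as listed above.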

Accelerate Bottlenecks:

Thread waiting in the Scheduling Buffer

Avoiding false serialization and starvation

Preemptive acceleration

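Thread waiting in the Scheduling Buffer

A thread executing a BottleneckCall instruction that was sent to the large core and is waiting on the SB also incurs thread waiting cycles that must be attributed to the bottleneck. (That is, when a thread's bottleneck has been sent to the large core and is waiting in the Scheduling Buffer, the cycles it waits there must also be charged to that bottleneck.)

Avoiding false serialization and starvation

False serialization arises because several bottlenecks may be scheduled on the same large core: a bottleneck may have to wait a long time behind another bottleneck with higher priority, and in that case it may be scheduled on a small core instead.

What exactly does starvation mean here? I do not understand.

Preemptive acceleration

When it updates the thread waiting cycles, the BT detects the bottleneck with the highest thread waiting cycles, and that bottleneck is shipped to the large core. The BT sends a preempt signal that tells the small core to stop executing, push its architectural state onto the stack, and tell the large core to continue execution. The large core pops the architectural state from the stack and resumes execution. (I do not understand why this results in acceleration.)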

Three modifications provide Support for Multiple Large Core Contexts:

1. Each large core has its own Scheduling Buffer.

That is, every large core gets a Scheduling Buffer of its own.

2. Each bottleneck that is enabled for acceleration is assigned to a fixed large core context to preserve cache locality and avoid different large cores having to wait for each other on the same bottlenecks.

3. The preemptive mechanism is extended so that in case a bottleneck scheduled on small cores becomes the top bottleneck, and its number of executers is less than or equal to the number of large core contexts, the BT sends signals to preemptively migrate those threads to the large cores.
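The condition in the third point can be written down as a small check. This is only a sketch of my reading of it: the function name `threads_to_migrate` and the dict that maps a bid to the hardware threads executing it on small cores are hypothetical.

```python
def threads_to_migrate(top_bid, executers_on_small, num_large_contexts):
    """Return the hardware thread IDs to migrate preemptively, or [].

    executers_on_small: dict mapping a bid to the set of hardware thread
    IDs currently executing that bottleneck on small cores (assumed
    bookkeeping, not the paper's exact data structure).
    """
    executers = sorted(executers_on_small.get(top_bid, ()))
    # Migrate only if every executer fits into a large core context.
    if executers and len(executers) <= num_large_contexts:
        return executers
    return []


# Bottleneck "B7" is the top bottleneck and runs on small-core threads 3 and 5.
print(threads_to_migrate("B7", {"B7": {3, 5}}, num_large_contexts=2))  # [3, 5]
print(threads_to_migrate("B7", {"B7": {3, 5}}, num_large_contexts=1))  # []
```

With two large core contexts both executers can be moved; with only one, the sketch migrates nothing, which matches the "less than or equal to" condition.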

Implementation Details

1. Tracking Dependent and Nested Bottlenecks

Sometimes a thread has to wait for one bottleneck while it is executing another bottleneck.

Similar situations occur when bottlenecks are nested.

The thread waiting cycles should be attributed to the bottleneck that is the root cause of the wait. (That is, the waiting cycles should be charged to the bottleneck that is the root cause of the wait.)

Determining the root-cause bottleneck

To determine the bottleneck Bj that is the root cause of the wait for each bottleneck Bi, we need to follow the dependency chain between bottlenecks until a bottleneck Bj is found not to be waiting for a different bottleneck. (That is, to find the root cause Bj of a wait on bottleneck Bi, follow the dependency chain between bottlenecks, starting at Bi, until reaching a bottleneck Bj that is not itself waiting on another bottleneck.)

To follow the dependency chain we need to know (a) which thread is executing a bottleneck and (b) which bottleneck that thread is currently waiting for.

To know (a) we add an executer_vec bit vector on each BT entry that records all current executers of each bottleneck. (I do not fully understand this.) BT: the Bottleneck Table.

To know (b), we add a small Current Bottleneck Table associated with the BT and indexed with hardware thread ID that gives the bid that the thread is currently waiting for.
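Since the executer_vec part was the least clear to me, here is a small Python sketch of following the dependency chain. It is a software model only: plain dicts stand in for the executer_vec bit vectors and the Current Bottleneck Table, and the function name `root_cause` and the cycle guard are my own assumptions.

```python
def root_cause(bid, executer_vec, current_bottleneck):
    """Follow the dependency chain from bottleneck `bid` to its root cause.

    executer_vec: dict bid -> set of hardware thread IDs executing that
        bottleneck (models the bit vector kept on each BT entry).
    current_bottleneck: dict thread ID -> bid the thread is currently
        waiting for, or None (models the Current Bottleneck Table).
    """
    seen = set()
    while bid not in seen:
        seen.add(bid)
        next_bid = None
        # Look for an executer of this bottleneck that is itself waiting.
        for tid in sorted(executer_vec.get(bid, ())):
            waiting_for = current_bottleneck.get(tid)
            if waiting_for is not None and waiting_for != bid:
                next_bid = waiting_for
                break
        if next_bid is None:
            return bid  # nobody executing it is waiting: root cause found
        bid = next_bid
    return bid  # cycle guard (assumption): stop rather than loop forever


# Thread 1 executes B1 but waits for B2; thread 2 executes B2 and is not
# waiting; thread 3 waits for B1.
executers = {"B1": {1}, "B2": {2}}
waiting = {1: "B2", 2: None, 3: "B1"}
print(root_cause("B1", executers, waiting))  # B2
```

In this example the cycles that thread 3 spends waiting for B1 would be charged to B2, because the thread holding B1 is itself stuck behind B2.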

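Handling interrupts

The operating system may interrupt the cores. If a small core gets an interrupt while it is waiting for the large core to execute a bottleneck, it (the small core) does not service the interrupt until a BottleneckDone or BottleneckCallAbort is received. (I do not fully understand these two terms.)

If a large core gets an interrupt while accelerating a bottleneck, it aborts all bottlenecks in its Scheduling Buffer, finishes the current bottleneck, and then services the interrupt.

That is, a large core that is interrupted while accelerating a bottleneck aborts every bottleneck queued in its Scheduling Buffer, finishes the one it is currently executing, and only then handles the interrupt.

Transfer of Cache State to the Large Core

A bottleneck executing remotely on the large core may require data that resides in the small core, thereby producing cache misses that reduce the benefit of acceleration. Data Marshalling has been proposed to reduce these cache misses, by identifying and marshalling the cache lines required by the remote core.

That is, a bottleneck executing on the large core may need data that still sits in the small core's cache, which causes cache misses and reduces the benefit of acceleration. Data Marshalling was therefore proposed: it reduces these misses by identifying and marshalling the cache lines that are needed.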

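Figure 8(a) looks at iplookup under a fixed area budget. As the number of large cores increases, the speedup increases too, because the large cores can accelerate bottlenecks that are contended at the same time. The gain of BIS on the ACMP rises steadily from 10% with one large core to 25% with 2-way SMT large cores. The other example is Figure 8(b): for mysql-2, additional large cores reduce the performance of BIS and MC-ACS, because the benefit of accelerating multiple critical sections at the same time does not cover what the extra large cores cost.

The conclusion from this part of the analysis: the right number of large cores for bottleneck acceleration depends heavily on the number of threads, and the best performance also requires software tuned to the bottleneck behavior the hardware can accelerate, trading off running more threads at lower performance against fewer threads at higher performance.

Figure 9 shows the geometric mean of speedups across the benchmarks for different core configurations under the same area budget. An area budget of eight small cores allows only an ACMP configuration with a single large core. With larger area budgets, adding more large cores reduces performance, mainly because the benefit of the extra large cores does not make up for the parallel performance lost when the number of threads drops sharply.

When SMT is applied to the large cores, two effects show up. First, the data used by pipeline stages and critical sections mostly stays in cache, so our workloads rarely access main memory, and the main advantage of SMT on an out-of-order core is rarely exercised. Second, the threads on a core share its resources tightly. In summary, using SMT to increase the number of large core contexts available for bottleneck acceleration is not useful for BIS.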

Questions I ran into: what exactly does "parallel performance" mean?

What exactly does "area budget" mean?

What role do hardware threads play under a given area budget?

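For Figure 8 and Figure 9, I analyzed in more detail how acceleration with Multiple Large Core Contexts compares with ACMP using a single large core; working through these two examples gave me a deeper understanding of Multiple Large Core Contexts.

First, Figure 8(a): for iplookup with an area budget of 32 cores, the speedup improves noticeably as the number of large cores (LC) and SMT contexts grows. Because iplookup executes multiple critical sections that are contended at the same time, multiple large cores accelerate its bottlenecks effectively. The speedup of BIS over ACMP grows from 10% with a single large core to 25% with large 2-way SMT cores, as Figure 8(a) shows.

Figure 8(b) shows that for mysql-2, additional large cores slow down MC-ACS and BIS, because the number of threads is reduced substantially while the critical sections are not heavily contended. The takeaway is that the best speedup, and the best number of large cores, is determined jointly by the number of threads and by how much bottleneck acceleration helps.

Figure 9 shows the geometric mean of speedups for different core configurations under the same area budget, with area budgets ranging from 8 to 64 small cores. Its four panels show clearly how the number of cores affects the speedup. They also show that having SMT on the large cores does not change the performance of BIS much. Two factors determine how much SMT on the large cores helps; both were covered in last week's analysis. The final conclusion is that using SMT to provide more large core contexts for bottleneck acceleration is not very useful for BIS.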

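Questions: what does "equally critical bottlenecks" mean?

How are LC and SMT related?

How does the performance of a hardware configuration relate to the number of threads and the number of cores?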

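The paper introduces BIS, a cooperative software-hardware mechanism for identifying and accelerating bottlenecks.

BIS measures the number of cycles that threads wait at each bottleneck to determine which bottlenecks are actually hurting performance, and then accelerates those bottlenecks on one or more fast cores of an Asymmetric Chip Multiprocessor (ACMP).

Bottlenecks for application performance: critical sections, barriers, and slow pipeline stages, because these parts must all execute serially.

Bottleneck: a code segment that threads contend for.

Earlier asymmetric chip multiprocessor proposals lack generality and adaptivity, and accelerate only some kinds of bottleneck.

Accelerated Critical Sections approach: it accelerates only certain critical sections and cannot target the critical sections that hurt performance most.

Meeting Points approach: it accelerates the thread expected to take the longest in a parallel region, but applies only to static scheduling.

Feedback-Directed Pipelining approach: it assigns threads to cores to balance stage throughput on pipelined workloads and so improve performance or reduce power, but it adapts poorly to phase changes because it is implemented in software.

The BIS approach:

At run time, BIS finds the critical serializing bottlenecks by using the fact that they make other threads wait the longest, and then accelerates them with one or more large cores.

Software side: delimit potential bottlenecks with the BottleneckCall and BottleneckReturn instructions, and replace the code that waits for a bottleneck with a BottleneckWait instruction.

Hardware side: use these instructions to measure the thread waiting cycles caused by each bottleneck and record them in a Bottleneck Table. The bottlenecks with the most waiting cycles are selected for acceleration. When a small core executes a BottleneckCall instruction, it checks whether this bottleneck has been selected for acceleration. If it has, the bottleneck is placed in the large core's Scheduling Buffer; the large core executes it and notifies the small core when it reaches the BottleneckReturn instruction.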
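The hardware-side bookkeeping can be mimicked in a few lines of Python. This is a rough sketch of the idea, not the real Bottleneck Table: the class and method names, the string bids, and the policy of simply taking the top N entries are my own assumptions.

```python
from collections import defaultdict


class BottleneckTable:
    """Toy model of the BT: accumulates thread waiting cycles per bid."""

    def __init__(self, num_accelerated=1):
        self.waiting_cycles = defaultdict(int)
        # How many bottlenecks may be accelerated at once (assumption).
        self.num_accelerated = num_accelerated

    def record_wait(self, bid, cycles):
        # Called with the waiting cycles reported for bottleneck `bid`.
        self.waiting_cycles[bid] += cycles

    def accelerated(self):
        # Rank by waiting cycles, highest first; ties broken by bid.
        ranked = sorted(self.waiting_cycles,
                        key=lambda b: (-self.waiting_cycles[b], b))
        return set(ranked[:self.num_accelerated])

    def bottleneck_call(self, bid):
        # What a small core decides when it executes BottleneckCall bid.
        return "large" if bid in self.accelerated() else "small"


bt = BottleneckTable(num_accelerated=1)
bt.record_wait("lockA", 100)
bt.record_wait("lockB", 400)
bt.record_wait("lockA", 50)
print(bt.bottleneck_call("lockB"))  # large
print(bt.bottleneck_call("lockA"))  # small
```

Because the ranking is recomputed from the running totals, the accelerated set changes as the program moves between phases, which is the point of doing the identification at run time rather than in the source code.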

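What do "ACS for non-pipelined workloads" and "FDP for pipelined workloads" mean?

Examples of bottlenecks:

① Amdahl's serial portion: when there is only one thread, it is placed on the largest core and the other cores sit idle.

② Critical section: only one thread can execute a given critical section at any time.

③ Barrier: a thread that reaches a barrier must wait until all threads have reached it before they can continue together.

④ Pipeline stages: in a pipeline-parallel program, a loop is split into stages that run on different threads. Only accelerating the slowest stage improves performance. (I do not quite follow the figure: why is the throughput balanced once S2 is accelerated by 2x? Even at 2x, does it not still have to wait for S3 to finish?)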
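One way to look at the pipeline question is that steady-state throughput is set by the slowest stage alone. The sketch below uses made-up per-item stage times (they are not taken from the figure) and assumes one thread per stage.

```python
def pipeline_throughput(stage_times):
    """Items per time unit in steady state: limited by the slowest stage."""
    return 1.0 / max(stage_times.values())


# Hypothetical per-item times; S2 is the slowest stage.
stages = {"S1": 1.0, "S2": 4.0, "S3": 2.0}
before = pipeline_throughput(stages)

# Accelerate S2 by 2x, for example by running it on a large core.
accelerated = dict(stages, S2=stages["S2"] / 2)
after = pipeline_throughput(accelerated)

print(before, after)  # 0.25 0.5
```

With these numbers S2 drops from 4 to 2 time units per item and now matches S3, so neither stage holds the other back and throughput doubles. Accelerating S2 any further would not help, because S3 would then be the slowest stage.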

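How bottlenecks change over time:

Accelerating a program successfully at run time requires dynamically identifying which code segment is currently the critical bottleneck and accelerating it on a large core. (I did not understand the second figure at all.)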
