Category: LINUX
2011-03-14 22:06:01
Introductory storage articles (5)
Title: Aqueduct: online data migration with performance guarantees
Source: FAST '02
Authors: Chenyang Lu, Guillermo A. Alvarez, John Wilkes
Affiliations: Department of Computer Science, University of Virginia;
Storage and Content Distribution Department, Hewlett-Packard Laboratories
1. Problem formulation
The data to be migrated is accessed by client applications that continue to execute in the foreground in parallel with the migration. The inputs to the migration engine are a migration plan, a sequence of data moves to rearrange the data placement on the system from an initial state to the desired final state, and client application quality-of-service (QoS) demands – I/O performance specifications that must be met while migration takes place. Highly variable service times in storage systems (e.g., due to unpredictable positioning delays, caching, and I/O request reordering) and workload fluctuations on arbitrary time scales make it difficult to provide absolute guarantees, so statistical guarantees are preferable unless gross over-provisioning can be tolerated. The data migration problem is to complete the data migration in the shortest possible time that is compatible with maintaining the QoS goals.
The data to be migrated may be accessed by client applications while it is being moved.
The inputs to the migration engine are a migration plan (a sequence of data moves from the initial placement to the desired final one) and the client applications' QoS requirements.
Highly variable service times (caused by unpredictable positioning delays, caching, and I/O request reordering) and workload fluctuations make absolute guarantees difficult to provide, so statistical guarantees are preferred.
The goal is to complete the data migration in the shortest possible time while still meeting the QoS goals.
2. QoS contracts
A store is a logically contiguous array of bytes, such as a database table or a file system, typically gigabytes in size. Stores are accessed by streams; each store may have one or more streams. Stream granularity is user-defined, but usually corresponds to a particular application.
Global QoS guarantees bound the aggregate performance of I/Os from all client applications in the system, but do not guarantee the performance of any individual store or application. They are seldom sufficient for realistic application mixes, for access demands on different stores may be significantly different during migration. On the other hand, stream-level guarantees have the opposite difficulty: they can proliferate without bound, and so run the risk of scaling poorly due to management overhead.
Global QoS guarantees bound the aggregate I/O performance of all client applications in the system, but do not guarantee the performance of any individual store or application. They are seldom sufficient in practice, because the access demands on different stores can differ significantly during migration. Stream-level guarantees have the opposite problem: they can proliferate without bound and therefore scale poorly due to management overhead.
An intermediate level, and the one adopted by Aqueduct, is to provide store-level guarantees. (In practice, this has similar effects to stream-level guarantees for our real-life workloads because the data-gathering system we use to generate workload characterizations creates one stream for each store by default.) Let the average latency ALi of a store i in the workload be the average latency of I/Os directed to store i by client applications throughout the execution of a migration plan, and let the latency contract for store i be denoted LCi. The latency contract is expressed as a bounded average latency: it requires that ALi ≤ LCi for every store i.
Aqueduct provides store-level guarantees. ALi is the average latency of I/Os directed to store i by client applications throughout the execution of the migration plan, and LCi is the latency contract defined for store i. The contract requires ALi ≤ LCi for every store i.
In practice, such QoS contract specifications may be derived from application requirements (e.g., based on the timing constraints and buffer size of a media-streaming server), or specified by hand, or empirically derived from workload monitoring and measurements.
QoS contracts may be derived from application requirements, specified by hand, or derived empirically from workload monitoring and measurement.
We also monitor how often the latency bounds are violated over shorter time intervals than the entire migration, by dividing the migration into equal-sized sampling periods, each of duration W. Let M be the number of such periods needed for a given migration. Let the sampled latency Li(k) of store i be its average latency in the kth sampling period, which covers the time interval ((k-1)W, kW) since the start of the migration. We define the violation fraction VRi as the fraction of sampling periods in which QoS contract violations occur: VRi = |{k: Li(k) > LCi, k = 1,...,M}| / M.
W: the length of each sampling period
M: the number of sampling periods that the migration is divided into
Li(k): the average latency of store i during ((k-1)W, kW), i.e., the kth sampling period
VRi: the violation fraction, i.e., the fraction of sampling periods in which store i's sampled latency exceeds its contract (Li(k) > LCi); a small sketch follows below
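As a concrete illustration (this is my sketch, not code from the paper; the function name and example values are assumptions), a minimal Python computation of the violation fraction from the per-period sampled latencies of one store:

```python
# Minimal sketch: compute the violation fraction VRi for one store from
# its per-period sampled latencies Li(k) and its latency contract LCi.
def violation_fraction(sampled_latencies, latency_contract):
    """sampled_latencies[k-1] is Li(k); latency_contract is LCi."""
    M = len(sampled_latencies)          # number of sampling periods
    violations = sum(1 for L in sampled_latencies if L > latency_contract)
    return violations / M               # VRi

# Example: 2 of 5 periods exceed a 20 ms contract, so VRi = 0.4
print(violation_fraction([15.0, 22.0, 18.0, 25.0, 19.0], 20.0))
```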
3. Summary
It does so by dynamically adjusting the speed of data migration to maintain the desired QoS goals while maximizing the achieved data migration rate, using periodic measurements of the storage system's performance as perceived by the client applications. It guarantees that the average I/O latency throughout the execution of a migration will be bounded by a prespecified QoS contract.
Aqueduct dynamically adjusts the speed of migration to maintain the desired QoS goals while maximizing the achieved migration rate, using periodic measurements (over the sampling period W mentioned above) of the storage system's performance as perceived by the client applications.
The focus in this paper is on providing latency guarantees because (1) our early work showed that bounds on latency are considerably harder to enforce than bounds on throughput – so a technique that could bound latency would have little difficulty with throughput; and (2) the primary beneficiaries of QoS guarantees are customer-facing applications, for which latency is a primary criterion.
Reasons for providing latency guarantees rather than throughput guarantees:
1. Bounds on latency are considerably harder to enforce than bounds on throughput, so a technique that can bound latency has little difficulty bounding throughput.
2. The primary beneficiaries of QoS guarantees are customer-facing applications, for which latency is the primary criterion.
4. Related work
The traditional approach is to perform backups and data migration while the system is idle.
But existing logical volume managers (e.g., the HP-UX logical volume manager, LVM [22], and the Veritas Volume Manager, VxVM [27]) have long been able to provide continuing access to data while it is being migrated. This is achieved by creating a mirror of the data to be moved, with the new replica in the place where the data is to end up. The mirror is then silvered – the replicas made consistent by bringing the new copy up to date – after which the original copy can be disconnected and discarded. Aqueduct uses this trick, too. However, we are not aware of any existing solution that bounds the impact of migration on client applications while this is occurring in terms that relate to their performance goals. Although VxVM provides a parameter, vol_default_iodelay, that is used to throttle I/O operations for silvering, it is applied regardless of the state of the client application. High-end disk arrays (e.g., the HP Surestore Disk Array XP512) provide restricted support for online data migration [14]: the source and destination devices must be identical Logical Units (LUs) within the same array, and only global, device-level QoS guarantees such as bounds on disk utilization are supported. Some commercial video servers [16] can restripe data online when disks fail or are added, and provide guarantees for the specific case of highly-sequential, predictable multimedia workloads. Aqueduct does not make any assumptions about the nature of the foreground workloads, nor about the devices that comprise the storage subsystem; it provides device-independent, application-level QoS guarantees.
Existing logical volume managers such as LVM and VxVM can already provide uninterrupted access to data while it is being migrated. They do so by creating a mirror of the data to be moved, with the new replica placed at the destination; the mirror is then silvered (the new copy brought up to date and made consistent), after which the original copy is disconnected and discarded. Aqueduct uses the same technique. However, no existing solution bounds the impact of migration on client applications in terms of their performance goals. VxVM does provide a throttling parameter (vol_default_iodelay), but it is applied regardless of the state of the client application. High-end disk arrays provide only limited support for online data migration: the source and destination devices must be in the same array, and only global, device-level QoS guarantees are supported. Aqueduct instead provides device-independent, application-level QoS guarantees.
Existing storage management products such as the HP OpenView-Performance Manager can detect the presence of performance hot spots in the storage system when things are going wrong, and notify system administrators about them – but it is still up to humans to decide how to best solve the problem. In particular, there is no automatic throttling system that might address the root cause once it has been identified.
Existing storage management products such as HP OpenView Performance Manager can detect performance hot spots in the storage system and notify administrators, but it is still up to humans to decide how to solve the problem; there is no automatic throttling to address the root cause.
Although Aqueduct eagerly uses excess system resources in order to minimize the length of the migration, it is in principle possible to achieve zero impact on the foreground load by applying idleness-detection techniques [9] to migrate data only when the foreground load has temporarily stopped. Douceur and Bolosky [7] developed a feedback-based mechanism called MS Manners that improves the performance of important tasks by regulating the progress of low-importance tasks. MS Manners cannot provide guarantees to important tasks because it only takes as input feedback on the performance of the low-importance tasks. In contrast, Aqueduct provides performance guarantees to applications (i.e., the “important tasks”) by directly monitoring and controlling their performance.
Aqueduct eagerly uses spare system resources to minimize migration time; in principle, idleness-detection techniques could achieve zero impact by migrating data only when the foreground load has temporarily stopped. Douceur and Bolosky developed a feedback-based mechanism (MS Manners) that regulates the progress of low-importance tasks, but it cannot give guarantees to important tasks because it only takes feedback on the low-importance tasks as input. In contrast, Aqueduct provides performance guarantees to applications by directly monitoring and controlling their performance.
There has been substantial work on fair scheduling techniques since their inception [23]. In principle, it would be possible to schedule migration and foreground I/Os at the volume manager level without relying on an external feedback loop. However, real-world workloads are complicated and have multiple, nontrivial properties such as sequentiality, temporal locality, self-similarity, and burstiness. How to assign relative priorities to migration and foreground I/Os under these conditions is an open problem. For example, a simple 1-out-of-n scheme may work if the foreground load consists of random I/Os, but may cause a much higher than expected interference if foreground I/Os were highly sequential. Furthermore, any non-adaptive scheme is unlikely to succeed: application behaviors vary greatly over time, and failures and capacity additions occur very frequently in real systems. Fair scheduling based on dynamic priorities has worked reasonably well for CPU cycles; but priority computations remain an ad hoc craft, and the mechanical properties of disks plus the presence of large caches result in strong nonlinear behaviors that invalidate all but the most sophisticated latency predictions.
There is a substantial body of work on fair scheduling. In principle, migration and foreground I/Os could be scheduled at the volume-manager level without an external feedback loop. However, real-world workloads are complex, with properties such as sequentiality, temporal locality, self-similarity, and burstiness, and how to assign relative priorities to migration and foreground I/Os under these conditions is an open problem. For example, a simple 1-out-of-n scheme may work when the foreground load consists of random I/Os, but causes much higher interference when the foreground I/Os are highly sequential. Moreover, any non-adaptive scheme is unlikely to succeed, since application behavior varies greatly over time.
Recently, control theory has been explored in several computer system projects. Li and Nahrstedt [18] utilized control theory to develop a feedback control loop to guarantee the desired network packet rate in a distributed visual tracking system. Hollot et al. [11] applied control theory to analyze a congestion control algorithm on IP routers. While these works apply control theory on computing systems, they focus on managing the network bandwidth instead of the performance of end servers.
Control theory has recently been explored in several computer-system projects. Li and Nahrstedt used control theory to build a feedback control loop that guarantees the desired network packet rate in a distributed visual tracking system. Hollot et al. applied control theory to analyze a congestion-control algorithm on IP routers. These works, however, focus on managing network bandwidth rather than the performance of end servers.
5. Aqueduct
Aqueduct divides each store into small, fixed-size sub-stores that are migrated one at a time, in steps called submoves. This allows relatively fine control over the migration speed, as substores are relatively small: we chose 32MB as a reasonable compromise between management overheads and control finesse. Nonetheless, despite the overheads, this implementation allowed us to evaluate the key part of the Aqueduct architecture – the feedback control loop – which was the primary point of this exercise.
Aqueduct divides each store into small, fixed-size substores, each migrated in one step called a submove; 32 MB was chosen as a compromise between management overhead and control finesse. The feedback control loop is the core of the architecture. A small sketch of the substore division follows below.
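A minimal sketch of that division (the function name and the byte-offset view of a store are my assumptions for illustration, not the paper's implementation):

```python
# Minimal sketch: divide a store into fixed-size substores, each migrated
# in one submove. 32 MB is the compromise value chosen in the paper.
SUBSTORE_SIZE = 32 * 1024 * 1024   # 32 MB

def substore_ranges(store_size_bytes):
    """Yield (offset, length) pairs covering the whole store, one per submove."""
    offset = 0
    while offset < store_size_bytes:
        length = min(SUBSTORE_SIZE, store_size_bytes - offset)
        yield (offset, length)
        offset += length
```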
The Aqueduct monitor component is responsible for collecting the sampled latency of each store at the end of each sampling period, and feeding these results to the controller. We were able to extract this data directly from an output file periodically generated by our workload generation tool; but it could also have been obtained from other existing performance monitoring tools.
The monitor component collects the sampled latency of each store at the end of every sampling period and feeds the results to the controller. In the prototype this data is extracted from an output file periodically generated by the workload-generation tool, but it could also be obtained from other performance-monitoring tools.
The controller compares the sampled latencies for the time window ((k-1)W, kW) with the QoS contract, and computes the submove rate Rm(k) (the control input) to be used during the next sampling period (kW, (k+1)W). Intuitively, Aqueduct should slow down data migration when some sampled store latencies are larger than their corresponding contracts, and speed up data migration when latencies are smaller than the corresponding contracts for all stores.
The controller compares the sampled latencies with the QoS contracts and computes the submove rate Rm(k) to be used in the next sampling period: when some latencies exceed their contracts the migration is slowed down, and when all latencies are below their contracts it is sped up.
1) For each store i (0 ≤ i < N) in the system, compute its error
Ei(k) = P∗LCi - Li(k),
where P (0 < P < 1) is a configurable parameter, and P∗LCi is called the reference in control theory. More negative values of Ei(k) represent larger latency violations.
For each store, the error is P times the contracted latency minus the latency actually sampled in the period just finished: Ei(k) = P*LCi - Li(k); more negative values mean larger violations.
2) Find the smallest (i.e., potentially most negative) error Emin(k) among all stores:
Emin(k) = min{Ei(k)| 0 ≤ i < N};
thus taking account of the worst contract violation observed.
Find the smallest (most negative) error among all stores, i.e., the worst contract violation observed.
3) Compute the submove-rate according to the integral control function (K is another configurable parameter of the controller):
Rm(k) = Rm(k-1) + K∗Emin(k);
4) Notify the actuator of the new submove rate Rm(k).
Overall: find the store whose sampled latency deviates worst from its (scaled) contract, and adjust the migration rate using that worst-case error; a minimal sketch of these steps follows below.
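A minimal Python sketch of the four controller steps above (the function name, the non-negative clamp, and the example values of P and K are my assumptions, not the paper's):

```python
# Minimal sketch of one controller invocation at the end of sampling period k.
def controller_step(sampled_latencies, latency_contracts, rate_prev, P=0.85, K=0.1):
    """sampled_latencies[i] = Li(k), latency_contracts[i] = LCi,
    rate_prev = Rm(k-1); P and K are the configurable controller parameters."""
    # 1) per-store error: Ei(k) = P*LCi - Li(k)
    errors = [P * lc - li for li, lc in zip(sampled_latencies, latency_contracts)]
    # 2) worst (most negative) error: Emin(k)
    e_min = min(errors)
    # 3) integral control: Rm(k) = Rm(k-1) + K*Emin(k)
    rate = rate_prev + K * e_min
    # 4) hand the new submove rate to the actuator
    return max(rate, 0.0)   # assumption: clamp so the rate never goes negative
```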
Because the control input Rm(k) is computed from the Ei(k) corresponding to the worst violation, it forces the system to satisfy its latency goals by arranging for Emin to converge to zero. Thanks to random workload variations, store I/O latencies will typically oscillate around the reference value, so instead of choosing the actual latency target LCi as the reference, the controller uses a slightly smaller target: P∗LCi. The value of P is related to the burstiness of the workload: the more bursty a workload is, the smaller P should be, to give the controller enough leeway to avoid contract violations. On the other hand, overly small values of P will result in an overly conservative controller, and therefore slow down migration. In our experiments, we observed that a P between 0.8 and 0.9 was sufficient to achieve satisfactory violation fractions for significantly different workloads.
Because the control input is computed from the worst error, Emin converges to zero and the latency goals are satisfied. Since random workload variation makes I/O latencies oscillate around the reference, the controller uses a slightly smaller reference, P*LCi, instead of LCi itself; the burstier the workload, the smaller P should be (values between 0.8 and 0.9 worked well in the experiments).
Parameter K needs to be tuned to achieve stability (i.e., to prevent the submove rate and sampled latencies from oscillating excessively) and short settling time (i.e., fast convergence of the output to the reference). This can be done using systematic, standard control theory techniques. An example is provided in Section 5.1. A similar tuning method was described in detail, and applied to a real-time CPU scheduler in [20]. Aqueduct could be extended in a fairly straightforward way to set (and adjust) K automatically, using an on-line estimation of the gain [5] in order to handle different categories of workloads without the need for pre-computed parameter values.
K is tuned to achieve stability and a short settling time using standard control-theory techniques; Aqueduct could also be extended to set and adjust K automatically via on-line gain estimation.
The last module in Figure 1 is the actuator. It executes a migration plan at the submove rate computed by the controller. During the sampling period (kW, (k+1)W), the actuator enforces the submove rate Rm(k) by sleeping for (W/Rm(k) - Tj) time units between the end of submove j and the start of the next, where Tj is the time it took to complete submove j.
The actuator controls the migration speed by sleeping for (W/Rm(k) - Tj) between the end of submove j and the start of the next, where Tj is the time submove j took; a small sketch follows below.
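A minimal sketch of that pacing rule (the function and the overrun check are assumptions added for illustration):

```python
import time

# Minimal sketch: pace submoves so that roughly Rm(k) of them complete per
# sampling period of length W, by sleeping W/Rm(k) - Tj after each submove.
def run_submoves(submoves, rate, W):
    """submoves: callables that each migrate one substore; rate = Rm(k) > 0."""
    for do_submove in submoves:
        start = time.time()
        do_submove()                 # migrate one substore
        Tj = time.time() - start
        sleep_time = W / rate - Tj
        if sleep_time > 0:           # assumption: skip sleeping if the
            time.sleep(sleep_time)   # submove overran its slot
```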
6. Future research directions
Better integration with performance-monitoring tools
Finer-grained control of the migration speed
Adaptation to different workloads
A new control loop that simultaneously controls latency and the granularity of variation