Category: LINUX
2011-03-03 10:40:37
Introductory Storage Articles (4)
Title: High Performance Multi-Node File Copies and Checksums for Clustered File Systems
Source: 1994 IEEE.
Authors: Paul Z. Kolano, Robert B. Ciotti
Affiliation: NASA Advanced Supercomputing Division, NASA Ames Research Center, M/S 258-6, Moffett Field, CA 94035 U.S.A.
Copy and checksum tools
The standard cp and md5sum tools of GNU coreutils [11] found on every modern Unix/Linux system, however, utilize a single execution thread on a single CPU core of a single system, hence cannot take full advantage of the increased performance of clustered file systems.
In other words, the standard cp and md5sum tools from GNU coreutils run a single thread on a single CPU core of a single system, so they cannot fully exploit the performance of a clustered file system.
Optimization techniques
Multithreading is used to ensure that nodes are kept as busy as possible. Read/write parallelism allows individual operations of a single copy to be overlapped using asynchronous I/O. Multi-node cooperation allows different nodes to take part in the same copy/checksum. Split file processing allows multiple threads to operate concurrently on the same file. Finally, hash trees allow inherently serial checksums to be performed in parallel.
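Multithreading keeps the nodes as busy as possible.
Read/write parallelism lets the reads and writes of a single copy overlap through asynchronous I/O.
Multi-node cooperation lets different nodes take part in the same copy or checksum.
Split file processing lets multiple threads operate on the same file concurrently.
Hash trees...?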
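On the hash tree question: the idea can be illustrated with a minimal sketch. The file is cut into fixed-size chunks, each chunk is hashed independently (so the leaves can be computed in parallel), and the leaf digests are then combined pairwise up to a single root. The function names, the chunk size, and the binary tree layout below are assumptions made for illustration; the paper's tools may arrange the tree differently. Note that the root is not equal to the plain MD5 of the whole file, so both sides of a comparison must use the same chunk size.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def leaf_digests(data, chunk_size):
    """Hash each fixed-size chunk independently; the chunks can be hashed in parallel."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: hashlib.md5(c).digest(), chunks))


def tree_root(digests):
    """Combine digests pairwise, level by level, until one root digest remains."""
    level = digests
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            # An odd node at the end of a level is hashed on its own.
            next_level.append(hashlib.md5(b"".join(level[i:i + 2])).digest())
        level = next_level
    return level[0]


def tree_checksum(data, chunk_size=1 << 20):
    """Hypothetical helper: hex root of a binary hash tree over the data."""
    return tree_root(leaf_digests(data, chunk_size)).hex()
```

With a single chunk the root is simply the MD5 of the data; with more chunks, only the small combining step at the end remains serial.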
File Copy Optimization
1.Multi-Threaded Parallelism
The multi-threaded modifications to the cp command of GNU coreutils [11] utilize three thread types as shown in Figure 1 implemented via OpenMP [7]. A single traversal thread operates like the original cp program, but when a regular file is encountered, a copy task is pushed onto a shared task queue instead of performing the copy. Mutual exclusivity of all queues discussed is provided by semaphores based on OpenMP locks. Before setting properties of the file, such as permissions, the traversal thread waits until an open notification is received on a designated open queue, after which it will continue traversing the source tree.
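mcp improves on the cp command by using three types of threads. A single traversal thread behaves like the original cp program, except that when it encounters a regular file it pushes a copy task onto a shared task queue instead of performing the copy itself. Mutual exclusion on the queues is provided by semaphores built on OpenMP locks. The traversal thread sets the file's properties only after it receives a notification on the designated open queue, and then continues traversing the source tree.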
One or more worker threads wait for tasks on the task queue. After it receives a task, each worker opens the source and target files, pushes a notification onto the open queue, then reads/writes the source/target until done. When stats are enabled, the worker pushes the task (with embedded stats) onto a designated stat queue and then waits for another task. The stat queue is processed by the stat thread, which prints the results of each copy task.
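One or more worker threads wait for tasks on the task queue. On receiving a task, a worker opens the source and target files, pushes a notification onto the open queue, and then reads the source and writes the target until done. When stats are enabled, the worker pushes the task (with its embedded stats) onto a designated stat queue and then waits for another task. The stat queue is processed by the stat thread, which prints the results of each copy task.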
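The interaction between the three thread types can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's `threading` and `queue` modules in place of OpenMP, the name `threaded_copy` is hypothetical, the stat thread collects results instead of printing them, and error handling is omitted (a worker that fails to open a file would leave the traversal thread waiting).

```python
import os
import queue
import shutil
import threading


def threaded_copy(src_root, dst_root, n_workers=4):
    tasks = queue.Queue()    # copy tasks pushed by the traversal thread
    opened = queue.Queue()   # "target file is open" notifications
    stats = queue.Queue()    # finished tasks with their stats
    results = []

    def worker():
        while True:
            task = tasks.get()
            if task is None:              # sentinel: no more work
                break
            src, dst = task
            with open(src, "rb") as fin, open(dst, "wb") as fout:
                opened.put(dst)           # tell the traversal thread the target exists
                shutil.copyfileobj(fin, fout)
            stats.put((src, os.path.getsize(dst)))

    def stat_thread():
        while True:
            item = stats.get()
            if item is None:
                break
            results.append(item)          # a real tool would print the stats here

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    reporter = threading.Thread(target=stat_thread)
    for t in workers + [reporter]:
        t.start()

    # Traversal runs in the calling thread, like the original cp walk.
    for dirpath, _dirnames, filenames in os.walk(src_root):
        target_dir = os.path.join(dst_root, os.path.relpath(dirpath, src_root))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            tasks.put((src, dst))
            opened.get()                  # wait for the open notification
            shutil.copymode(src, dst)     # then set properties such as permissions

    for _ in workers:
        tasks.put(None)
    for t in workers:
        t.join()
    stats.put(None)
    reporter.join()
    return results
```

Because the traversal thread waits for one open notification per task it pushes, the notification it receives always belongs to the file it is about to set permissions on.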
2.Single File Parallelization
(Note: this addresses the case where there are too few files to keep all the threads busy.)
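A minimal sketch of split file processing, assuming a POSIX system: the file is divided into fixed-size splits, and each split is copied by a separate thread using positional reads and writes, so the threads never share a file offset. The names `copy_range` and `split_copy` and the default sizes are hypothetical and not taken from the paper.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def copy_range(src_fd, dst_fd, offset, length, bufsize=1 << 20):
    """Copy bytes [offset, offset + length) using positional I/O."""
    end = offset + length
    while offset < end:
        buf = os.pread(src_fd, min(bufsize, end - offset), offset)
        if not buf:                       # unexpected end of file
            break
        written = 0
        while written < len(buf):         # pwrite may write fewer bytes than asked
            written += os.pwrite(dst_fd, buf[written:], offset + written)
        offset += len(buf)


def split_copy(src, dst, split_size=4 << 20, n_threads=4):
    """Copy one file with several threads, one split per task."""
    size = os.path.getsize(src)
    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(dst_fd, size)        # give the target its final size up front
        with ThreadPoolExecutor(n_threads) as pool:
            futures = [
                pool.submit(copy_range, src_fd, dst_fd, off, min(split_size, size - off))
                for off in range(0, size, split_size)
            ]
            for fut in futures:
                fut.result()              # re-raise any error from a worker
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```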
One issue is that file copies generally exhibit poor buffer cache utilization since file data is read once, but then never accessed again. This increases CPU workload by the kernel and decreases performance of other I/O as it thrashes the buffer cache.
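File copies make poor use of the buffer cache because the file data is read only once. This increases the kernel's CPU load, and the constant turnover of cache contents degrades the performance of other I/O.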
The first approach is to use file advisory information via the posix_fadvise() function, which allows programs to inform the kernel about how it will access data read/written from/to a file. Since mcp only uses data once, it advises the kernel to release the data as soon as it is read/written. The second approach is to skip the buffer cache entirely using direct I/O. In this case, all reads and writes go direct to disk without ever touching the buffer cache.
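The first approach uses posix_fadvise() to tell the kernel that the data is used only once and can be dropped from the cache as soon as it has been read or written. The second approach uses direct I/O, which bypasses the buffer cache altogether.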
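The first approach can be sketched with Python's `os.posix_fadvise`, which wraps the same system call. This is an illustration under assumptions, not mcp's actual code: the function name is hypothetical, the advice is only a hint that the kernel may ignore, and the target is synced before the final advice because dirty pages cannot be released until they have been written back. Direct I/O is not shown, since it requires aligned buffers and transfer sizes.

```python
import os


def copy_dropping_cache(src, dst, bufsize=1 << 20):
    """Copy a file, advising the kernel to drop cached pages once they are used."""
    can_advise = hasattr(os, "posix_fadvise")   # not available on every platform
    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while True:
            buf = os.read(src_fd, bufsize)
            if not buf:
                break
            written = 0
            while written < len(buf):           # os.write may be partial
                written += os.write(dst_fd, buf[written:])
            if can_advise:
                # The source range just read will not be needed again.
                os.posix_fadvise(src_fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
            offset += len(buf)
        os.fsync(dst_fd)                        # flush dirty pages to disk first
        if can_advise:
            # Length 0 means "to the end of the file".
            os.posix_fadvise(dst_fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```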
4.Read/Write Parallelism
Through the use of double buffering, it is possible to exploit additional parallelism between reads of one section and writes of another.
With double buffering, the read of one file section can overlap with the write of another.
The main difference is with the write of each file section. Instead of using a standard blocking write, an asynchronous write is triggered via aio_write(), which returns immediately. The read of the next section of the file cannot use the same buffer as it is still being used by the previous asynchronous write, so a second buffer is used. During the read, a write is also being performed, thereby theoretically reducing the original time to read each section from time(read) + time(write) to max(time(read), time(write)). After the read completes, the worker thread blocks until the write is finished (if not already done by that point) and the next cycle begins.
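The difference lies in how each file section is written: instead of a standard blocking write, an asynchronous write is started via aio_write(), which returns immediately. The read of the next section cannot reuse the buffer that the previous asynchronous write is still using, so a second buffer is used.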
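The read/write cycle can be sketched as below. Python's standard library has no aio_write(), so this sketch substitutes a single background writer thread for the asynchronous write; the two-buffer alternation and the "block until the previous write is finished" step follow the description above. The names `overlapped_copy` and `write_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def write_all(fout, view):
    """Write the whole buffer, looping because unbuffered writes may be partial."""
    written = 0
    while written < len(view):
        written += fout.write(view[written:])


def overlapped_copy(src, dst, bufsize=1 << 20):
    """Copy a file while overlapping the read of one section with the write of the previous one."""
    buffers = [bytearray(bufsize), bytearray(bufsize)]
    with open(src, "rb", buffering=0) as fin, \
         open(dst, "wb", buffering=0) as fout, \
         ThreadPoolExecutor(max_workers=1) as writer:
        pending = None   # the write still in flight, if any
        current = 0      # index of the buffer to read into
        while True:
            buf = buffers[current]
            n = fin.readinto(buf)        # runs while the other buffer is being written
            if pending is not None:
                pending.result()         # block until the previous write has finished
            if not n:                    # end of file
                break
            pending = writer.submit(write_all, fout, memoryview(buf)[:n])
            current ^= 1                 # switch to the other buffer
```

A buffer is reused for reading only after the write that used it has been waited on, which is the reason two buffers are enough.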
5.Multi-Node Parallelism
In the multi-node TCP model, one node is designated as the manager node and parcels out copy tasks to worker nodes. The manager node is the only node that runs a traversal thread and stat thread. Both types of nodes have some number of worker threads as in the multi-threaded case. In addition, each node runs a TCP thread that is responsible for handling TCP-related activities, whose behavior is shown in Figure 4. The manager TCP thread waits for connections from worker TCP threads. A connection is initiated by a worker TCP thread whenever a worker thread on the same node is idle. If the worker previously completed a task, its stats are forwarded to the manager stat thread via the manager TCP thread. In all cases, the manager thread pops a task from the task queue and sends it back to the worker TCP thread, where it is pushed onto the local task queue for worker threads.
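In the multi-node TCP model, one node is designated as the manager node and distributes copy tasks to the worker nodes. The manager node is the only node that runs a traversal thread and a stat thread.
Both types of nodes run some number of worker threads. In addition, every node runs a TCP thread that handles TCP-related activity. The manager TCP thread waits for connections from worker TCP threads, and a worker TCP thread initiates a connection whenever a worker thread on its node is idle. If that worker previously completed a task, its stats are forwarded to the manager stat thread via the manager TCP thread. The manager thread then pops a task from the task queue and sends it to the worker TCP thread, which pushes it onto the local task queue.
In this setting, security and consistency also need to be considered.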
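The request/response pattern between manager and workers can be sketched on a single machine, with threads standing in for nodes. Everything about the wire format here is an assumption for illustration: one JSON line per message, a preloaded task queue instead of a live traversal thread, a `null` task to signal that no work remains, and no authentication. The paper does not specify these details in the passage above.

```python
import json
import queue
import socket
import threading


def manager(server_sock, tasks, stats_out, n_workers):
    """Manager TCP thread: collect stats from each connection and answer with a task."""
    finished = 0
    while finished < n_workers:
        conn, _ = server_sock.accept()
        with conn, conn.makefile("rw") as f:
            msg = json.loads(f.readline())
            if msg.get("stats") is not None:
                stats_out.append(msg["stats"])   # forward to the stat thread's data
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                task = None                       # tells this worker to stop
                finished += 1
            f.write(json.dumps({"task": task}) + "\n")
            f.flush()


def worker(addr, name):
    """Worker TCP thread: connect whenever idle, report the last task, fetch the next."""
    stats = None
    while True:
        with socket.create_connection(addr) as s, s.makefile("rw") as f:
            f.write(json.dumps({"stats": stats}) + "\n")
            f.flush()
            task = json.loads(f.readline())["task"]
        if task is None:
            return
        stats = {"worker": name, "task": task}    # a real worker would copy the file here


def run_demo(task_list, n_workers=3):
    tasks = queue.Queue()
    for t in task_list:
        tasks.put(t)
    stats = []
    server = socket.create_server(("127.0.0.1", 0))   # port 0: let the OS pick one
    addr = server.getsockname()
    m = threading.Thread(target=manager, args=(server, tasks, stats, n_workers))
    m.start()
    ws = [threading.Thread(target=worker, args=(addr, "w%d" % i)) for i in range(n_workers)]
    for w in ws:
        w.start()
    for w in ws:
        w.join()
    m.join()
    server.close()
    return stats
```

Since any host that can reach the manager's port could request tasks in this sketch, the note above about security applies directly.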
File Checksum Optimization
1.Multi-Threaded Parallelism
The traditional approach to verifying integrity is to checksum the file at both the source and target and ensure that the values match. Checksums are inherently serial, however, so many of the techniques of the previous sections cannot be applied to any but the most trivial checksum algorithms.
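The traditional approach is to checksum the file at both the source and the target and then verify that the two values match. (?)
2.Read/Hash Parallelism
As with read/write parallelism, double buffering is used so that reads and hash computation proceed in parallel.
Multi-Node Parallelism: same as for copy.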
Verified File Copy Optimization
4.Buffer Reuse
In a typical integrity-verified copy, a file is checksummed at the source, copied, and then checksummed again at the destination to gain assurance that the bits at the source were copied accurately to the destination. This process normally requires two reads at the source since the checksum and copy programs are traditionally separate so each must access the data independently. Adding checksum functionality into the copy portion eliminates one of the reads to increase performance. Mcp incorporates checksums for this reason. This processing is similar to Figure 5 except the buffer is written between the read and the hash computation.
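In a typical integrity-verified copy, the file is checksummed at the source, copied, and then checksummed again at the target. This normally requires two reads of the source, because the checksum and copy steps are separate programs. Building the checksum into the copy removes one of those reads and improves performance.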
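A minimal sketch of buffer reuse, assuming MD5 as the checksum and hypothetical function names: each buffer read from the source is written to the target and then fed to the hash, so the source is read only once. Verification then needs just one more read, of the target.

```python
import hashlib


def copy_with_checksum(src, dst, bufsize=1 << 20):
    """Copy src to dst and return the source checksum computed from the same buffers."""
    digest = hashlib.md5()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(bufsize)
            if not buf:
                break
            fout.write(buf)      # write the buffer first ...
            digest.update(buf)   # ... then hash the same buffer, with no second read
    return digest.hexdigest()


def file_checksum(path, bufsize=1 << 20):
    """Checksum a file on its own, as needed for the target side."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for buf in iter(lambda: f.read(bufsize), b""):
            digest.update(buf)
    return digest.hexdigest()
```

A verified copy then compares `copy_with_checksum(src, dst)` with `file_checksum(dst)` and treats a mismatch as a failed copy.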