Linux relayfs research notes - wxju168 - ChinaUnix blog

Linux relayfs research notes

Category: LINUX

2008-12-24 10:15:42

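How relay, the Linux data transfer mechanism, works

relay is an efficient mechanism for transferring data from the Linux kernel to user space. Through user-defined relay channels, kernel code can pass data to user space efficiently, reliably, and conveniently. relay is particularly suited to cases where the kernel has large volumes of data to hand to user space, and it is already widely used in kernel debugging tools such as SystemTap.

The problem relay solves

When large amounts of data must be buffered in the kernel and transferred to user space, many traditional methods reach their limits, for example the printk() call familiar to kernel programmers. In addition, if every kernel subsystem develops its own buffering and transfer code, the result is a great deal of redundancy and a maintenance burden.

These needs called for a mechanism that forwards data from kernel space to user space efficiently and reliably, and that is independent of any individual debugging subsystem. relayFS was created to fill that role.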

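History of relay

relay's predecessor is relayFS, which was introduced as a new type of Linux file system. The first version of relayFS was developed in March 2003, and the first version for the 2.6 kernel became available on July 14 of that year. After extensive trial use and improvement, relayFS was merged into the mainline kernel (2.6.14) in September 2005. relayFS was also ported to the 2.4 kernel. In February 2006, starting with 2.6.17, relayFS stopped being a separate file system and became part of the kernel proper. Its source moved from the fs/ directory to kernel/relay.c, and its name changed from relayFS to relay.

relayFS is used by a growing number of kernel tools, including the kernel debugging tools SystemTap and LTT, as well as some special file systems such as DebugFS.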

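Basic principles of relay

relay provides a mechanism that lets kernel code transfer large amounts of data to user space efficiently through user-defined relay channels.

A relay channel consists of a set of kernel buffers, one per CPU. These are called relay buffers, and each one is represented in user space by a regular file, called a relay file. Kernel-space clients write data through the API that relay provides, and the data is automatically written to the relay buffer that corresponds to the current CPU id. From user space these buffers look like a set of ordinary files that can be read directly with read() or mapped with mmap(). relay does not care about the format or content of the data; that is entirely up to the program that uses relay. The goal of relay is to provide an interface simple enough that the basic operations are as efficient as possible.

relay decouples writing from reading, so that a burst of writes is not limited by the comparatively slow reads in user space, which greatly improves efficiency. relay acts as the bridge between writer and reader: it buffers the data written by kernel clients and forwards it to user-space programs. This forwarding is where the name relay comes from.

In the example setup described here, the relay channel consists of four relay buffers (kbuf0 to kbuf3), corresponding to cpu0 to cpu3 in the system. Code running on each CPU calls relay_write() to write data into its own relay buffer. Each relay buffer appears as one relay file, /cpu0 to /cpu3. Once the file system is mounted at /mnt/, these relay files can be mapped into the address space of a user-space program. As soon as data is available, the user program can read it out and write it to files on disk, cpu0.out to cpu3.out.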

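Main relay APIs

1. User-space API:

These interfaces give user-space programs the basic operations for accessing the data in relay channel buffers:

open() - lets the user open an existing channel buffer.

mmap() - maps the channel buffer into the address space of the calling user-space process. Note that partial mappings are not allowed: the entire buffer file must be mapped, and its size is the number of sub-buffers multiplied by the sub-buffer size.

read() - reads the contents of the channel buffer. Once data has been read, it has been consumed by the user-space program and will not be seen by later reads.

sendfile() - transfers data from the channel buffer to an output file descriptor. Any padding bytes are removed automatically and are never seen by the user.

poll() - supports the POLLIN/POLLRDNORM/POLLERR events. Waiting user-space programs are notified each time a sub-buffer boundary is crossed.

close() - decrements the channel buffer's reference count. When the count reaches 0, meaning no process or kernel client has the buffer open, the channel buffer is freed.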
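The poll() and read() operations listed above combine naturally into a drain loop. The following is a minimal user-space sketch, not code from relay itself: the function name drain_fd and the timeout parameter are illustrative assumptions, and the input descriptor is assumed to refer to a per-CPU relay file (for example one opened from a hypothetical path such as /mnt/relay/cpu0). Because the loop uses only standard POSIX calls, it can be tried on any readable descriptor, such as a pipe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Copy everything that becomes readable on in_fd to out_fd.
 * Returns the number of bytes copied, or -1 on error.
 * Stops when read() reports no more data or when poll() times out.
 * drain_fd and timeout_ms are illustrative names, not part of relay. */
long drain_fd(int in_fd, int out_fd, int timeout_ms)
{
    char chunk[4096];
    long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);

    for (;;) {
        struct pollfd pfd = { .fd = in_fd, .events = POLLIN };
        int ready = poll(&pfd, 1, timeout_ms);

        if (ready < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;
        }
        if (ready == 0)
            return total;          /* timed out: nothing new arrived */

        /* poll() reported an event (data, error or hang-up): try to read. */
        ssize_t n = read(in_fd, chunk, sizeof chunk);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return total;          /* nothing left to consume */

        /* write() may be partial, so loop until the chunk is flushed. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, chunk + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

A reader for the four-CPU example would run one such loop per relay file, each writing to its own output file (cpu0.out to cpu3.out).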

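2. Kernel-space API:

These interfaces let kernel-space clients manage relay channels and write data:

relay_open() - creates a relay channel, including the relay buffer for each CPU.

relay_close() - closes a relay channel and releases all of its relay buffers. Before doing so it calls relay_switch() on the relay buffers so that data in sub-buffers that have been written but are not yet full is not lost.

relay_write() - writes data into the relay buffer of the current CPU. Because it is protected by local_irqsave(), it can also be used in interrupt context.

relay_reserve() - reserves a contiguous region in the relay channel for a later write. This is typically used by clients that want to write directly into the relay buffer: for performance or other reasons, they do not want to stage the data in a temporary buffer and then write it with relay_write().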

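Introduction to Linux relayfs and how to use it

Starting with the Linux 2.6.14 kernel (2.6.12 needs a patch), relayfs appears as a pseudo filesystem under the File System options of the kernel configuration. This is a new feature.
  File System --->
    Pseudo filesystems --->
      <> Relayfs File System Support
Another well-known pseudo filesystem is the proc file system. Almost everyone who learns Linux has used it to look up the CPU model, memory capacity, and many other pieces of runtime information. proc gives users a convenient interface for querying information that otherwise only the kernel can see, such as cpuinfo, meminfo, and interrupts. These are objects managed by the kernel, yet an ordinary user can still view them. proc exports kernel information dynamically so that ordinary processes can read it at any time, and in some cases the user can also pass information into kernel space, for example: echo 1 > /proc/sys/net/ipv4/ip_forward. relayfs is likewise a tool for exchanging data between the kernel and user space; the difference is that it supports high-volume data transfer.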

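A central concept in relayfs is the "channel". A channel is a set of kernel buffers, each of which appears as a file in relayfs. When kernel code writes data into a channel, the data is automatically placed in the channel's buffers. A user-space application uses mmap() to map these relayfs files and then extracts the data at an appropriate time.

The format of the data written into a channel is entirely up to the program that eventually extracts it. relayfs provides hooks that allow the extracting program (the relayfs client) to impose some structure on the buffer data. The relationship is like that between encoding and decoding: the encoder and decoder only need to match each other, independent of the transport. You can of course apply some encoding while the data is in transit, but that depends on how you eventually decode it. relayfs does not provide any form of data filtering; that task is left to the relayfs client. The design goal of relayfs is to be as simple as possible.

Each relayfs channel has one buffer (in the single-CPU case), and each buffer has one or more sub-buffers. Messages are written starting with the first sub-buffer until it is full. Then, if the second sub-buffer is available, writing continues there, and so on. So when the first sub-buffer fills up, user space is notified while the kernel moves on to write the second sub-buffer.

When the kernel announces that a sub-buffer is full, it knows how many bytes were written. Using this number, user space can copy only the valid data. When the copy is done, user space notifies the kernel that the sub-buffer has been consumed.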
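The sub-buffer filling and "copy only the valid bytes" steps can be illustrated with a small user-space model. This is a sketch under stated assumptions, not relayfs code: the struct model_buf, the functions model_write and model_copy_valid, and the tiny sizes are all hypothetical, chosen so the behaviour is easy to follow. A record that does not fit in the current sub-buffer triggers a switch, and the unused tail of the old sub-buffer is recorded as padding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 8   /* hypothetical, deliberately tiny */
#define N_SUBBUFS   4   /* hypothetical */

/* User-space model of one channel buffer split into sub-buffers. */
struct model_buf {
    char   data[N_SUBBUFS][SUBBUF_SIZE];
    size_t padding[N_SUBBUFS];  /* unused bytes at the end of each sub-buffer */
    size_t cur;                 /* sub-buffer currently being filled */
    size_t offset;              /* next free byte in the current sub-buffer */
};

/* Writer side: append one record, switching to the next sub-buffer
 * (circularly) when the record does not fit in the remaining space. */
void model_write(struct model_buf *b, const void *rec, size_t len)
{
    assert(len <= SUBBUF_SIZE);
    if (b->offset + len > SUBBUF_SIZE) {
        b->padding[b->cur] = SUBBUF_SIZE - b->offset;
        b->cur = (b->cur + 1) % N_SUBBUFS;
        b->offset = 0;
        b->padding[b->cur] = 0;
    }
    memcpy(&b->data[b->cur][b->offset], rec, len);
    b->offset += len;
}

/* Reader side: copy only the valid bytes of a finished sub-buffer,
 * skipping the padding at its end. Returns the number of bytes copied. */
size_t model_copy_valid(const struct model_buf *b, size_t idx, char *out)
{
    size_t valid = SUBBUF_SIZE - b->padding[idx];
    memcpy(out, b->data[idx], valid);
    return valid;
}
```

With 8-byte sub-buffers, two 5-byte records end up in sub-buffers 0 and 1, and sub-buffer 0 carries 3 bytes of padding that the reader skips.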

In overwrite mode, relayfs overwrites data directly, even if user space has not yet collected it.

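The relayfs user space API:

To let user-space programs access the buffer data in a channel, relayfs implements the basic file operations:
open   opens an existing buffer;
mmap   maps the channel buffer into the caller's memory space. Note that partial mappings are not possible; the whole file must be mapped;
read   reads the contents of the channel buffer;
poll   notifies the user-space program that a sub-buffer is full;
close  closes the buffer.

For user-space programs to use relayfs files, relayfs must be mounted. The syntax is much like that for proc:
mount -t relayfs relayfs /mnt/relay/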

Some of the kernel-space API:

relay_open(base_filename, parent, subbuf_size, n_subbufs, callbacks)
relay_close(chan)
relay_flush(chan)
relay_reset(chan)
relayfs_create_dir(name, parent)
relayfs_remove_dir(dentry)
relayfs_create_file(name, parent, mode, fops, data)
relayfs_remove_file(dentry)
relay_subbufs_consumed(chan, cpu, subbufs_consumed)
relay_write(chan, data, length)
__relay_write(chan, data, length)
relay_reserve(chan, length)
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
buf_mapped(buf, filp)
buf_unmapped(buf, filp)
create_buf_file(filename, parent, mode, buf, is_global)
remove_buf_file(dentry)

relay (formerly relayfs)

What is relayfs?

Basically relayfs is just a bunch of per-cpu kernel buffers that can be efficiently written into from kernel code. These buffers are represented as files which can be mmap'ed and directly read from in user space. The purpose of this setup is to provide the simplest possible mechanism allowing potentially large amounts of data to be logged in the kernel and 'relayed' to user space.

Here's a simple diagram illustrating the basic components of relayfs and a typical relayfs application:

This shows 4 cpus, each logging data to its own per-cpu kernel buffer via relay_write(), one of a few relayfs API functions used for writing data into a buffer (these functions automatically determine the correct buffer to write into based on the current cpu id). Each of the per-cpu buffers is represented by a user-specified filename in the relayfs file system. Once the filesystem is mounted, the kernel buffers can be mapped into the address space of a user space application by using mmap on the corresponding relayfs file. As data becomes available in the kernel buffer, write can be used to write it to disk, for example. That's pretty much all there is to it.

relayfs - a high-speed data relay filesystem

relayfs is a filesystem designed to provide an efficient mechanism for tools and facilities to relay large and potentially sustained streams of data from kernel space to user space.

The main abstraction of relayfs is the 'channel'.  A channel consists of a set of per-cpu kernel buffers each represented by a file in the relayfs filesystem.  Kernel clients write into a channel using efficient write functions which automatically log to the current cpu's channel buffer. User space applications mmap() the per-cpu files and retrieve the data as it becomes available.

The format of the data logged into the channel buffers is completely up to the relayfs client; relayfs does however provide hooks which allow clients to impose some structure on the buffer data.  Nor does relayfs implement any form of data filtering - this also is left to the client.  The purpose is to keep relayfs as simple as possible.

This document provides an overview of the relayfs API.  The details of the function parameters are documented along with the functions in the filesystem code - please see that for details.

The relayfs user space API

==========================

relayfs implements basic file operations for user space access to relayfs channel buffer data.  Here are the file operations that are available and some comments regarding their behavior:

open()   enables user to open an _existing_ buffer.

mmap()   results in channel buffer being mapped into the caller's memory space.

poll()   POLLIN/POLLRDNORM/POLLERR supported. User applications are notified when sub-buffer boundaries are crossed.

close()  decrements the channel buffer's refcount.  When the refcount reaches 0 i.e. when no process or kernel client has the buffer open, the channel buffer is freed.

In order for a user application to make use of relayfs files, the relayfs filesystem must be mounted.  For example,

        mount -t relayfs relayfs /mnt/relay

NOTE:   relayfs doesn't need to be mounted for kernel clients to create
        or use channels - it only needs to be mounted when user space
        applications need access to the buffer data.

The relayfs kernel API

======================

Here's a summary of the API relayfs provides to in-kernel clients:

  channel management functions:

    relay_open(base_filename, parent, subbuf_size, n_subbufs, overwrite, callbacks)

    relay_close(chan)

    relay_flush(chan)

    relay_reset(chan)

    relayfs_create_dir(name, parent)

    relayfs_remove_dir(dentry)

    relay_commit(buf, reserved, count)

    relay_subbufs_consumed(chan, cpu, subbufs_consumed)

  write functions:

    relay_write(chan, data, length)

    __relay_write(chan, data, length)

    relay_reserve(chan, length)

  callbacks:

    subbuf_start(buf, subbuf, prev_subbuf_idx, prev_subbuf)

    deliver(buf, subbuf_idx, subbuf)

    buf_mapped(buf, filp)

    buf_unmapped(buf, filp)

    buf_full(buf, subbuf_idx)

A relayfs channel is made up of one or more per-cpu channel buffers, each implemented as a circular buffer subdivided into one or more sub-buffers.

relay_open() is used to create a channel, along with its per-cpu channel buffers.  Each channel buffer will have an associated file created for it in the relayfs filesystem, which can be opened and mmapped from user space if desired.  The files are named basename0...basenameN-1 where N is the number of online cpus, and by default will be created in the root of the filesystem.  If you want a directory structure to contain your relayfs files, you can create it with relayfs_create_dir() and pass the parent directory to relay_open().  Clients are responsible for cleaning up any directory structure they create when the channel is closed – use relayfs_remove_dir() for that.

The total size of each per-cpu buffer is calculated by multiplying the number of sub-buffers by the sub-buffer size passed into relay_open(). The idea behind sub-buffers is that they're basically an extension of double-buffering to N buffers, and they also allow applications to easily implement random-access-on-buffer-boundary schemes, which can be important for some high-volume applications.  The number and size of sub-buffers is completely dependent on the application and even for the same application, different conditions will warrant different values for these parameters at different times.  Typically, the right values to use are best decided after some experimentation; in general, though, it's safe to assume that having only 1 sub-buffer is a bad idea - you're guaranteed to either overwrite data or lose events depending on the channel mode being used.
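Since each per-cpu buffer is n_subbufs * subbuf_size bytes, a user-space reader that wants to mmap() a relay file has to map exactly that many bytes. The sketch below shows one way this might look. The function names (map_relay_file, subbuf_at, unmap_relay_file) are illustrative assumptions, and the caller is assumed to know the subbuf_size and n_subbufs values that the kernel client passed to relay_open(). It uses only standard POSIX calls, so it can be exercised on an ordinary file of the right size.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a whole per-cpu buffer file read-only.
 * Returns the base address, or NULL on failure; *len_out receives the
 * mapped length. All names here are illustrative, not a relayfs API. */
const char *map_relay_file(const char *path, size_t subbuf_size,
                           size_t n_subbufs, size_t *len_out)
{
    assert(subbuf_size > 0 && n_subbufs > 0);

    /* Total size of one per-cpu buffer: the whole file must be mapped. */
    size_t len = subbuf_size * n_subbufs;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    void *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (base == MAP_FAILED)
        return NULL;

    *len_out = len;
    return base;
}

/* Start address of sub-buffer idx inside the mapping. */
const char *subbuf_at(const char *base, size_t subbuf_size, size_t idx)
{
    return base + idx * subbuf_size;
}

/* Release the mapping created by map_relay_file(). */
int unmap_relay_file(const char *base, size_t len)
{
    return munmap((void *)base, len);
}
```

Addressing by sub-buffer index is what makes the random-access-on-buffer-boundary schemes mentioned above straightforward: sub-buffer i always starts at offset i * subbuf_size.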

relayfs channels can be opened in either of two modes - 'overwrite' or 'no-overwrite'.  In overwrite mode, writes continuously cycle around the buffer and will never fail, but will unconditionally overwrite old data regardless of whether it's actually been consumed.  In no-overwrite mode, writes will fail i.e. data will be lost, if the number of unconsumed sub-buffers equals the total number of sub-buffers in the channel.  In this mode, the client is responsible for notifying relayfs when sub-buffers have been consumed via relay_subbufs_consumed().  A full buffer will become 'unfull' and logging will continue once the client calls relay_subbufs_consumed() again.  When a buffer becomes full, the buf_full() callback is invoked to notify the client.  The mode is specified to relay_open() using the overwrite parameter.

In both modes, the subbuf_start() callback will notify the client whenever a sub-buffer boundary is crossed.  This can be used to write header information into the new sub-buffer or fill in header information reserved in the previous sub-buffer.  One piece of information that's useful to save in a reserved header slot is the number of bytes of 'padding' for a sub-buffer, which is the amount of unused space at the end of a sub-buffer.  The padding count for each sub-buffer is contained in an array in the rchan_buf struct passed into the subbuf_start() callback: rchan_buf->padding[prev_subbuf_idx] can be used to get the padding for the just-finished sub-buffer.  subbuf_start() is also called for the first sub-buffer in each channel buffer when the channel is created.
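The difference between the 'overwrite' and 'no-overwrite' modes comes down to simple bookkeeping of produced versus consumed sub-buffers. The following user-space model is only a sketch of that bookkeeping, not the kernel implementation: struct mode_model and both functions are hypothetical names, and the model counts whole sub-buffers rather than bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model of the accounting behind the two channel modes. */
struct mode_model {
    size_t n_subbufs;    /* total sub-buffers in the channel buffer */
    size_t produced;     /* sub-buffers the writer has filled */
    size_t consumed;     /* sub-buffers the reader has released */
    size_t overwritten;  /* unread sub-buffers lost in overwrite mode */
    size_t dropped;      /* fills refused in no-overwrite mode */
    bool   overwrite;    /* true: overwrite mode, false: no-overwrite */
};

/* The writer tries to fill one more sub-buffer.
 * Returns false when the data is lost because the buffer is full
 * and the channel is in no-overwrite mode. */
bool model_fill_subbuf(struct mode_model *m)
{
    bool full = (m->produced - m->consumed) == m->n_subbufs;

    if (full && !m->overwrite) {
        m->dropped++;            /* where buf_full() would notify the client */
        return false;
    }
    if (full) {
        m->overwritten++;        /* oldest unread sub-buffer is clobbered */
        m->consumed++;           /* so it no longer counts as unread */
    }
    m->produced++;
    return true;
}

/* The reader reports n sub-buffers as consumed, which is the role
 * relay_subbufs_consumed() plays in the text above. */
void model_subbufs_consumed(struct mode_model *m, size_t n)
{
    assert(m->consumed + n <= m->produced);
    m->consumed += n;
}
```

In the model, a full no-overwrite buffer refuses further fills until model_subbufs_consumed() is called, while an overwrite buffer always accepts the fill and silently loses the oldest unread sub-buffer.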

Kernel clients write data into the current cpu's channel buffer using relay_write() or __relay_write().  relay_write() is the main logging function - it uses local_irqsave() to protect the buffer and should be used if you might be logging from interrupt context. If you know you'll never be logging from interrupt context, you can use __relay_write(), which only disables preemption.  These functions don't return a value, so you can't determine whether or not they failed - the assumption is that you wouldn't want to check a return value in the fast logging path anyway, and that they'll always succeed unless the buffer is full and in no-overwrite mode, in which case you'll be notified via the buf_full() callback.

relay_reserve() is used to reserve a slot in a channel buffer which can be written to later.  This would typically be used in applications that need to write directly into a channel buffer without having to stage data in a temporary buffer beforehand. Because the actual write may not happen immediately after the slot is reserved, applications using relay_reserve() can call relay_commit() to notify relayfs when the slot has actually been written.  When all the reserved slots have been committed, the deliver() callback is invoked to notify the client that a guaranteed full sub-buffer has been produced.  Because the write is under control of the client and is separated from the reserve, relay_reserve() doesn't protect the buffer at all - it's up to the client to provide the appropriate synchronization when using relay_reserve().
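The reserve/commit split can be pictured with a small user-space model of a single sub-buffer. This is an illustrative sketch only: struct slot_subbuf, model_reserve and model_commit are hypothetical names, and the model deliberately simplifies matters by returning NULL when the sub-buffer is exhausted instead of moving on to the next sub-buffer. Slots may be filled and committed in any order, and the "deliver" condition is reached only when every byte of the sub-buffer has been committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOT_SUBBUF_SIZE 16   /* hypothetical, deliberately tiny */

/* User-space model of one sub-buffer with reserve/commit accounting. */
struct slot_subbuf {
    char   data[SLOT_SUBBUF_SIZE];
    size_t reserved;    /* bytes handed out to writers */
    size_t committed;   /* bytes writers have confirmed as written */
    bool   delivered;   /* set once the sub-buffer is completely committed */
};

/* Reserve len contiguous bytes; the caller writes into the slot later.
 * Returns NULL when the slot does not fit in this sub-buffer.
 * No locking is done here: synchronization is the caller's job. */
char *model_reserve(struct slot_subbuf *s, size_t len)
{
    if (len > SLOT_SUBBUF_SIZE - s->reserved)
        return NULL;
    char *slot = s->data + s->reserved;
    s->reserved += len;
    return slot;
}

/* Confirm that count reserved bytes have now been written.
 * Returns true when this commit completes the sub-buffer, which is
 * the point at which a deliver()-style notification would fire. */
bool model_commit(struct slot_subbuf *s, size_t count)
{
    assert(s->committed + count <= s->reserved);
    s->committed += count;
    if (s->committed == SLOT_SUBBUF_SIZE) {
        s->delivered = true;
        return true;
    }
    return false;
}
```

Keeping the commit count separate from the reserve offset is what lets the reader trust a delivered sub-buffer: every reserved slot in it is known to have been written.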

The client calls relay_close() when it's finished using the channel. The channel and its associated buffers are destroyed when there are no longer any references to any of the channel buffers. relay_flush() forces a sub-buffer switch on all the channel buffers, and can be used to finalize and process the last sub-buffers before the channel is closed.

Some applications may want to keep a channel around and re-use it rather than open and close a new channel for each use.  relay_reset() can be used for this purpose - it resets a channel to its initial state without reallocating channel buffer memory or destroying existing mappings.  It should however only be called when it's safe to do so i.e. when the channel isn't currently being written to.

Finally, there are a couple of utility callbacks that can be used for different purposes.  buf_mapped() is called whenever a channel buffer is mmapped from user space and buf_unmapped() is called when it's unmapped.  The client can use this notification to trigger actions within the kernel application, such as enabling/disabling logging to the channel.
