How epoll Is Implemented - lc0060305 - ChinaUnix Blog

Li Gengrui (lgr)'s blog -- 蔚蓝天空 garry.blog.chinaunix.net

How epoll Is Implemented

Category: LINUX

2010-03-23 18:50:34

1 Overview
    Unlike select/poll, epoll is made up of a group of system calls:
    int epoll_create(int size);
    int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
    int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
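    The epoll system calls were introduced in Linux 2.5.44. They were designed to address the shortcomings of the traditional select/poll system calls, and the design differs substantially. The drawbacks of select/poll are:
    1. Every call reads its parameters in from user space again.
    2. Every call scans all the file descriptors again.
    3. At the start of every call the current process is put on the wait queue of each file descriptor, and at the end of the call it is removed from each of those wait queues again.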
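    In practice select/poll may be monitoring a very large number of file descriptors. If only a small fraction of them is returned on each call, select/poll is inefficient. The idea behind epoll is to split the single select/poll operation into one epoll_create, multiple epoll_ctl calls and an epoll_wait. In addition, the kernel adds a file system, "eventpollfs", for epoll. Each set of one or more monitored file descriptors has one corresponding inode in eventpollfs, whose main state is kept in a struct eventpoll. The important information about each monitored file is kept in a struct epitem. The relationship between the two is therefore one-to-many.
    Because epoll_create and epoll_ctl have already stored the user-space information in the kernel, later calls to epoll_wait, however many there are, neither copy the parameters again, nor rescan the file descriptors, nor repeatedly put the current process on and off the wait queues. This avoids all three drawbacks listed above.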
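    The three-step split can be sketched from user space as follows. This is a minimal illustration rather than anything taken from the kernel source: the function name demo_pipe_ready and the use of a pipe as the monitored file are assumptions made for the example, and it runs on Linux only.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of ready events reported for a
 * pipe holding one unread byte, or -1 on any setup error. Linux only. */
int demo_pipe_ready(int timeout_ms)
{
    assert(timeout_ms >= 0);            /* never block forever in a demo */

    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    /* Step 1: one epoll_create per event loop. */
    int epfd = epoll_create(1);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* Step 2: one epoll_ctl per descriptor; the interest set is now
     * stored in the kernel. */
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];

    int n = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "x", 1) == 1) {
        /* Step 3: epoll_wait passes only an output buffer, not the
         * list of descriptors to watch. */
        struct epoll_event out[4];
        n = epoll_wait(epfd, out, 4, timeout_ms);
        if (n == 1 && out[0].data.fd != pipefd[0])
            n = -1;                     /* unexpected descriptor reported */
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

    Calling demo_pipe_ready(1000) should return 1, since the read end of the pipe is readable. In a real event loop steps 1 and 2 happen once, and only step 3 is repeated.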
    Let us now look at how they are implemented:
2 Key data structures:
/* Wrapper struct used by poll queueing */
struct ep_pqueue {
        poll_table pt;
        struct epitem *epi;
};
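    This structure is similar to struct poll_wqueues in select/poll. Because epoll has to keep a lot of state in the kernel, a bare callback function pointer is no longer enough, so a new structure, struct epitem, is introduced here.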
/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the hash.
 */
struct epitem {
        /* RB-Tree node used to link this structure to the eventpoll rb-tree */
        struct rb_node rbn;
        /* Note: red-black tree node that places this epitem in the eventpoll's tree. */

        /* List header used to link this structure to the eventpoll ready list */
        struct list_head rdllink;
        /* Note: doubly linked list node that puts this epitem on the eventpoll's list of completed items. */

        /* The file descriptor information this item refers to */
        struct epoll_filefd ffd;
        /* Note: the monitored file descriptor that this structure belongs to. */

        /* Number of active wait queue attached to poll operations */
        int nwait;
        /* Note: number of wait queues attached during the poll operation. */

        /* List containing poll wait queues */
        struct list_head pwqlist;
        /* Note: doubly linked list of the wait-queue entries on the monitored file; plays a role similar to poll_table in select/poll. */

        /* The "container" of this item */
        struct eventpoll *ep;
        /* Note: points to the eventpoll; many epitems belong to one eventpoll. */

        /* The structure that describe the interested events and the source fd */
        struct epoll_event event;
        /* Note: records the events of interest and the corresponding fd. */

        /*
         * Used to keep track of the usage count of the structure. This avoids
         * that the structure will disappear from underneath our processing.
         */
        atomic_t usecnt;
        /* Note: reference count. */

        /* List header used to link this item to the "struct file" items list */
        struct list_head fllink;
        /* Note: doubly linked list node that ties this item to the struct file of the monitored descriptor; the file has an f_ep_link list holding every epoll item that monitors it. */

        /* List header used to link the item to the transfer list */
        struct list_head txlink;
        /* Note: doubly linked list node for the transfer list. */

        /*
         * This is used during the collection/transfer of events to userspace
         * to pin items empty events set.
         */
        unsigned int revents;
        /* Note: state of the file descriptor; used during collection and transfer to pin items whose event set is empty. */
};
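    This structure records a file descriptor associated with the epoll node. The items are kept in a lookup table implemented as a red-black tree (the kernel comment above calls it "the hash"). Why they have to be stored is explained in detail below. Each epitem corresponds one-to-one to a monitored file descriptor.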
struct eventpoll {
        /* Protect the this structure access */
        rwlock_t lock;
        /* Note: read-write lock. */

        /*
         * This semaphore is used to ensure that files are not removed
         * while epoll is using them. This is read-held during the event
         * collection loop and it is write-held during the file cleanup
         * path, the epoll file exit code and the ctl operations.
         */
        struct rw_semaphore sem;
        /* Note: read-write semaphore. */

        /* Wait queue used by sys_epoll_wait() */
        wait_queue_head_t wq;

        /* Wait queue used by file->poll() */
        wait_queue_head_t poll_wait;

        /* List of ready file descriptors */
        struct list_head rdllist;
        /* Note: queue of events whose operations have completed. */

        /* RB-Tree root used to store monitored fd structs */
        struct rb_root rbr;
        /* Note: holds the file descriptors monitored by this epoll instance. */
};
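    This structure holds the extended information of the epoll file descriptor and is stored in the private_data field of struct file. It corresponds one-to-one to the epoll file node. An epoll file node usually has many monitored file descriptors, so one struct eventpoll corresponds to many struct epitem.
    So where does epoll keep the events being waited for? See below: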
/* Wait structure used by the poll hooks */
struct eppoll_entry {
        /* List header used to link this structure to the "struct epitem" */
        struct list_head llink;
        /* The "base" pointer is set to the container "struct epitem" */
        void *base;
        /*
         * Wait queue item that will be linked to the target file wait
         * queue head.
         */
        wait_queue_t wait;
        /* The wait queue head that linked the "wait" wait queue item */
        wait_queue_head_t *whead;
};
    Compared with struct poll_table_entry in select/poll, the structure epoll uses to represent a wait-queue node differs only slightly. Compare it with struct poll_table_entry:
struct poll_table_entry {
        struct file * filp;
        wait_queue_t wait;
        wait_queue_head_t * wait_address;
};
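    Since an epitem corresponds to one monitored file, base gives convenient access to the information about the monitored file. And because the poll method of a single file may register on more than one wait queue, llink chains these entries together on the epitem.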
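    The role of base is easiest to see from the wakeup side, where the callback is handed only the embedded wait-queue node. The sketch below shows the usual container_of technique for stepping back from that node to the enclosing entry and then on to the owning item. All type and function names here are simplified stand-ins invented for this example, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; only the fields needed for the illustration. */
struct wait_entry { int unused; };      /* plays the role of wait_queue_t */
struct item { int fd; };                /* plays the role of struct epitem */

struct poll_entry {                     /* plays the role of struct eppoll_entry */
    struct poll_entry *next;            /* stand-in for llink */
    void *base;                         /* the owning item */
    struct wait_entry wait;             /* embedded wait-queue node */
};

/* Step back from a pointer to an embedded member to the enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Given only the wait-queue node, recover the item that owns it. */
struct item *item_from_wait(struct wait_entry *w)
{
    assert(w != NULL);
    struct poll_entry *pe = container_of(w, struct poll_entry, wait);
    return (struct item *)pe->base;
}
```

    With struct poll_table_entry the same step would yield a struct file pointer. Here it yields the per-descriptor bookkeeping object, which is what the callback needs in order to queue it.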
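3 Implementation of epoll_create
    epoll_create() creates an inode in the eventpollfs file system. The work is done by ep_getfd(), which first calls ep_eventpoll_inode() to create an inode, then calls d_alloc() to allocate a dentry for the inode, and finally ties the file, dentry and inode together.
    After ep_getfd() has run, epoll_create() calls ep_file_init(), which allocates the eventpoll structure and stores its pointer in the file structure, so that the eventpoll is associated with the file.
    Note that the size argument of epoll_create() is only a hint: as long as it is greater than 0, it does not limit the number of file descriptors that can be associated with this epoll inode.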
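    The behaviour of size can be checked from user space. The sketch below is an illustration only: the helper names create_rejects_zero and add_with_hint are made up for this example, and it assumes a Linux system where epoll_create() fails with EINVAL for a non-positive size, as the man page documents.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if epoll_create(0) is rejected with EINVAL, 0 otherwise. */
int create_rejects_zero(void)
{
    errno = 0;
    int epfd = epoll_create(0);
    if (epfd != -1) {
        close(epfd);
        return 0;
    }
    return errno == EINVAL;
}

/* Creates an epoll instance with the given size hint, then registers both
 * ends of a pipe. Returns how many descriptors were accepted, or -1 on a
 * setup error. */
int add_with_hint(int size_hint)
{
    assert(size_hint > 0);

    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    int epfd = epoll_create(size_hint);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    int added = 0;
    struct epoll_event ev;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;                /* read end: wait for input */
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0)
        added++;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLOUT;               /* write end: wait for buffer space */
    ev.data.fd = pipefd[1];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[1], &ev) == 0)
        added++;

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return added;
}
```

    With a hint of 1, add_with_hint(1) still registers both descriptors, which matches the statement that size does not cap the number of monitored files.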
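4 Implementation of epoll_ctl
    epoll_ctl carries out a set of operations, such as associating a file with the eventpollfs inode. A word about struct eventpoll is needed here: it is stored in file->private_data and records the important state of the eventpollfs inode. Its member rbr holds all the file descriptors monitored by this epoll file node. They are organized as a red-black tree, a structure that makes lookups very efficient.
    epoll_ctl first calls ep_find() to look up the epitem in the eventpoll's red-black tree, and then chooses an operation according to the op argument. If op is EPOLL_CTL_ADD, then normally the epitem cannot be found in the eventpoll's red-black tree, so ep_insert() is called to create an epitem and insert it into the tree.
    ep_insert() first allocates an epitem object, initializes it and puts it into the red-black tree. The function also has one more job: hooking a wait-queue entry into the wait queue of the monitored file, so that the waiter is notified when the file operation completes. This step is done by the following code: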
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    ...
    revents = tfile->f_op->poll(tfile, &epq.pt);
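    The function first calls init_poll_funcptr to register a callback, ep_ptable_queue_proc, which is executed when f_op->poll is called. ep_ptable_queue_proc allocates an epoll wait-queue node, an eppoll_entry, and links it on one side into the wait queue of the monitored file and on the other side into the epitem's list. It also registers a wait-queue callback, ep_poll_callback. When the file operation completes, ep_poll_callback() is called before the current process is woken up; it puts the epitem on the eventpoll's ready list and wakes up the waiting processes.
    If, after f_op->poll has run, the monitored file operation turns out to have completed already, the epitem is put on the ready list right away and the processes waiting for the operation are woken up immediately.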
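    The callback-to-ready-list flow can be modelled in a few lines of user-space C. This is a toy simulation under stated assumptions, not kernel code: the sim_ names are invented, locking is omitted, the wakeup is reduced to a counter, and the rule that an item is linked onto the ready list at most once is how this sketch handles repeated events.

```c
#include <assert.h>
#include <stddef.h>

struct sim_item {
    int fd;
    int queued;                     /* already on the ready list? */
    struct sim_item *next;          /* ready-list link (stand-in for rdllink) */
};

struct sim_eventpoll {
    struct sim_item *ready_head;    /* stand-in for rdllist */
    struct sim_item *ready_tail;
    int wakeups;                    /* counts simulated wakeups */
};

/* Stand-in for the wait-queue callback: runs when a monitored file
 * signals an event. */
void sim_poll_callback(struct sim_eventpoll *ep, struct sim_item *it)
{
    if (!it->queued) {              /* link each item at most once */
        it->queued = 1;
        it->next = NULL;
        if (ep->ready_tail)
            ep->ready_tail->next = it;
        else
            ep->ready_head = it;
        ep->ready_tail = it;
    }
    ep->wakeups++;                  /* wake whoever sleeps in the wait call */
}

/* Stand-in for the collection step: unlink ready items and copy out
 * their descriptors. Only ready items are visited. */
int sim_collect(struct sim_eventpoll *ep, int *out, int max)
{
    assert(max >= 0);
    int n = 0;
    while (ep->ready_head && n < max) {
        struct sim_item *it = ep->ready_head;
        ep->ready_head = it->next;
        if (!ep->ready_head)
            ep->ready_tail = NULL;
        it->queued = 0;
        it->next = NULL;
        out[n++] = it->fd;
    }
    return n;
}
```

    The point of the model is that sim_collect never looks at items that did not fire, however many are registered.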
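5 Implementation of epoll_wait
    epoll_wait waits for file operations to complete and returns them.
    Its core is ep_poll(), which checks in a for loop whether the eventpoll's ready list holds any completed events. If it does, the results are returned. If not, it calls schedule_timeout() to sleep until the process is woken up again or the timeout expires.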
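    The timeout branch is visible from user space: when nothing becomes ready before the timeout expires, epoll_wait() returns 0. A minimal sketch, assuming Linux and using an empty pipe as a file that never becomes readable (the helper name wait_with_timeout is invented for this example):

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Waits on an epoll instance that watches the read end of an empty pipe.
 * Returns the epoll_wait result, or -1 on a setup error. */
int wait_with_timeout(int timeout_ms)
{
    assert(timeout_ms >= 0);        /* a negative timeout would block forever */

    int pipefd[2];
    if (pipe(pipefd) == -1)
        return -1;

    int epfd = epoll_create(1);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];

    int n = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
        struct epoll_event out[1];
        /* Nothing is ever written to the pipe, so the ready list stays
         * empty and the call sleeps until the timeout expires. */
        n = epoll_wait(epfd, out, 1, timeout_ms);
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

    wait_with_timeout(50) sleeps for roughly 50 ms and then returns 0, the "timed out, nothing ready" case.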
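6 Performance analysis
    epoll was designed around the weaknesses of select/poll. Through the newly introduced eventpollfs file system, epoll copies the parameters into the kernel once, so they are not copied again on every poll. By splitting the operation into epoll_create, epoll_ctl and epoll_wait, it avoids repeatedly traversing the monitored file descriptors. Furthermore, once the process that called epoll is woken up, it only has to take the completed events off the eventpoll's ready list, so the cost of finding completed events drops from O(N) in the number of monitored descriptors to O(1) per completed event.
    This improvement has a precondition, though: a very large number of monitored file descriptors, of which only a few complete on each call. Whether epoll noticeably improves efficiency therefore depends on the actual workload, and this needs further testing.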
