How epoll Is Implemented -- ChinaUnix Blog

pygmalion666's ChinaUnix blog
Posted: 2012-10-15 17:19:07
Original article: "How epoll Is Implemented" (epoll的实现原理), by tchlinux
Original date: 2009-04-17 13:05
1 Overview
     Unlike select/poll, epoll is made up of a group of system calls:
     int epoll_create(int size);
     int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
     int epoll_wait(int epfd, struct epoll_event *events,
                       int maxevents, int timeout);
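     The epoll system calls were introduced in Linux 2.5.44. They were designed
to address the shortcomings of the traditional select/poll system calls, and
the design differs substantially. The drawbacks of select/poll are:
     1. Every call copies its arguments in from user space again.
     2. Every call scans all of the file descriptors again.
     3. At the start of every call the current process is added to the wait
queue of each file descriptor, and at the end of the call it is removed from
every one of those wait queues.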
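     The first two drawbacks are visible even from user space. The following is
a minimal sketch, not taken from the original article; the helper name
wait_with_select is made up for illustration. It rebuilds the descriptor set
before every call and scans it again afterwards.

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: waits once with select() on nfds descriptors.
 * Returns the index of the first readable descriptor, or -1 on timeout
 * or error. */
int wait_with_select(const int *fds, int nfds, int timeout_ms)
{
    fd_set readable;
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    int maxfd = -1;

    /* Drawback 1: the whole set is rebuilt and handed to the kernel
     * again on every call. */
    FD_ZERO(&readable);
    for (int i = 0; i < nfds; i++) {
        FD_SET(fds[i], &readable);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    if (select(maxfd + 1, &readable, NULL, NULL, &tv) <= 0)
        return -1;

    /* Drawback 2: the caller has to scan every descriptor to find the
     * ones that are ready. */
    for (int i = 0; i < nfds; i++)
        if (FD_ISSET(fds[i], &readable))
            return i;
    return -1;
}
```

     Calling this helper in a loop repeats both the set-up loop and the scan
on every iteration, however few descriptors are actually ready.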
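     In real applications select/poll may be monitoring a very large number of
file descriptors. If only a small fraction of them is returned each time,
select/poll is inefficient. The idea behind epoll is to split the single
select/poll operation into one epoll_create, several epoll_ctl calls, and one
epoll_wait. In addition, the kernel adds a file system, "eventpollfs", for
epoll. Each group of one or more monitored file descriptors has a
corresponding inode in eventpollfs, whose main information is kept in the
eventpoll structure. The important information about each monitored file is
kept in an epitem structure. The relationship between eventpoll and epitem is
therefore one-to-many.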
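     Because epoll_create and epoll_ctl have already stored the user-space
information in the kernel, repeated calls to epoll_wait do not copy the
arguments again, do not rescan the file descriptors, and do not repeatedly add
the current process to and remove it from the wait queues. This avoids the
three drawbacks listed above.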
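     From the caller's side, the split into three steps might look like the
sketch below. It is an illustration only: the helper names watch_fds and
next_ready_fd are assumptions, not part of the epoll API or of the original
article.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Steps 1 and 2: create the epoll instance and register every descriptor
 * once. Returns the epoll descriptor, or -1 on error. */
int watch_fds(const int *fds, int nfds)
{
    int epfd = epoll_create(nfds > 0 ? nfds : 1);
    if (epfd < 0)
        return -1;
    for (int i = 0; i < nfds; i++) {
        struct epoll_event ev = { 0 };
        ev.events = EPOLLIN;
        ev.data.fd = fds[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) != 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}

/* Step 3, repeated as often as needed: the list of monitored descriptors
 * is not passed in again. Returns a ready descriptor, or -1 on timeout
 * or error. */
int next_ready_fd(int epfd, int timeout_ms)
{
    struct epoll_event out;
    if (epoll_wait(epfd, &out, 1, timeout_ms) != 1)
        return -1;
    return out.data.fd;
}
```

     A program would call watch_fds() once at start-up and then call
next_ready_fd() in its event loop.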
     Next, let us look at how they are implemented:
2 Key data structures:
/* Wrapper struct used by poll queueing */
struct ep_pqueue {
         poll_table pt;
         struct epitem *epi;
};
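     This structure is similar to struct poll_wqueues in select/poll. Because
epoll has to keep a lot of state in the kernel, a bare callback function
pointer is no longer enough, so a new structure, struct epitem, is introduced
here.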
/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the hash.
 */
struct epitem {
         /* RB-Tree node used to link this structure to the eventpoll rb-tree */
         struct rb_node rbn;
Red-black tree node that links this epitem into the eventpoll's tree
         /* List header used to link this structure to the eventpoll ready list */
         struct list_head rdllink;
Doubly linked list node that links this epitem into the eventpoll's ready list
         /* The file descriptor information this item refers to */
         struct epoll_filefd ffd;
Information about the monitored file descriptor this item corresponds to
         /* Number of active wait queue attached to poll operations */
         int nwait;
Number of wait queues attached by poll operations
         /* List containing poll wait queues */
         struct list_head pwqlist;
Doubly linked list holding the wait queue entries of the monitored file; its
role is similar to that of poll_table in select/poll
         /* The "container" of this item */
         struct eventpoll *ep;
Points to the owning eventpoll; many epitems belong to one eventpoll
         /* The structure that describe the interested events and the source fd */
         struct epoll_event event;
Records the events of interest and the corresponding fd
         /*
          * Used to keep track of the usage count of the structure. This avoids
          * that the structure will disappear from underneath our processing.
          */
         atomic_t usecnt;
Reference count
         /* List header used to link this item to the "struct file" items list */
         struct list_head fllink;
Doubly linked list node that links this item to the struct file of the
monitored descriptor. The file has an f_ep_link list that holds all the epoll
items monitoring that file
         /* List header used to link the item to the transfer list */
         struct list_head txlink;
Doubly linked list node for the transfer list
         /*
          * This is used during the collection/transfer of events to userspace
          * to pin items empty events set.
          */
         unsigned int revents;
State of the file descriptor; used during collection and transfer to pin
items whose event set is empty
};
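     This structure represents one file descriptor associated with an epoll
node. The items are stored in a red-black tree (the "hash" mentioned in the
comment above is in fact this tree). Why they need to be stored is explained
in detail below. There is exactly one epitem for each monitored file
descriptor.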
struct eventpoll {
         /* Protect access to this structure */
         rwlock_t lock;
Read-write lock
         /*
          * This semaphore is used to ensure that files are not removed
          * while epoll is using them. This is read-held during the event
          * collection loop and it is write-held during the file cleanup
          * path, the epoll file exit code and the ctl operations.
          */
         struct rw_semaphore sem;
Read-write semaphore
         /* Wait queue used by sys_epoll_wait() */
         wait_queue_head_t wq;
         /* Wait queue used by file->poll() */
         wait_queue_head_t poll_wait;
         /* List of ready file descriptors */
         struct list_head rdllist;
Queue of events whose operations have completed
         /* RB-Tree root used to store monitored fd structs */
         struct rb_root rbr;
Holds the file descriptors monitored by this epoll instance
};
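     This structure holds the extended information for an epoll file
descriptor and is stored in the private_data field of the file structure.
There is exactly one per epoll file node. An epoll file node usually
corresponds to many monitored file descriptors, so one eventpoll structure
corresponds to many epitem structures.
     So where does epoll keep the events it is waiting on? See below.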
/* Wait structure used by the poll hooks */
struct eppoll_entry {
         /* List header used to link this structure to the "struct epitem" */
         struct list_head llink;
         /* The "base" pointer is set to the container "struct epitem" */
         void *base;
         /*
          * Wait queue item that will be linked to the target file wait
          * queue head.
          */
         wait_queue_t wait;
         /* The wait queue head that linked the "wait" wait queue item */
         wait_queue_head_t *whead;
};
     The structure epoll uses for a wait-queue node differs only slightly from
struct poll_table_entry in select/poll. For comparison:
struct poll_table_entry {
         struct file * filp;
         wait_queue_t wait;
         wait_queue_head_t * wait_address;
};
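     Since an epitem corresponds to one monitored file, base gives convenient
access to the information about that file. And because one file may have
several wait queues attached, llink chains these entries together on the
epitem's pwqlist.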
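     To make the role of base concrete, here is a toy user-space sketch. The
toy_ types and the toy_container_of macro are simplified stand-ins invented
for this example, not the kernel definitions. It shows how code that is handed
only the embedded wait entry can get back to the owning item.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel types (assumptions for illustration). */
struct toy_wait_entry { int unused; };  /* plays the role of wait_queue_t */
struct toy_epitem { int fd; };          /* plays the role of struct epitem */

struct toy_eppoll_entry {
    void *base;                  /* points back at the owning toy_epitem */
    struct toy_wait_entry wait;  /* the part linked into a wait queue */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Given only the wait entry, find the fd of the item that owns it. */
int fd_from_wait_entry(struct toy_wait_entry *w)
{
    struct toy_eppoll_entry *pwq =
        toy_container_of(w, struct toy_eppoll_entry, wait);
    struct toy_epitem *epi = pwq->base;
    return epi->fd;
}
```

     struct poll_table_entry has no such back pointer. It stores the struct
file directly in filp.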
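3 Implementation of epoll_create
     epoll_create() creates an inode in the eventpollfs file system. The work
is done by ep_getfd(), which first calls ep_eventpoll_inode() to create an
inode, then calls d_alloc() to allocate a dentry for the inode, and finally
ties the file, dentry and inode together.
     After ep_getfd(), epoll_create() calls ep_file_init(), which allocates an
eventpoll structure and stores its pointer in the file structure, so that the
eventpoll is associated with the file.
     Note that the size argument of epoll_create() is only a hint. As long as
it is greater than 0, it does not limit the number of file descriptors
associated with this epoll inode.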
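     The behaviour of size can be checked from user space. The sketch below
assumes a current Linux system, which may differ in detail from the kernel
version discussed in this article; the helper names are hypothetical.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_TEST_FDS 16

/* Returns 0 if epoll_create(size) succeeds, otherwise the errno value. */
int try_epoll_create(int size)
{
    int epfd = epoll_create(size);
    if (epfd < 0)
        return errno;
    close(epfd);
    return 0;
}

/* Creates an epoll instance with a size hint of 1, then registers n pipe
 * read ends (n is clamped to MAX_TEST_FDS). Returns how many registrations
 * were accepted, or -1 if the epoll instance could not be created. */
int add_beyond_size_hint(int n)
{
    int rd[MAX_TEST_FDS], wr[MAX_TEST_FDS], made = 0, added = 0;
    int epfd = epoll_create(1);

    if (epfd < 0)
        return -1;
    if (n > MAX_TEST_FDS)
        n = MAX_TEST_FDS;
    while (made < n) {
        int p[2];
        struct epoll_event ev = { 0 };
        if (pipe(p) != 0)
            break;
        rd[made] = p[0];
        wr[made] = p[1];
        made++;
        ev.events = EPOLLIN;
        ev.data.fd = p[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0)
            added++;
    }
    for (int i = 0; i < made; i++) {
        close(rd[i]);
        close(wr[i]);
    }
    close(epfd);
    return added;
}
```

     On such a system try_epoll_create(0) should report EINVAL, while
add_beyond_size_hint(8) should accept all eight descriptors even though the
hint was 1.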
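4 Implementation of epoll_ctl
     epoll_ctl performs a set of operations, such as associating a file with
an eventpollfs inode. The eventpoll structure matters here: it is stored in
file->private_data and records the important information about the
eventpollfs inode. Its member rbr holds all the file descriptors monitored by
this epoll file node, organized as a red-black tree, a structure that makes
lookups very efficient.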
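     First it calls ep_find() to look up the epitem in the eventpoll's
red-black tree, then performs a different operation depending on the op
argument. If op is EPOLL_CTL_ADD, the epitem should normally not be found in
the tree, so ep_insert() is called to create an epitem and insert it into the
tree.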
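     The lookup that precedes each operation is observable through the error
codes of epoll_ctl. The following sketch uses a hypothetical helper,
ctl_result, that runs one operation and reports the outcome.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Runs one epoll_ctl operation on fd, asking for EPOLLIN.
 * Returns 0 on success or the errno value on failure. */
int ctl_result(int epfd, int op, int fd)
{
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}
```

     Adding the same descriptor twice returns EEXIST on the second call,
because the item is already registered. EPOLL_CTL_MOD or EPOLL_CTL_DEL on a
descriptor that was never added returns ENOENT.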
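     ep_insert() first allocates an epitem object, initializes it, and inserts
it into the red-black tree. It has one more job: hooking a wait-queue entry
onto the wait queue of the target file, so that epoll is notified when the
file becomes ready. This step is done by the following code.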
     init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
     ...
     revents = tfile->f_op->poll(tfile, &epq.pt);
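     The function first calls init_poll_funcptr to register a callback,
ep_ptable_queue_proc, which is executed when f_op->poll is called. That
callback allocates an epoll wait-queue node, an eppoll_entry, and links it
into the wait queue of the file on one side and into the epitem's list on the
other. It also registers a wait-queue callback, ep_poll_callback. When the
file operation completes, ep_poll_callback() is called; it puts the epitem on
the eventpoll's ready list and wakes up the waiting processes.
     If f_op->poll shows that the monitored file is already ready, the epitem
is put on the ready list right away and the processes waiting for it are woken
up immediately.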
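     The "already ready at registration time" case can be seen from user
space. In this sketch (the function name is made up) data is written to a
pipe before its read end is registered, and a non-blocking epoll_wait is
issued straight afterwards.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Writes to a pipe first, registers its read end afterwards, then polls
 * with a zero timeout. Returns the number of events reported, or -1 if
 * the set-up failed. */
int ready_before_add(void)
{
    int p[2], epfd, n = -1;
    struct epoll_event ev = { 0 }, out;

    if (pipe(p) != 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd >= 0 && write(p[1], "x", 1) == 1) {
        ev.events = EPOLLIN;
        ev.data.fd = p[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0)
            n = epoll_wait(epfd, &out, 1, 0);   /* 0 = do not block */
    }
    if (epfd >= 0)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return n;
}
```

     The event is reported even though nothing happened on the pipe after
registration, which matches the check made at insertion time.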
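5 Implementation of epoll_wait
     epoll_wait waits for file operations to complete and returns them.
     Its core is ep_poll(). In a for loop, this function checks whether the
eventpoll's ready list contains any completed events. If it does, the results
are returned. If not, it calls schedule_timeout() to sleep until the process
is woken up again or the timeout expires.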
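6 Performance analysis
     The epoll mechanism was designed around the weaknesses of select/poll.
Through the newly introduced eventpollfs file system, epoll copies the
arguments into the kernel once, so they are not copied again on every poll.
Splitting the operation into epoll_create, epoll_ctl and epoll_wait avoids
repeatedly traversing the monitored file descriptors. Moreover, once the
process calling epoll is woken up, it only has to take the completed events
directly from the eventpoll's ready list, so the cost of finding completed
events no longer grows with N, the number of monitored descriptors: it drops
from O(N) to O(1).
     However, the performance gain of epoll has a precondition: the number of
monitored file descriptors is very large and only very few of them complete
on each call. Whether epoll improves efficiency significantly therefore
depends on the actual workload. This needs further testing.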
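     The scenario in which epoll is expected to help, many monitored
descriptors and few ready ones, can be reproduced in miniature. The sketch
below is not a benchmark. NPIPES and count_ready are names chosen for this
example. It only shows that one epoll_wait call reports just the descriptors
that are ready.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 32

/* Registers NPIPES pipe read ends, makes `active` of them readable, and
 * returns how many events one non-blocking epoll_wait reports (-1 on
 * set-up failure). */
int count_ready(int active)
{
    int rd[NPIPES], wr[NPIPES], made = 0, ok = 1, n = -1;
    struct epoll_event out[NPIPES];
    int epfd = epoll_create(NPIPES);

    if (epfd < 0)
        return -1;
    while (ok && made < NPIPES) {
        int p[2];
        struct epoll_event ev = { 0 };
        if (pipe(p) != 0) {
            ok = 0;
            break;
        }
        rd[made] = p[0];
        wr[made] = p[1];
        made++;
        ev.events = EPOLLIN;
        ev.data.fd = p[0];
        ok = epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0;
    }
    /* Make only the first `active` pipes readable. */
    for (int i = 0; ok && i < active && i < made; i++)
        ok = write(wr[i], "x", 1) == 1;
    if (ok)
        n = epoll_wait(epfd, out, NPIPES, 0);
    for (int i = 0; i < made; i++) {
        close(rd[i]);
        close(wr[i]);
    }
    close(epfd);
    return n;
}
```

     With 32 descriptors registered and 3 of them readable, the caller
receives 3 events and never has to look at the other 29.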