cgroup Analysis (repost) - 瀚海书香 - ChinaUnix Blog

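cgroup Analysis (repost)

Category: LINUX

2014-10-11 09:07:56

Original article: cgroup Analysis (repost). Author: crazytyt

Analysis of the Linux cgroup Mechanism: The Framework

Source: ChinaUnix Blog　Date: 2008.12.23 16:15

------------------------------------------
This article is original to this site. Reposting is welcome!
When reposting, please credit the source: http://ericxiao.cublog.cn/
------------------------------------------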
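1: Preface
For a while I had been writing an operating system and studying the Solaris kernel, so I paid little attention to the Linux kernel. Not long ago I unexpectedly received an interview invitation from Fujitsu, which I declined for a number of reasons. The job requirements left a deep impression on me, though, and one of them was the cgroup mechanism. cgroup is not new to me: I had seen its patches on LKML, and it was the subject of a dedicated talk at the 2008 AKA conference. But I had never studied the code in depth. I plan to spend some time going through the features recently added to the kernel, and cgroup comes first.
cgroup appeared in recent Linux kernels. It provides a mechanism for controlling the resource usage of a process and of the children it later creates. I will not say much here about what cgroup is for or how to use it. This article analyzes the implementation of the cgroup mechanism starting from the Linux kernel source. This part covers the cgroup framework; later parts will analyze several important subsystems in the kernel in detail. For an introduction to cgroup and its usage, see linux-2.6.28-rc7/Documentation/cgroups/cgroups.txt. The source analysis is based on Linux kernel 2.6.28. The files analyzed are mainly linux-2.6.28-rc7/kernel/cgroup.c and linux-2.6.28-rc7/kernel/cgroup_debug.c.
2: Concepts in cgroup
Before diving into the cgroup code, let's go over a few of the concepts involved:
1: cgroup: short for control group, i.e. behavior control for a group of processes. For example, to limit the CPU usage of the /bin/sh process to 20%, we can create a cgroup whose CPU share is 20% and add the /bin/sh process to it. A cgroup can of course contain several processes.
2: subsystem: similar to a filter hook in netfilter. The CPU share above is one subsystem. In short, a subsystem is a module that can be added to or removed from cgroup. Wrapped by the cgroup framework, subsystems provide the various kinds of behavior control. Below, subsystem is abbreviated as subsys.
3: hierarchy: a set of cgroups. It can be thought of as the root of a cgroup tree, with the cgroups as its nodes. To continue the example above: the whole CPU is 100%, which is the root, i.e. the hierarchy. cgroup A is given 20% of the CPU and cgroup B 50%; A and B are child cgroups below the root.
3: Important data structures in cgroup
Let's first look at how cgroup is used. Here is an example:
[root@localhost cgroups]# mount -t cgroup cgroup -o debug /dev/cgroup
[root@localhost cgroups]# mkdir /dev/cgroup/eric_test
This is a test using the debug subsystem. /dev/cgroup is the mount point of the debug subsys, i.e. the hierarchy discussed above. A cgroup named eric_test is then created under that hierarchy.
In the kernel source, the mount directory, i.e. the root of a cgroup tree, is represented by struct cgroupfs_root, and a cgroup is represented by struct cgroup.
Let's look at these two structures in turn. struct cgroupfs_root is defined as follows: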
struct cgroupfs_root {
    /* superblock of the cgroup filesystem */
    struct super_block *sb;

    /*
     * The bitmask of subsystems intended to be attached to this
     * hierarchy
     */
    unsigned long subsys_bits;

    /* The bitmask of subsystems currently attached to this hierarchy */
    unsigned long actual_subsys_bits;

    /* A list running through the attached subsystems */
    struct list_head subsys_list;

    /* The root cgroup for this hierarchy */
    struct cgroup top_cgroup;

    /* Tracks how many cgroups are currently defined in hierarchy.*/
    int number_of_cgroups;

    /* A list running through the mounted hierarchies */
    /* (links this root into the global list roots) */
    struct list_head root_list;

    /* Hierarchy-specific flags */
    unsigned long flags;

    /* The path to use for release notifications. */
    char release_agent_path[PATH_MAX];
};
Note that cgroupfs_root has a member of type struct cgroup: top_cgroup. That is, every mount point has one top-level cgroup, and the cgroups created with mkdir are its descendants.
The meaning of the release_agent_path[] member will be analyzed in detail later.

struct cgroup is defined as follows:
struct cgroup {
    /* cgroup flags */
    unsigned long flags;       /* "unsigned long" so bitops work */

    /* count users of this cgroup. >0 means busy, but doesn't
     * necessarily indicate the number of tasks in the
     * cgroup */
    atomic_t count;

    /*
     * We link our 'sibling' struct into our parent's 'children'.
     * Our children link their 'sibling' into our 'children'.
     */
    struct list_head sibling;   /* my parent's children */
    struct list_head children;  /* my children */

    struct cgroup *parent;      /* my parent */
    struct dentry *dentry;      /* cgroup fs entry */

    /* Private pointers for each registered subsystem */
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];

    /* the cgroupfs_root this cgroup belongs to */
    struct cgroupfs_root *root;
    /* the topmost cgroup under the mount directory */
    struct cgroup *top_cgroup;
    ……
    ……
};
The structure is not listed in full above. The remaining members will be analyzed when we meet them.
In fact, struct cgroupfs_root and struct cgroup together express a spatial hierarchy, which corresponds to the file view under the mount point.

As said above, a cgroup expresses behavior control for processes, so a subsys must know which cgroup a process is in.
There is therefore a mapping between struct task_struct and cgroup.
cgroup adds two members to struct task_struct, as shown below:
struct task_struct {
    ……
    ……
#ifdef CONFIG_CGROUPS
    /* Control Group info protected by css_set_lock */
    struct css_set *cgroups;
    /* cg_list protected by css_set_lock and tsk->alloc_lock */
    struct list_head cg_list;
#endif
    ……
    ……
};
Note that struct task_struct has no member that points directly at a cgroup. It points at a css_set instead. The css_set structure is as follows:
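struct css_set {
    /* reference count of this css_set */
    atomic_t refcount;
    /* hash node that links this css_set into css_set_table[] */
    struct hlist_node hlist;
    /* list of the tasks associated with this css_set */
    struct list_head tasks;
    /* list of the cg_cgroup_link objects associated with this css_set */
    struct list_head cg_links;
    /* a set of subsystem states, created by subsys->create() */
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
};
So how do we get from a css_set to a cgroup? Let's look at one more helper structure, struct cg_cgroup_link. It is defined as follows: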
struct cg_cgroup_link {
    /*
     * List running through cg_cgroup_links associated with a
     * cgroup, anchored on cgroup->css_sets
     */
    struct list_head cgrp_link_list;
    /*
     * List running through cg_cgroup_links pointing at a
     * single css_set object, anchored on css_set->cg_links
     */
    struct list_head cg_link_list;
    struct css_set *cg;
};
As shown above, cgrp_link_list is linked into cgroup->css_sets, and cg_link_list is linked into css_set->cg_links.
The cg member points to the css_set that cg_link_list is linked into.
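
The role of cg_cgroup_link as a connector between the two sides can be tried out in user space. What follows is only a minimal sketch, not kernel code: the demo_* types and functions are hypothetical stand-ins, and plain singly linked lists replace the kernel's list_head. It shows one link object being chained on a cgroup and on a css_set at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct demo_css_set;

struct demo_link {
    struct demo_link *next_in_cgroup;   /* plays the role of cgrp_link_list */
    struct demo_link *next_in_css_set;  /* plays the role of cg_link_list */
    struct demo_css_set *cg;            /* the css_set this link belongs to */
};

struct demo_cgroup {
    const char *name;
    struct demo_link *css_sets;         /* links anchored on the cgroup */
};

struct demo_css_set {
    int id;
    struct demo_link *cg_links;         /* links anchored on the css_set */
};

/* Tie one css_set to one cgroup using a caller-provided link object. */
static void demo_link_up(struct demo_link *link, struct demo_css_set *cg,
                         struct demo_cgroup *cgrp)
{
    link->cg = cg;
    link->next_in_cgroup = cgrp->css_sets;
    cgrp->css_sets = link;
    link->next_in_css_set = cg->cg_links;
    cg->cg_links = link;
}

/* Count the css_sets reachable from a cgroup by walking its links. */
static int demo_count_css_sets(const struct demo_cgroup *cgrp)
{
    int n = 0;
    const struct demo_link *l;

    for (l = cgrp->css_sets; l != NULL; l = l->next_in_cgroup)
        n++;
    return n;
}
```

Starting from a cgroup, walking next_in_cgroup and reading link->cg visits every css_set attached to it, which is how the tasks of a cgroup can eventually be found.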

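The data structures analyzed above are related in a complex and tightly coupled way. The figure below shows how they are connected:

Note css_set_table[] in the figure above. It is a hash table that stores struct css_set objects. Its hash function is css_set_hash(). All colliding entries are chained on the hlist of the corresponding bucket.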

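4: cgroup initialization
cgroup initialization has two parts, cgroup_init_early() and cgroup_init(). The first runs early in boot, the second once system initialization is complete. It is split this way because some subsystems must be initialized right at system startup.

4.1: cgroup_init_early()
First, the code of cgroup_init_early():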
int __init cgroup_init_early(void)
{
    int i;

    /* initialize the global init_css_set */
    atomic_set(&init_css_set.refcount, 1);
    INIT_LIST_HEAD(&init_css_set.cg_links);
    INIT_LIST_HEAD(&init_css_set.tasks);
    INIT_HLIST_NODE(&init_css_set.hlist);
    /* css_set_count: number of struct css_set objects in the system */
    css_set_count = 1;
    /* initialize the global rootnode */
    init_cgroup_root(&rootnode);
    /* add rootnode to the roots list */
    list_add(&rootnode.root_list, &roots);
    root_count = 1;
    /* make the cgroups pointer of the initial task point to init_css_set */
    init_task.cgroups = &init_css_set;
    /* link init_css_set and rootnode.top_cgroup together */
    init_css_set_link.cg = &init_css_set;
    list_add(&init_css_set_link.cgrp_link_list,
             &rootnode.top_cgroup.css_sets);
    list_add(&init_css_set_link.cg_link_list,
             &init_css_set.cg_links);
    /* initialize css_set_table[] */
    for (i = 0; i < CSS_SET_TABLE_SIZE; i++)
        INIT_HLIST_HEAD(&css_set_table[i]);
    /* initialize the subsystems that need to be initialized at boot */
    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
        struct cgroup_subsys *ss = subsys[i];

        BUG_ON(!ss->name);
        BUG_ON(strlen(ss->name) > MAX_CGROUP_TYPE_NAMELEN);
        BUG_ON(!ss->create);
        BUG_ON(!ss->destroy);
        if (ss->subsys_id != i) {
            printk(KERN_ERR "cgroup: Subsys %s id == %d\n",
                   ss->name, ss->subsys_id);
            BUG();
        }

        if (ss->early_init)
            cgroup_init_subsys(ss);
    }
    return 0;
}
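This mainly initializes init_task.cgroups. Along with it, rootnode and init_css_set are initialized, and then init_css_set_link is set up to link rootnode.top_cgroup and init_css_set together.
Next, the hash table css_set_table[] is initialized, and the subsystems that must be initialized at early boot are initialized.
As the code above shows, the cgroup subsystems of the system are stored in subsys[], which is defined as follows:
static struct cgroup_subsys *subsys[] = {
#include <linux/cgroup_subsys.h>
};
That is, all subsystems are defined in linux/cgroup_subsys.h.

With the data structures analyzed earlier in mind, this code should not be hard to follow. Let's now look at some of the important helper functions it calls.

init_cgroup_root() is as follows: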
static void init_cgroup_root(struct cgroupfs_root *root)
{
    struct cgroup *cgrp = &root->top_cgroup;

    INIT_LIST_HEAD(&root->subsys_list);
    INIT_LIST_HEAD(&root->root_list);
    root->number_of_cgroups = 1;
    cgrp->root = root;
    cgrp->top_cgroup = cgrp;
    init_cgroup_housekeeping(cgrp);
}
It first initializes the lists in root. Since root contains a top_cgroup, root->number_of_cgroups is set to 1. Then root->top_cgroup is initialized: root->top_cgroup.root points to root, and root->top_cgroup.top_cgroup points to itself, because root->top_cgroup is the first cgroup under the directory.
Finally, init_cgroup_housekeeping() initializes the lists and the read-write lock of the cgroup.

cgroup_init_subsys() is as follows:
static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
{
    struct cgroup_subsys_state *css;

    printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name);

    /* Create the top cgroup state for this subsystem */
    ss->root = &rootnode;
    css = ss->create(ss, dummytop);
    /* We don't handle early failures gracefully */
    BUG_ON(IS_ERR(css));
    init_cgroup_css(css, ss, dummytop);

    /* Update the init_css_set to contain a subsys
     * pointer to this state - since the subsystem is
     * newly registered, all tasks and hence the
     * init_css_set is in the subsystem's top cgroup. */
    init_css_set.subsys[ss->subsys_id] = dummytop->subsys[ss->subsys_id];

    need_forkexit_callback |= ss->fork || ss->exit;
    need_mm_owner_callback |= !!ss->mm_owner_changed;

    /* At system boot, before all subsystems have been
     * registered, no tasks have been forked, so we don't
     * need to invoke fork callbacks here. */
    BUG_ON(!list_empty(&init_task.tasks));

    ss->active = 1;
}
dummytop is defined as follows:
#define dummytop (&rootnode.top_cgroup)
This function does the following:
1): points the root of every subsys being registered at rootnode.
2): calls subsys->create() to produce a cgroup_subsys_state.
3): calls init_cgroup_css() to set dummytop->subsys[ss->subsys_id] to the cgroup_subsys_state produced by ss->create().
4): updates the corresponding entry of init_css_set.subsys[].
5): sets ss->active to 1, which marks the subsys as initialized.

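4.2: cgroup_init()
cgroup_init() is the second phase of cgroup initialization. The code is as follows:
int __init cgroup_init(void)
{
    int err;
    int i;
    struct hlist_head *hhead;

    err = bdi_init(&cgroup_backing_dev_info);
    if (err)
        return err;
    /* initialize the remaining subsystems (those not initialized at boot) */
    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
        struct cgroup_subsys *ss = subsys[i];
        if (!ss->early_init)
            cgroup_init_subsys(ss);
    }

    /* Add init_css_set to the hash table */
    hhead = css_set_hash(init_css_set.subsys);
    hlist_add_head(&init_css_set.hlist, hhead);
    /* register the cgroup filesystem */
    err = register_filesystem(&cgroup_fs_type);
    if (err < 0)
        goto out;
    /* create a file named cgroups in the root of the proc filesystem */
    proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);

out:
    if (err)
        bdi_destroy(&cgroup_backing_dev_info);

    return err;
}
This function is fairly simple. First it initializes the remaining subsystems. Then it adds init_css_set to the hash table css_set_table[]. In the code above, css_set_hash() is the hash function of css_set_table: it uses css_set->subsys as the hash key to find the matching bucket in css_set_table[], and hlist_add_head() then adds init_css_set to that bucket's chain.
Next, it registers the cgroup filesystem. This is the filesystem that must be mounted in order to use cgroup from user space.
Finally, it creates a file named cgroups in the root of proc, which is used to observe the state of cgroup from user space.

After the two phases of cgroup initialization, init_css_set, rootnode and subsys are all initialized. They look complicated, but they only represent the initial state of cgroup. For example, if subsys->root equals &rootnode, the subsys is not yet attached to any other hierarchy.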
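5: The cgroup association between parent and child processes
In the code above, init_task.cgroups was set to &init_css_set. init_task is the first process in the system, and all other processes descend from it. So what effect does init_task.cgroups have on the child processes created after it? That is the question analyzed next.
5.1: Parent/child cgroup association at process creation
When a process is created, the path is do_fork() -> copy_process(), which contains the following fragment:
static struct task_struct *copy_process(unsigned long clone_flags,
                                        unsigned long stack_start,
                                        struct pt_regs *regs,
                                        unsigned long stack_size,
                                        int __user *child_tidptr,
                                        struct pid *pid,
                                        int trace)
{
    ……
    ……
    cgroup_fork(p);
    ……
    cgroup_fork_callbacks(p);
    ……
    cgroup_post_fork(p);
    ……
}
The fragment above shows the cgroup-related functions called when a new process is created. They are analyzed one by one below: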
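void cgroup_fork(struct task_struct *child)
{
    task_lock(current);
    child->cgroups = current->cgroups;
    get_css_set(child->cgroups);
    task_unlock(current);
    INIT_LIST_HEAD(&child->cg_list);
}
As the code shows, the child and the parent point to the same cgroups (the same css_set). Since this adds one more user, get_css_set() is called to increment its reference count. Finally, the child->cg_list list is initialized.
As the comment in the source points out, this raises a question: dup_task_struct() already copied the parent's cgroups pointer when it created the child's struct task_struct, so why is it assigned again here? The reason is that dup_task_struct() holds no protecting lock, and this is a race: cgroup_attach_task() may change a process's cgroups pointer, so the pointer copied in dup_task_struct() may no longer be valid, and incrementing the reference count through an invalid pointer would be an error. That is why this function does the assignment under the lock, which keeps parent and child in sync.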

cgroup_fork_callbacks() is as follows:
void cgroup_fork_callbacks(struct task_struct *child)
{
    if (need_forkexit_callback) {
        int i;
        for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
            struct cgroup_subsys *ss = subsys[i];
            if (ss->fork)
                ss->fork(ss, child);
        }
    }
}
It mainly calls the tracking callback of each subsys, subsys->fork(), when a process is created.
First, let's trace the variable need_forkexit_callback. It appears in the following fragment:
static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
{
    ……
    need_forkexit_callback |= ss->fork || ss->exit;
    ……
}
From this code we can see that if any subsys defines a fork or exit function, need_forkexit_callback is set to 1.
Back in cgroup_fork_callbacks(): the new process goes through this call for every subsys that defines fork, even when the process is not in that subsys.

cgroup_post_fork() is shown below:
void cgroup_post_fork(struct task_struct *child)
{
    if (use_task_css_set_links) {
        write_lock(&css_set_lock);
        if (list_empty(&child->cg_list))
            list_add(&child->cg_list, &child->cgroups->tasks);
        write_unlock(&css_set_lock);
    }
}
When use_task_css_set_links is 1, the child is linked into the tasks list of the css_set it points to.
So when is use_task_css_set_links set to 1? In practice it is set once you add a process to a cgroup.
For example, in the example we used earlier:
echo $$ > /dev/cgroup/eric_test/tasks
This operation sets use_task_css_set_links to 1. We will analyze the process in detail later.

5.2: What happens when a child process exits
When a child process exits, the path is:
do_exit() -> cgroup_exit().
cgroup_exit() is as follows:
void cgroup_exit(struct task_struct *tsk, int run_callbacks)
{
    int i;
    struct css_set *cg;

    if (run_callbacks && need_forkexit_callback) {
        for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
            struct cgroup_subsys *ss = subsys[i];
            if (ss->exit)
                ss->exit(ss, tsk);
        }
    }

    /*
     * Unlink from the css_set task list if necessary.
     * Optimistically check cg_list before taking
     * css_set_lock
     */
    if (!list_empty(&tsk->cg_list)) {
        write_lock(&css_set_lock);
        if (!list_empty(&tsk->cg_list))
            list_del(&tsk->cg_list);
        write_unlock(&css_set_lock);
    }

    /* Reassign the task to the init_css_set. */
    task_lock(tsk);
    cg = tsk->cgroups;
    tsk->cgroups = &init_css_set;
    task_unlock(tsk);
    if (cg)
        put_css_set_taskexit(cg);
}
The logic of this function is fairly clear. First, if it is called with run_callbacks set to 1 and some subsys defines an exit operation, the exit operation of that subsys is called.
Then tsk->cg_list is unlinked, which removes the task from the tasks list of the css_set it points to.
Finally, the task's current cgroups pointer is dropped and pointed at init_css_set, i.e. it is restored to the initial state, and the reference count of the old css_set is decremented.
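
The reference-count life cycle across fork and exit can be illustrated with a small user-space model. This is a sketch under simplifying assumptions: the demo_* names are hypothetical, the counter is a plain int, and the locking the kernel does (task_lock, css_set_lock) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

struct demo_css_set { int refcount; };
struct demo_task { struct demo_css_set *cgroups; };

/* Stand-in for init_css_set. */
static struct demo_css_set demo_init_set = { 1 };

/* The child shares the parent's css_set and takes one reference on it. */
static void demo_fork(const struct demo_task *parent, struct demo_task *child)
{
    child->cgroups = parent->cgroups;
    child->cgroups->refcount++;
}

/*
 * An exiting task is pointed back at the initial set and drops its
 * reference on the old one. Returns 1 when the old set has no users left.
 */
static int demo_exit(struct demo_task *tsk)
{
    struct demo_css_set *old = tsk->cgroups;

    tsk->cgroups = &demo_init_set;
    return --old->refcount == 0;
}
```

With a set that starts at refcount 1, a fork raises it to 2, the child's exit brings it back to 1, and the exit of the last user reports that the set can be released.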

Within this function, let's trace put_css_set_taskexit(). The code is as follows:
static inline void put_css_set_taskexit(struct css_set *cg)
{
    __put_css_set(cg, 1);
}

Following it into __put_css_set():
static void __put_css_set(struct css_set *cg, int taskexit)
{
    int i;
    /*
     * Ensure that the refcount doesn't hit zero while any readers
     * can see it. Similar to atomic_dec_and_lock(), but for an
     * rwlock
     */
    if (atomic_add_unless(&cg->refcount, -1, 1))
        return;
    write_lock(&css_set_lock);
    if (!atomic_dec_and_test(&cg->refcount)) {
        write_unlock(&css_set_lock);
        return;
    }
    unlink_css_set(cg);
    write_unlock(&css_set_lock);

    rcu_read_lock();
    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
        struct cgroup *cgrp = cg->subsys[i]->cgroup;
        if (atomic_dec_and_test(&cgrp->count) &&
            notify_on_release(cgrp)) {
            if (taskexit)
                set_bit(CGRP_RELEASABLE, &cgrp->flags);
            check_for_release(cgrp);
        }
    }
    rcu_read_unlock();
    kfree(cg);
}
atomic_add_unless(v, a, u) adds a to v and returns non-zero if the value of v is not u. If the value of v equals u, it leaves v unchanged and returns 0.
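
The semantics of atomic_add_unless() can be reproduced in user space with C11 atomics. The sketch below is an illustration only, not the kernel implementation: demo_add_unless is a hypothetical name, and it uses a compare-and-exchange loop from <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add a to *v unless *v == u. Returns true if the add happened. */
static bool demo_add_unless(atomic_int *v, int a, int u)
{
    int old = atomic_load(v);

    while (old != u) {
        /* On failure, old is refreshed with the current value of *v. */
        if (atomic_compare_exchange_weak(v, &old, old + a))
            return true;
    }
    return false;
}
```

Called as demo_add_unless(&ref, -1, 1), it decrements the counter on the fast path and refuses to take it from 1 to 0, which is the case __put_css_set() handles separately while holding css_set_lock.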
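So this function first decrements the reference count of the css_set. If the count is 1, i.e. this is the last reference, the css_set is freed while css_set_lock is held for writing. To free a css_set, the lists hanging off it must be released first, and then the memory of the css_set structure itself.
The lists attached to the css_set are released in unlink_css_set(). The code is as follows: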
static void unlink_css_set(struct css_set *cg)
{
    struct cg_cgroup_link *link;
    struct cg_cgroup_link *saved_link;

    hlist_del(&cg->hlist);
    css_set_count--;

    list_for_each_entry_safe(link, saved_link, &cg->cg_links,
                             cg_link_list) {
        list_del(&link->cg_link_list);
        list_del(&link->cgrp_link_list);
        kfree(link);
    }
}
It first unlinks cg->hlist, i.e. removes the css_set from css_set_table[]. Then it decrements css_set_count. Finally, it walks the cg_cgroup_link objects associated with the css_set and deletes them.
__put_css_set() also involves the notify_on_release operation. That process is analyzed in detail later, so we set it aside for now.
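6: Mounting the cgroup filesystem
The cgroup filesystem is defined as follows:
static struct file_system_type cgroup_fs_type = {
    .name = "cgroup",
    .get_sb = cgroup_get_sb,
    .kill_sb = cgroup_kill_sb,
};
As analyzed in our earlier series on Linux filesystems, when a filesystem is mounted the flow enters file_system_type.get_sb(), which here is cgroup_get_sb(). The code is long, so it is analyzed in pieces: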
static int cgroup_get_sb(struct file_system_type *fs_type,
                         int flags, const char *unused_dev_name,
                         void *data, struct vfsmount *mnt)
{
    struct cgroup_sb_opts opts;
    int ret = 0;
    struct super_block *sb;
    struct cgroupfs_root *root;
    struct list_head tmp_cg_links;

    /* First find the desired set of subsystems */
    ret = parse_cgroupfs_options(data, &opts);
    if (ret) {
        if (opts.release_agent)
            kfree(opts.release_agent);
        return ret;
    }
This part parses the mount options and stores the result in opts. opts.subsys_bits is the bitmask of the subsystems to attach, opts.flags holds the mount flags, and opts.release_agent is the release_agent path that was given.

    /* allocate and initialize a cgroupfs_root */
    root = kzalloc(sizeof(*root), GFP_KERNEL);
    if (!root) {
        if (opts.release_agent)
            kfree(opts.release_agent);
        return -ENOMEM;
    }

    init_cgroup_root(root);
    /* root->subsys_bits: the subsystems to attach to this hierarchy */
    root->subsys_bits = opts.subsys_bits;
    root->flags = opts.flags;
    /* if a release_agent option was given, copy it into root */
    if (opts.release_agent) {
        strcpy(root->release_agent_path, opts.release_agent);
        kfree(opts.release_agent);
    }

    /* set up a super block */
    sb = sget(fs_type, cgroup_test_super, cgroup_set_super, root);

    /* on error */
    if (IS_ERR(sb)) {
        kfree(root);
        return PTR_ERR(sb);
    }
This part mainly allocates and initializes a cgroupfs_root structure. The helper init_cgroup_root() was analyzed earlier and is not repeated here. The rest of the initialization sets the bitmask of associated subsystems, the mount flags and the release_agent path. Then sget() is called to produce a super_block structure. cgroup_test_super is used to decide whether an identical cgroupfs_root already exists in the system, and cgroup_set_super is used to initialize the super_block.
In cgroup_set_super(), sb->s_fs_info is pointed at the cgroupfs_root, and cgroupfs_root.sb is pointed at the new super_block.
Likewise, in cgroup_test_super(), if the cgroupfs_root associated with an existing super_block has the same subsys_bits and flags as the current cgroupfs_root, it is treated as the same super_block, because the mount options are identical.
Here is an example of a duplicate super_block:
[root@localhost ~]# mount -t cgroup cgroup -o debug /dev/cgroup/
[root@localhost ~]# mount -t cgroup cgroup -o debug /dev/eric_cgroup/
In this example, the mount on /dev/eric_cgroup finds an identical super_block, so the two mounts behave the same. The vfsmounts of the two mount points both resolve to the same super_block, which means that any operation on one directory also shows up in the other.

    /* duplicate mount */
    if (sb->s_fs_info != root) {
        /* Reusing an existing superblock */
        BUG_ON(sb->s_root == NULL);
        kfree(root);
        root = NULL;
    } else {
        /* New superblock */
        struct cgroup *cgrp = &root->top_cgroup;
        struct inode *inode;
        int i;

        BUG_ON(sb->s_root != NULL);
        /* set up the dentry and inode of the super_block */
        ret = cgroup_get_rootdir(sb);
        if (ret)
            goto drop_new_super;
        inode = sb->s_root->d_inode;

        mutex_lock(&inode->i_mutex);
        mutex_lock(&cgroup_mutex);

        /*
         * We're accessing css_set_count without locking
         * css_set_lock here, but that's OK - it can only be
         * increased by someone holding cgroup_lock, and
         * that's us. The worst that can happen is that we
         * have some link structures left over
         */
        /* allocate css_set_count cg_cgroup_link objects and chain them on tmp_cg_links */
        ret = allocate_cg_links(css_set_count, &tmp_cg_links);
        if (ret) {
            mutex_unlock(&cgroup_mutex);
            mutex_unlock(&inode->i_mutex);
            goto drop_new_super;
        }
        /* bind the subsystems to the hierarchy */
        ret = rebind_subsystems(root, root->subsys_bits);
        if (ret == -EBUSY) {
            mutex_unlock(&cgroup_mutex);
            mutex_unlock(&inode->i_mutex);
            goto drop_new_super;
        }

        /* EBUSY should be the only error here */
        BUG_ON(ret);
        /* add root to the roots list and increment root_count */
        list_add(&root->root_list, &roots);
        root_count++;

        /* point d_fsdata of the mount root dentry at root->top_cgroup */
        /* point root->top_cgroup.dentry at the mount root dentry */
        sb->s_root->d_fsdata = &root->top_cgroup;
        root->top_cgroup.dentry = sb->s_root;

        /* Link the top cgroup in this hierarchy into all
         * the css_set objects */
        write_lock(&css_set_lock);
        for (i = 0; i < CSS_SET_TABLE_SIZE; i++) {
            struct hlist_head *hhead = &css_set_table[i];
            struct hlist_node *node;
            struct css_set *cg;

            hlist_for_each_entry(cg, node, hhead, hlist) {
                struct cg_cgroup_link *link;

                BUG_ON(list_empty(&tmp_cg_links));
                link = list_entry(tmp_cg_links.next,
                                  struct cg_cgroup_link,
                                  cgrp_link_list);
                list_del(&link->cgrp_link_list);
                link->cg = cg;
                list_add(&link->cgrp_link_list,
                         &root->top_cgroup.css_sets);
                list_add(&link->cg_link_list, &cg->cg_links);
            }
        }
        write_unlock(&css_set_lock);
        /* free the leftover entries of tmp_cg_links */
        free_cg_links(&tmp_cg_links);

        BUG_ON(!list_empty(&cgrp->sibling));
        BUG_ON(!list_empty(&cgrp->children));
        BUG_ON(root->number_of_cgroups != 1);
        /* create files under root->top_cgroup: those common to all cgroups and those private to each subsys */
        cgroup_populate_dir(cgrp);
        mutex_unlock(&inode->i_mutex);
        mutex_unlock(&cgroup_mutex);
    }
    /* associate the vfsmount with the super_block */
    return simple_set_mnt(mnt, sb);

drop_new_super:
    up_write(&sb->s_umount);
    deactivate_super(sb);
    free_cg_links(&tmp_cg_links);
    return ret;
}
This part first checks whether the super_block that was found already existed. If it did, there is no need to initialize another cgroupfs_root, so the structure allocated earlier is freed, simple_set_mnt() associates the super_block with the vfsmount, and the function returns.
If the super_block is new, initialization of the cgroupfs_root has to continue.
First, cgroup_get_rootdir() sets up the dentry and inode of the super_block.
Then, rebind_subsystems() binds the subsystems that should be attached to the hierarchy to root->top_cgroup.
Finally, every css_set is linked to root->top_cgroup, so that all processes can be found from root->top_cgroup. cgroup_populate_dir() then creates some files under the mount directory, and simple_set_mnt() associates the super_block with the vfsmount before returning.

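The flow of this function is fairly simple. Let's now analyze the important helper functions it uses:
6.1: parse_cgroupfs_options()
This function parses the mount options. The code is as follows: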
static int parse_cgroupfs_options(char *data,
                                  struct cgroup_sb_opts *opts)
{
    /*
     * If no options were given at mount time, o is set to "all",
     * which means all subsystems are attached.
     */
    char *token, *o = data ?: "all";

    opts->subsys_bits = 0;
    opts->flags = 0;
    opts->release_agent = NULL;

    /* options are separated by "," */
    while ((token = strsep(&o, ",")) != NULL) {
        if (!*token)
            return -EINVAL;
        /* "all" means: attach every subsys */
        if (!strcmp(token, "all")) {
            /* Add all non-disabled subsystems */
            int i;
            opts->subsys_bits = 0;
            for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
                struct cgroup_subsys *ss = subsys[i];
                if (!ss->disabled)
                    opts->subsys_bits |= 1ul << i;
            }
        }
        /* "noprefix" sets the ROOT_NOPREFIX flag */
        /* with noprefix, files created by a subsys do not get the subsys name as a prefix */
        else if (!strcmp(token, "noprefix")) {
            set_bit(ROOT_NOPREFIX, &opts->flags);
        }
        /* "release_agent=": allocate opts->release_agent and copy the value into it */
        else if (!strncmp(token, "release_agent=", 14)) {
            /* Specifying two release agents is forbidden */
            if (opts->release_agent)
                return -EINVAL;
            opts->release_agent = kzalloc(PATH_MAX, GFP_KERNEL);
            if (!opts->release_agent)
                return -ENOMEM;
            strncpy(opts->release_agent, token + 14, PATH_MAX - 1);
            opts->release_agent[PATH_MAX - 1] = 0;
        }
        /*
         * Otherwise the token is treated as a subsys name: look it up
         * in subsys[] and set the matching bit in opts->subsys_bits.
         */
        else {
            struct cgroup_subsys *ss;
            int i;
            for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
                ss = subsys[i];
                if (!strcmp(token, ss->name)) {
                    if (!ss->disabled)
                        set_bit(i, &opts->subsys_bits);
                    break;
                }
            }
            if (i == CGROUP_SUBSYS_COUNT)
                return -ENOENT;
        }
    }

    /* We can't have an empty hierarchy */
    /* no subsys attached: error */
    if (!opts->subsys_bits)
        return -EINVAL;

    return 0;
}
With the comments added to the code it should be easy to follow, so no further analysis is given here.
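
For readers who want to experiment with this parsing logic outside the kernel, here is a reduced user-space sketch. It is an approximation under stated assumptions: the subsystem table demo_subsys_names is made up for the example, the noprefix and release_agent options are left out, every error is reported as -1 instead of a specific errno value, and tokens are split with strchr() rather than strsep().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SUBSYS_COUNT 3

/* Hypothetical subsystem table; the index is the bit position. */
static const char *const demo_subsys_names[DEMO_SUBSYS_COUNT] = {
    "cpu", "debug", "memory"
};

/*
 * Parse a comma-separated option string into a subsystem bitmask.
 * Returns 0 on success, -1 on an empty token, an unknown name,
 * or an empty resulting mask.
 */
static int demo_parse_options(const char *data, unsigned long *bits)
{
    const char *p = data ? data : "all";

    *bits = 0;
    for (;;) {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        int i;

        if (len == 0)
            return -1;
        if (len == 3 && strncmp(p, "all", 3) == 0) {
            *bits = (1UL << DEMO_SUBSYS_COUNT) - 1;
        } else {
            for (i = 0; i < DEMO_SUBSYS_COUNT; i++) {
                if (strlen(demo_subsys_names[i]) == len &&
                    strncmp(p, demo_subsys_names[i], len) == 0) {
                    *bits |= 1UL << i;
                    break;
                }
            }
            if (i == DEMO_SUBSYS_COUNT)
                return -1;
        }
        if (!end)
            break;
        p = end + 1;
    }
    return *bits ? 0 : -1;
}
```

With this table, "debug" yields the mask 0x2, "cpu,memory" yields 0x5, and a NULL option string behaves like "all".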

6.2: rebind_subsystems()
rebind_subsystems() binds a cgroupfs_root and subsystems together. The code is as follows:
static int rebind_subsystems(struct cgroupfs_root *root,
                             unsigned long final_bits)
{
    unsigned long added_bits, removed_bits;
    struct cgroup *cgrp = &root->top_cgroup;
    int i;

    /*
     * root->actual_subsys_bits is the bitmask of the subsystems
     * currently attached to root. A bit set there but not in
     * final_bits means this is a remount and the old subsys must be
     * removed. A bit set in final_bits but not in
     * root->actual_subsys_bits means the subsys must be added.
     */
    removed_bits = root->actual_subsys_bits & ~final_bits;
    added_bits = final_bits & ~root->actual_subsys_bits;
    /* Check that any added subsystems are currently free */
    /*
     * Fail if a subsys to be attached is already in another hierarchy.
     * ss->root != &rootnode means ss is already linked to another
     * cgroupfs_root.
     */
    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
        unsigned long bit = 1UL << i;
        struct cgroup_subsys *ss = subsys[i];
        if (!(bit & added_bits))
            continue;
        if (ss->root != &rootnode) {
            /* Subsystem isn't free */
            return -EBUSY;
        }
    }

    /* Currently we don't handle adding/removing subsystems when
     * any child cgroups exist. This is theoretically supportable
     * but involves complex error handling, so it's being left until
     * later */
    /*
     * A non-empty root->top_cgroup.children list means the hierarchy
     * still has other cgroups and cannot be remounted. (For a new
     * mount, children was set to empty when root->top_cgroup was
     * initialized.)
     */
    if (!list_empty(&cgrp->children))
        return -EBUSY;

    /* Process each subsystem */
    for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
        struct cgroup_subsys *ss = subsys[i];
        unsigned long bit = 1UL << i;
        /* adding a subsys */
        if (bit & added_bits) {
            /* We're binding this subsystem to this hierarchy */
            /*
             * When adding, point cgrp->subsys[i] at dummytop->subsys[i],
             * make that state's cgroup pointer and ss->root refer to
             * the new hierarchy, and finally call subsys->bind().
             */
            BUG_ON(cgrp->subsys[i]);
            BUG_ON(!dummytop->subsys[i]);
            BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
            cgrp->subsys[i] = dummytop->subsys[i];
            cgrp->subsys[i]->cgroup = cgrp;
            list_add(&ss->sibling, &root->subsys_list);
            rcu_assign_pointer(ss->root, root);
            if (ss->bind)
                ss->bind(ss, cgrp);

        }
        /* removing a subsys */
        else if (bit & removed_bits) {
            /*
             * When removing, hand the cgroup_subsys_state back to
             * dummytop and rebind the subsys to it.
             */
            /* We're removing this subsystem */
            BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
            BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
            if (ss->bind)
                ss->bind(ss, dummytop);
            dummytop->subsys[i]->cgroup = dummytop;
            cgrp->subsys[i] = NULL;
            rcu_assign_pointer(subsys[i]->root, &rootnode);
            list_del(&ss->sibling);
        } else if (bit & final_bits) {
            /* Subsystem state should already exist */
            BUG_ON(!cgrp->subsys[i]);
        } else {
            /* Subsystem state shouldn't exist */
            BUG_ON(cgrp->subsys[i]);
        }
    }
    /* update the bitmasks of root */
    root->subsys_bits = root->actual_subsys_bits = final_bits;
    synchronize_rcu();

    return 0;
}
This function also shows that rootnode serves as a reference point: it is used to tell whether a subsys is still in its initial state.
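
The two bitmask expressions at the top of rebind_subsystems() are easy to check by hand. The following stand-alone sketch (demo_rebind and demo_diff_bits are hypothetical names, not kernel identifiers) computes the same two masks for a given pair of current and requested bitmaps.

```c
#include <assert.h>

struct demo_rebind {
    unsigned long added;    /* bits in final_bits but not in actual */
    unsigned long removed;  /* bits in actual but not in final_bits */
};

/* Compute which subsystem bits a remount would add and remove. */
static struct demo_rebind demo_diff_bits(unsigned long actual,
                                         unsigned long final_bits)
{
    struct demo_rebind r;

    r.removed = actual & ~final_bits;
    r.added = final_bits & ~actual;
    return r;
}
```

For example, with subsystems 1 and 2 currently attached (0x6) and subsystems 0 and 1 requested (0x3), bit 2 ends up in removed, bit 0 in added, and bit 1 is left alone. On a fresh mount actual is 0, so every requested bit is an added bit.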

6.3: cgroup_populate_dir()
cgroup_populate_dir() creates the control files under the mount directory. The code is as follows:
static int cgroup_populate_dir(struct cgroup *cgrp)
{
    int err;
    struct cgroup_subsys *ss;

    /* First clear out any existing files */
    cgroup_clear_directory(cgrp->dentry);

    /* create the files described by files[] */
    err = cgroup_add_files(cgrp, NULL, files, ARRAY_SIZE(files));
    if (err < 0)
        return err;
    /* for the top-level cgroup, also create the file described by cft_release_agent */
    if (cgrp == cgrp->top_cgroup) {
        if ((err = cgroup_add_file(cgrp, NULL, &cft_release_agent)) < 0)
            return err;
    }

    /* call populate() for every subsys attached to cgrp->root */
    for_each_subsys(cgrp->root, ss) {
        if (ss->populate && (err = ss->populate(ss, cgrp)) < 0)
            return err;
    }

    return 0;
}
This function is fairly simple. Following it into cgroup_add_file():
int cgroup_add_file(struct cgroup *cgrp,
                    struct cgroup_subsys *subsys,
                    const struct cftype *cft)
{
    struct dentry *dir = cgrp->dentry;
    struct dentry *dentry;
    int error;

    char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 };
    /*
     * If a subsys is given and the ROOT_NOPREFIX flag is not set,
     * the file name is prefixed with the subsys name.
     */
    if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
        strcpy(name, subsys->name);
        strcat(name, ".");
    }
    /* append cft->name to the string in name */
    strcat(name, cft->name);
    BUG_ON(!mutex_is_locked(&dir->d_inode->i_mutex));
    /* look up the dentry called name in the cgroup's directory; create it if it does not exist */
    dentry = lookup_one_len(name, dir, strlen(name));
    if (!IS_ERR(dentry)) {
        /* create the inode of the file */
        error = cgroup_create_file(dentry, 0644 | S_IFREG,
                                   cgrp->root->sb);
        /* make dentry->d_fsdata point to the cftype the file represents */
        if (!error)
            dentry->d_fsdata = (void *)cft;
        dput(dentry);
    } else
        error = PTR_ERR(dentry);
    return error;
}
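
The naming rule used by cgroup_add_file() can be isolated in a few lines of user-space C. The sketch below is an illustration rather than the kernel code: demo_file_name is a hypothetical helper, the noprefix argument stands in for the ROOT_NOPREFIX test, and snprintf() with an explicit size check replaces the strcpy()/strcat() sequence.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<subsys>.<file>" when a subsys is given and noprefix is 0,
 * otherwise just "<file>". Returns 0 on success, -1 if buf is too small.
 */
static int demo_file_name(char *buf, size_t size, const char *subsys,
                          int noprefix, const char *file)
{
    int n;

    if (subsys && !noprefix)
        n = snprintf(buf, size, "%s.%s", subsys, file);
    else
        n = snprintf(buf, size, "%s", file);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

So a file named shares that belongs to a subsys named cpu becomes cpu.shares on a normal mount and plain shares on a noprefix mount, while files added with a NULL subsys never get a prefix.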

cgroup_create_file() is as follows:
static int cgroup_create_file(struct dentry *dentry, int mode,
                              struct super_block *sb)
{
    static struct dentry_operations cgroup_dops = {
        .d_iput = cgroup_diput,
    };

    struct inode *inode;

    if (!dentry)
        return -ENOENT;
    if (dentry->d_inode)
        return -EEXIST;
    /* allocate an inode */
    inode = cgroup_new_inode(mode, sb);
    if (!inode)
        return -ENOMEM;
    /* creating a directory */
    if (S_ISDIR(mode)) {
        inode->i_op = &cgroup_dir_inode_operations;
        inode->i_fop = &simple_dir_operations;

        /* start off with i_nlink == 2 (for "." entry) */
        inc_nlink(inode);

        /* start with the directory inode held, so that we can
         * populate it without racing with another mkdir */
        mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
    }
    /* creating a regular file */
    else if (S_ISREG(mode)) {
        inode->i_size = 0;
        inode->i_fop = &cgroup_file_operations;
    }
    dentry->d_op = &cgroup_dops;
    /* associate the dentry with the inode */
    d_instantiate(dentry, inode);
    dget(dentry); /* Extra count - pin the dentry in core */
    return 0;
}
From this function we can see that a directory gets the operation sets simple_dir_operations and cgroup_dir_inode_operations. These are the same operation sets that cgroup_get_rootdir() installs on the inode of the root directory. A regular file gets the operation set cgroup_file_operations.
We set the file operations of cgroup aside for now and analyze them in detail later.
This completes the analysis of mounting the cgroup filesystem. Next we look at how child cgroups are created under it.

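7: Creating a child cgroup
A child cgroup is created by calling mkdir inside the directory. This process is analyzed below:
From the analysis above we know that the operation set for cgroup directories is cgroup_dir_inode_operations. Its structure is as follows:
static struct inode_operations cgroup_dir_inode_operations = {
    .lookup = simple_lookup,
    .mkdir = cgroup_mkdir,
    .rmdir = cgroup_rmdir,
    .rename = cgroup_rename,
};
As shown above, the entry point for mkdir is cgroup_mkdir(). The code is as follows: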
static int cgroup_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
    /* find the parent cgroup */
    struct cgroup *c_parent = dentry->d_parent->d_fsdata;

    /* the vfs holds inode->i_mutex already */
    /* call cgroup_create() to create the cgroup */
    return cgroup_create(c_parent, dentry, mode | S_IFDIR);
}
Following it into cgroup_create(). The code is as follows:
static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
                          int mode)
{
    struct cgroup *cgrp;
    struct cgroupfs_root *root = parent->root;
    int err = 0;
    struct cgroup_subsys *ss;
    struct super_block *sb = root->sb;

    /* allocate and initialize a cgroup */
    cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
    if (!cgrp)
        return -ENOMEM;

    /* Grab a reference on the superblock so the hierarchy doesn't
     * get deleted on unmount if there are child cgroups.  This
     * can be done outside cgroup_mutex, since the sb can't
     * disappear while someone has an open control file on the
     * fs */
    atomic_inc(&sb->s_active);

    mutex_lock(&cgroup_mutex);

    init_cgroup_housekeeping(cgrp);

    /* set up the position of cgrp in the hierarchy */
    cgrp->parent = parent;
    cgrp->root = parent->root;
    cgrp->top_cgroup = parent->top_cgroup;

    /* if the parent cgroup has CGRP_NOTIFY_ON_RELEASE set, set it on cgrp as well */
    if (notify_on_release(parent))
        set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);

    /* call subsys->create() to produce a cgroup_subsys_state and associate it with cgrp */
    for_each_subsys(root, ss) {
        struct cgroup_subsys_state *css = ss->create(ss, cgrp);
        if (IS_ERR(css)) {
            err = PTR_ERR(css);
            goto err_destroy;
        }
        init_cgroup_css(css, ss, cgrp);
    }

    /* add cgrp to the children list of the parent cgroup */
    list_add(&cgrp->sibling, &cgrp->parent->children);
    /* increment the cgroup count of root */
    root->number_of_cgroups++;
    /* create a directory in the current directory */
    err = cgroup_create_dir(cgrp, dentry, mode);
    if (err < 0)
        goto err_remove;

    /* The cgroup directory was pre-locked for us */
    BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
    /* create the control files under cgrp */
    err = cgroup_populate_dir(cgrp);
    /* If err < 0, we have a half-filled directory - oh well ;) */

    mutex_unlock(&cgroup_mutex);
    mutex_unlock(&cgrp->dentry->d_inode->i_mutex);

    return 0;

err_remove:

    list_del(&cgrp->sibling);
    root->number_of_cgroups--;

err_destroy:

    for_each_subsys(root, ss) {
        if (cgrp->subsys[ss->subsys_id])
            ss->destroy(ss, cgrp);
    }

    mutex_unlock(&cgroup_mutex);

    /* Release the reference count that we took on the superblock */
    deactivate_super(sb);

    kfree(cgrp);
    return err;
}
This function mainly allocates and initializes a cgroup structure and places it in the spatial hierarchy formed with its parent directory and the whole cgroupfs_root. It then calls the subsys->create() operation so that each subsys knows a cgroup structure has been created.
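
The hierarchy bookkeeping done by cgroup_create() can be modelled in user space as follows. This is a minimal sketch under simplifying assumptions: the demo_* types are hypothetical, children are kept on a singly linked list instead of a list_head, and the subsystem state, directory creation and locking are left out.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_root { int number_of_cgroups; };

struct demo_cgroup {
    struct demo_cgroup *parent;
    struct demo_cgroup *first_child;   /* head of the children list */
    struct demo_cgroup *next_sibling;  /* next entry in the parent's list */
    struct demo_cgroup *top_cgroup;
    struct demo_root *root;
};

/* Allocate a child of parent and hook it into the hierarchy. */
static struct demo_cgroup *demo_create(struct demo_cgroup *parent)
{
    struct demo_cgroup *cgrp = calloc(1, sizeof(*cgrp));

    if (!cgrp)
        return NULL;
    /* the child inherits root and top_cgroup from its parent */
    cgrp->parent = parent;
    cgrp->root = parent->root;
    cgrp->top_cgroup = parent->top_cgroup;
    /* link into the parent's children list and bump the counter */
    cgrp->next_sibling = parent->first_child;
    parent->first_child = cgrp;
    cgrp->root->number_of_cgroups++;
    return cgrp;
}
```

However deep a cgroup is nested, its root and top_cgroup members lead straight back to the mount point, because each level simply copies them from its parent.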
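To tie this part together, the cgroup filesystem mount and cgroup creation analyzed above, together with the attach_task() operation analyzed next, are summarized in one figure, as follows: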
[Figure: relationships among the cgroup data structures]

8: File operations in cgroup
Next, let's look at the operations on cgroup files. As analyzed above, the operation set for these files is cgroup_file_operations, shown below:
static struct file_operations cgroup_file_operations = {
.read = cgroup_file_read,
.write = cgroup_file_write,
.llseek = generic_file_llseek,
.open = cgroup_file_open,
.release = cgroup_file_release,
};

7.1: The open operation on cgroup files
The corresponding function is cgroup_file_open(). The code is as follows:
static int cgroup_file_open(struct inode *inode, struct file *file)
{
int err;
struct cftype *cft;

err = generic_file_open(inode, file);
if (err)
      return err;

/* Get the struct cftype that backs this file */
cft = __d_cft(file->f_dentry);
if (!cft)
      return -ENODEV;
/* If read_map or read_seq_string is defined */
if (cft->read_map || cft->read_seq_string) {
      struct cgroup_seqfile_state *state =
         kzalloc(sizeof(*state), GFP_USER);
      if (!state)
         return -ENOMEM;
      state->cft = cft;
      state->cgroup = __d_cgrp(file->f_dentry->d_parent);
      file->f_op = &cgroup_seqfile_operations;
      err = single_open(file, cgroup_seqfile_show, state);
      if (err < 0)
         kfree(state);
}
/* Otherwise call cft->open(), if there is one */
else if (cft->open)
      err = cft->open(inode, file);
else
      err = 0;

return err;
}
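There are two cases. In the first, read_map or read_seq_string is defined, and the file's operation set becomes cgroup_seqfile_operations. Otherwise, the cftype's open() function is called, if one is defined. We will analyze the first case in detail when we run into it later.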

7.2: The read operation on cgroup files
The corresponding function is cgroup_file_read(). The code is as follows:
static ssize_t cgroup_file_read(struct file *file, char __user *buf,
               size_t nbytes, loff_t *ppos)
{
struct cftype *cft = __d_cft(file->f_dentry);
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);

if (!cft || cgroup_is_removed(cgrp))
      return -ENODEV;

if (cft->read)
      return cft->read(cgrp, cft, file, buf, nbytes, ppos);
if (cft->read_u64)
      return cgroup_read_u64(cgrp, cft, file, buf, nbytes, ppos);
if (cft->read_s64)
      return cgroup_read_s64(cgrp, cft, file, buf, nbytes, ppos);
return -EINVAL;
}
As the code shows, a read is forwarded to the cftype's read(), read_u64() or read_s64() handler.

7.3: The write operation on cgroup files
The corresponding function is cgroup_file_write(), shown below:
static ssize_t cgroup_file_write(struct file *file, const char __user *buf,
                     size_t nbytes, loff_t *ppos)
{
struct cftype *cft = __d_cft(file->f_dentry);
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);

if (!cft || cgroup_is_removed(cgrp))
      return -ENODEV;
if (cft->write)
      return cft->write(cgrp, cft, file, buf, nbytes, ppos);
if (cft->write_u64 || cft->write_s64)
      return cgroup_write_X64(cgrp, cft, file, buf, nbytes, ppos);
if (cft->write_string)
      return cgroup_write_string(cgrp, cft, file, buf, nbytes, ppos);
if (cft->trigger) {
      int ret = cft->trigger(cgrp, (unsigned int)cft->private);
      return ret ? ret : nbytes;
}
return -EINVAL;
}
As shown above, the operation finally ends up in the cftype's write, write_u64/write_s64, write_string or trigger handler.
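The order of the checks matters: the first handler that is defined wins. The following is a minimal user-space sketch of that dispatch order, not kernel code. The `demo_cftype` structure and every `demo_*` name are hypothetical stand-ins introduced for illustration, and the handler signatures are simplified.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cftype: every handler is optional. */
struct demo_cftype {
    long (*write)(const char *buf, size_t nbytes);
    long (*write_u64)(unsigned long long val);
    long (*write_string)(const char *buf);
    long (*trigger)(void);
};

/* Same order of checks as cgroup_file_write(): first defined handler wins. */
long demo_file_write(const struct demo_cftype *cft, const char *buf,
                     size_t nbytes, unsigned long long parsed)
{
    assert(buf != NULL);
    if (!cft)
        return -ENODEV;
    if (cft->write)
        return cft->write(buf, nbytes);
    if (cft->write_u64)
        return cft->write_u64(parsed);
    if (cft->write_string)
        return cft->write_string(buf);
    if (cft->trigger) {
        long ret = cft->trigger();
        /* A successful trigger reports the whole buffer as consumed. */
        return ret ? ret : (long)nbytes;
    }
    return -EINVAL;
}

/* Example handlers, used only to observe which path was taken. */
long demo_u64_handler(unsigned long long val) { return (long)val; }
long demo_trigger_ok(void) { return 0; }
```

A file type that defines both `write_u64` and `trigger` is served by `write_u64`, and a file type with no handler at all yields `-EINVAL`.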

7.4: The debug subsystem
We use the debug subsystem as an example to illustrate file operations in cgroup.
The debug subsys is defined as follows:
struct cgroup_subsys debug_subsys = {
.name = "debug",
.create = debug_create,
.destroy = debug_destroy,
.populate = debug_populate,
.subsys_id = debug_subsys_id,
};
In cgroup_init_subsys(), debug's create() is called with dummytop as its argument. The corresponding function is debug_create(). The code is as follows:
static struct cgroup_subsys_state *debug_create(struct cgroup_subsys *ss,
                        struct cgroup *cont)
{
struct cgroup_subsys_state *css = kzalloc(sizeof(*css), GFP_KERNEL);

if (!css)
      return ERR_PTR(-ENOMEM);

return css;
}
Nothing much to say here: it simply allocates a cgroup_subsys_state structure.

Next, mount the cgroup filesystem with this command:
[root@localhost ~]# mount -t cgroup cgroup -o debug /dev/cgroup/
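rebind_subsystems() calls the subsys's bind function, but debug does not provide one, so it can be ignored here.
Then cgroup_populate_dir() calls the populate interface, which for debug is debug_populate(). The code is as follows: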
static int debug_populate(struct cgroup_subsys *ss, struct cgroup *cont)
{
return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
}
The files array in debug is defined as follows:
static struct cftype files[] =  {
{
      .name = "cgroup_refcount",
      .read_u64 = cgroup_refcount_read,
},
{
      .name = "taskcount",
      .read_u64 = taskcount_read,
},

{
      .name = "current_css_set",
      .read_u64 = current_css_set_read,
},

{
      .name = "current_css_set_refcount",
      .read_u64 = current_css_set_refcount_read,
},

{
      .name = "releasable",
      .read_u64 = releasable_read,
},
};
Let's look at the files under /dev/cgroup:
[root@localhost ~]# tree /dev/cgroup/
/dev/cgroup/
|-- debug.cgroup_refcount
|-- debug.current_css_set
|-- debug.current_css_set_refcount
|-- debug.releasable
|-- debug.taskcount
|-- notify_on_release
|-- release_agent
`-- tasks

0 directories, 8 files
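The files with a debug prefix are created by the debug subsys. The others are created from the files array in cgroup.c.
Let's first analyze the files that every subsys shares: tasks, release_agent and notify_on_release.

7.5: Operations on the tasks file
The cftype structure for the tasks file is as follows: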
static struct cftype files[] = {
{
      .name = "tasks",
      .open = cgroup_tasks_open,
      .write_u64 = cgroup_tasks_write,
      .release = cgroup_tasks_release,
      .private = FILE_TASKLIST,
}

7.5.1: The open operation on the tasks file
When the file is opened, control passes to cgroup_tasks_open(). The code is as follows:
static int cgroup_tasks_open(struct inode *unused, struct file *file)
{
/* Get the cgroup of the directory that contains this file */
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
pid_t *pidarray;
int npids;
int retval;

/* Nothing to do for write-only files */
/* i.e. the file was not opened for reading */
if (!(file->f_mode & FMODE_READ))
      return 0;

/*
   * If cgroup gets more users after we read count, we won't have
   * enough space - tough.  This race is indistinguishable to the
   * caller from the case that the additional cgroup users didn't
   * show up until sometime later on.
   */
   /* Get the number of tasks associated with this cgroup */
npids = cgroup_task_count(cgrp);
/* Allocate room for npids pids */
pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
if (!pidarray)
      return -ENOMEM;
/* Load the pids of the tasks associated with the cgroup into
   * pidarray, then sort them in ascending order
   */
npids = pid_array_load(pidarray, npids, cgrp);
sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);

/*
   * Store the array in the cgroup, freeing the old
   * array if necessary
   */
   /* Store npids and pidarray in the cgroup. If the cgroup already
   * had a tasks_pids array, free the space it occupied
   */
down_write(&cgrp->pids_mutex);
kfree(cgrp->tasks_pids);
cgrp->tasks_pids = pidarray;
cgrp->pids_length = npids;
cgrp->pids_use_count++;
up_write(&cgrp->pids_mutex);

/* Switch the file's operation set to cgroup_tasks_operations */
file->f_op = &cgroup_tasks_operations;

retval = seq_open(file, &cgroup_tasks_seq_operations);
/* On failure, release the pid array stored in the cgroup */
if (retval) {
      release_cgroup_pid_array(cgrp);
      return retval;
}
((struct seq_file *)file->private_data)->private = cgrp;
return 0;
}
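First, consider this question: how do we find the processes associated with a cgroup?
Recall the data-structure relationship diagram shown earlier. Every process points to a css_set, and all processes associated with that css_set are linked into the css_set->tasks list. A cgroup, in turn, can find every css_set associated with it through the intermediate structure cg_cgroup_link. From there we can reach every process associated with the cgroup.
In the code above, cgroup_task_count() is called to get the number of associated processes. The code is as follows: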
int cgroup_task_count(const struct cgroup *cgrp)
{
int count = 0;
struct cg_cgroup_link *link;

read_lock(&css_set_lock);
list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
      count += atomic_read(&link->cg->refcount);
}
read_unlock(&css_set_lock);
return count;
}
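It walks cgrp->css_sets, converts each entry to a cg_cgroup_link, and gets the css_set from that link. The css_set's reference count is the number of tasks pointing to that css_set.

In the code, pid_array_load() finds the tasks associated with the cgroup and writes their pids into the pidarray array. The code is as follows: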
static int pid_array_load(pid_t *pidarray, int npids, struct cgroup *cgrp)
{
int n = 0;
struct cgroup_iter it;
struct task_struct *tsk;
cgroup_iter_start(cgrp, &it);
while ((tsk = cgroup_iter_next(cgrp, &it))) {
      if (unlikely(n == npids))
         break;
      pidarray[n++] = task_pid_vnr(tsk);
}
cgroup_iter_end(cgrp, &it);
return n;
}
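Here we meet a new structure: struct cgroup_iter. It is an iterator over a cgroup, used to walk the tasks associated with the cgroup. It is used as follows:
1: Call cgroup_iter_start() to initialize the iterator.
2: Call cgroup_iter_next() to get the next task in the cgroup.
3: When done, call cgroup_iter_end().
Let's analyze these three steps.
cgroup_iter_start() is as follows: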
void cgroup_iter_start(struct cgroup *cgrp, struct cgroup_iter *it)
{
/*
   * The first time anyone tries to iterate across a cgroup,
   * we need to enable the list linking each css_set to its
   * tasks, and fix up all existing tasks.
   */
if (!use_task_css_set_links)
      cgroup_enable_task_cg_lists();

read_lock(&css_set_lock);
it->cg_link = &cgrp->css_sets;
cgroup_advance_iter(cgrp, it);
}
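Here we meet the use_task_css_set_links variable again. When analyzing cgroup_post_fork() earlier, we said that task->cg_list is linked into css_set->tasks only when use_task_css_set_links is set to 1.
So at this point, if use_task_css_set_links is 0, every existing process must first be linked into the tasks list of the css_set it points to. This is done in cgroup_enable_task_cg_lists(). That function is quite simple: it walks all tasks and links each one into its list, so it is not analyzed in detail here. Please read its code yourself. *^_^*
Then it->cg_link is pointed at cgrp->css_sets. As noted earlier, cgrp->css_sets leads to every css_set associated with the cgroup.
At this point the iterator is still empty, so the next step is to fill it with data. This is done in cgroup_advance_iter(). The code is as follows: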
static void cgroup_advance_iter(struct cgroup *cgrp,
               struct cgroup_iter *it)
{
struct list_head *l = it->cg_link;
struct cg_cgroup_link *link;
struct css_set *cg;

/* Advance to the next non-empty css_set */
do {
l = l->next;
if (l == &cgrp->css_sets) {
      it->cg_link = NULL;
      return;
}
link = list_entry(l, struct cg_cgroup_link, cgrp_link_list);
cg = link->cg;
} while (list_empty(&cg->tasks));
it->cg_link = l;
it->task = cg->tasks.next;
}
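From the earlier analysis, it->cg_link leads to the associated css_set, and the css_set leads to its list of tasks. So filling the cgroup iterator means finding a css_set whose tasks list is not empty. Tasks are then taken from css_set->tasks, and when that list is used up, the iterator moves on to the next css_set whose tasks list is not empty.
With that in mind the function is simple: starting from it->cg_link, it finds the next css_set whose tasks list is not empty.

cgroup_iter_next() is as follows: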
struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
                  struct cgroup_iter *it)
{
struct task_struct *res;
struct list_head *l = it->task;

/* If the iterator cg is NULL, we have no tasks */
if (!it->cg_link)
      return NULL;
res = list_entry(l, struct task_struct, cg_list);
/* Advance iterator to find next entry */
l = l->next;
if (l == &res->cgroups->tasks) {
      /* We reached the end of this task list - move on to
      * the next cg_cgroup_link */
      cgroup_advance_iter(cgrp, it);
} else {
      it->task = l;
}
return res;
}
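If it->cg_link is NULL, every css_set has been walked and there are no tasks left. Otherwise the task is taken from it->task. If it was the last task on the list, cgroup_advance_iter() is called to refill the iterator. Finally the task is returned.

cgroup_iter_end() finishes the iteration. The code is as follows: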
void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it)
{
read_unlock(&css_set_lock);
}
It simply releases the lock taken in cgroup_iter_start().
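The start/next/end protocol is a two-level walk: an outer walk over css_sets and an inner walk over the tasks of each one, skipping empty sets. Here is a small user-space sketch of the same idea. It uses plain arrays instead of kernel linked lists and omits the locking, and all `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a css_set: just an array of pids. */
struct demo_set {
    const int *pids;
    int npids;
};

/* Iterator state: the current set, and the next task inside it. */
struct demo_iter {
    const struct demo_set *sets;
    int nsets;
    int set_idx;    /* equals nsets once the walk is finished */
    int task_idx;
};

/* Like cgroup_advance_iter(): move on to the next non-empty set. */
void demo_advance(struct demo_iter *it)
{
    do {
        it->set_idx++;
        if (it->set_idx >= it->nsets)
            return;                 /* walk finished */
    } while (it->sets[it->set_idx].npids == 0);
    it->task_idx = 0;
}

void demo_iter_start(struct demo_iter *it, const struct demo_set *sets, int nsets)
{
    assert(sets != NULL || nsets == 0);
    it->sets = sets;
    it->nsets = nsets;
    it->set_idx = -1;               /* "before the first set" */
    it->task_idx = 0;
    demo_advance(it);
}

/* Returns the next pid, or -1 when every set has been consumed. */
int demo_iter_next(struct demo_iter *it)
{
    const struct demo_set *s;
    int pid;

    if (it->set_idx >= it->nsets)
        return -1;
    s = &it->sets[it->set_idx];
    pid = s->pids[it->task_idx++];
    if (it->task_idx >= s->npids)
        demo_advance(it);           /* this set is used up */
    return pid;
}
```

As in the kernel version, the value to return is fetched first and the iterator is advanced afterwards, so the end of one set is detected before the next call.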

Back in cgroup_tasks_open(), we next meet the interfaces the kernel provides for sequential files. The first one in the code is seq_open(). The code is as follows:
int seq_open(struct file *file, const struct seq_operations *op)
{
struct seq_file *p = file->private_data;

if (!p) {
      p = kmalloc(sizeof(*p), GFP_KERNEL);
      if (!p)
         return -ENOMEM;
      file->private_data = p;
}
memset(p, 0, sizeof(*p));
mutex_init(&p->lock);
p->op = op;
file->f_version = 0;
/* SEQ files support lseek, but not pread/pwrite */
file->f_mode &= ~(FMODE_PREAD | FMODE_PWRITE);
return 0;
}
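As the code shows, it initializes a struct seq_file and attaches it to file->private_data. Note that seq_file->op is set to the op argument, which in our case is cgroup_tasks_seq_operations. This is used when we analyze the read operation.

7.5.2: The read operation on the tasks file
As seen above, open changes file->f_op to point to cgroup_tasks_operations. That structure is as follows: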
static struct file_operations cgroup_tasks_operations = {
.read = seq_read,
.llseek = seq_lseek,
.write = cgroup_file_write,
.release = cgroup_tasks_release,
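};
Accordingly, a read ends up in seq_read(). That function is long, so it is not listed here; interested readers can trace it themselves. In essence it repeatedly calls seq_file->op->start() -> seq_file->op->show() -> seq_file->op->next() -> seq_file->op->stop().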
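To make the call order concrete, here is a heavily simplified user-space sketch of a driver loop in the spirit of seq_read(). The real function also handles buffer growth, partial reads and file offsets, none of which is modeled here. All `demo_*` names are hypothetical, and the position is a plain array index rather than a pid.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space analogue of struct seq_operations. */
struct demo_seq_ops {
    const int *(*start)(const int *data, int len, long *pos);
    const int *(*next)(const int *data, int len, const int *v, long *pos);
    int (*show)(char *out, size_t room, const int *v);
    void (*stop)(void);
};

const int *demo_start(const int *data, int len, long *pos)
{
    return (*pos < len) ? &data[*pos] : NULL;
}

const int *demo_next(const int *data, int len, const int *v, long *pos)
{
    (void)v;
    ++*pos;
    return (*pos < len) ? &data[*pos] : NULL;
}

int demo_show(char *out, size_t room, const int *v)
{
    return snprintf(out, room, "%d\n", *v);
}

void demo_stop(void)
{
    /* The kernel counterpart releases its lock here. */
}

/* start once, then show/next until next returns NULL, then stop once. */
size_t demo_seq_render(const struct demo_seq_ops *ops, const int *data,
                       int len, char *out, size_t room)
{
    long pos = 0;
    size_t used = 0;
    const int *v;

    assert(room > 0);
    out[0] = '\0';
    v = ops->start(data, len, &pos);
    while (v) {
        int n = ops->show(out + used, room - used, v);
        if (n < 0 || (size_t)n >= room - used) {
            out[used] = '\0';       /* drop the truncated record */
            break;
        }
        used += (size_t)n;
        v = ops->next(data, len, v, &pos);
    }
    ops->stop();
    return used;
}
```

Note that stop() runs exactly once per pass even when start() returns NULL, which is why a lock taken in start() can be safely released in stop().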
When analyzing the open operation on the tasks file, we mentioned that seq_file->op points to cgroup_tasks_seq_operations. It is defined as follows:
static struct seq_operations cgroup_tasks_seq_operations = {
.start = cgroup_tasks_start,
.stop = cgroup_tasks_stop,
.next = cgroup_tasks_next,
.show = cgroup_tasks_show,
};
cgroup_tasks_start() is as follows:
static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
{
/*
   * Initially we receive a position value that corresponds to
   * one more than the last pid shown (or 0 on the first call or
   * after a seek to the start). Use a binary-search to find the
   * next pid to display, if any
   */
struct cgroup *cgrp = s->private;
int index = 0, pid = *pos;
int *iter;

down_read(&cgrp->pids_mutex);
if (pid) {
      int end = cgrp->pids_length;

      while (index < end) {
         int mid = (index + end) / 2;
         if (cgrp->tasks_pids[mid] == pid) {
            index = mid;
            break;
         } else if (cgrp->tasks_pids[mid] <= pid)
            index = mid + 1;
         else
            end = mid;
      }
}
/* If we're off the end of the array, we're done */
if (index >= cgrp->pids_length)
      return NULL;
/* Update the abstract position to be the actual pid that we found */
iter = cgrp->tasks_pids + index;
*pos = *iter;
return iter;
}
It uses binary search on cgrp->tasks_pids[] to find the first entry greater than or equal to *pos. If one is found, that entry is returned; otherwise NULL is returned.
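The search can be tried out in isolation. The sketch below repeats the same loop on a plain sorted array in user space; `demo_find_start` is a hypothetical name, and it returns an index (or `len` when every entry is smaller) instead of a pointer.

```c
#include <assert.h>

/*
 * Index of an entry equal to pid, or of the first entry greater than pid,
 * in a sorted array of distinct pids. Returns len if no such entry exists.
 * pid == 0 means "start from the beginning", as in cgroup_tasks_start().
 */
int demo_find_start(const int *pids, int len, int pid)
{
    int index = 0;
    int end = len;

    assert(len >= 0);
    if (pid) {
        while (index < end) {
            int mid = index + (end - index) / 2;
            if (pids[mid] == pid) {
                index = mid;
                break;
            } else if (pids[mid] <= pid) {
                index = mid + 1;    /* answer lies to the right */
            } else {
                end = mid;          /* answer is mid or to the left */
            }
        }
    }
    return index;
}
```

Resuming by value rather than by array index is what lets a reader continue sensibly even if the pid array was rebuilt between two reads.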

cgroup_tasks_show() is as follows:
static int cgroup_tasks_show(struct seq_file *s, void *v)
{
return seq_printf(s, "%d\n", *(int *)v);
}
It converts the pid to a string.

cgroup_tasks_next() moves to the next entry in the array. The code is as follows:
static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
{
struct cgroup *cgrp = s->private;
int *p = v;
int *end = cgrp->tasks_pids + cgrp->pids_length;

/*
   * Advance to the next pid in the array. If this goes off the
   * end, we're done
   */
p++;
if (p >= end) {
      return NULL;
} else {
      *pos = *p;
      return p;
}
}

cgroup_tasks_stop() is as follows:
static void cgroup_tasks_stop(struct seq_file *s, void *v)
{
struct cgroup *cgrp = s->private;
up_read(&cgrp->pids_mutex);
}
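It simply releases the read side of the rw semaphore taken in cgroup_tasks_start().

7.5.3: The close operation on the tasks file
When the tasks file is closed, cgroup_tasks_release() is called. The code is as follows: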
static int cgroup_tasks_release(struct inode *inode, struct file *file)
{
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);

if (!(file->f_mode & FMODE_READ))
      return 0;

release_cgroup_pid_array(cgrp);
return seq_release(inode, file);
}
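It releases the pid information held in the cgroup and the seq_file state.

We have now analyzed the open, read and close operations on the tasks file, so we can run an experiment to check the analysis.
As analyzed earlier, cgroupfs_root.top_cgroup is associated with every css_set in the system, so the processes reachable from cgroupfs_root.top_cgroup should be all processes currently in the system. Accordingly, the tasks file in the mount directory should contain the pid of every process in the system.
As shown below: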
[root@localhost cgroup]# cat tasks
1
2
3
………
………
2578
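This is deliberate on the part of the cgroup developers: it means every process is under the control of the hierarchy.
Conversely, when we mkdir a directory under the mount point, its tasks file should be empty, because the new cgroup has no task associated with it yet.
As shown below: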
[root@localhost cgroup]# mkdir eric
[root@localhost cgroup]# cat eric/tasks
[root@localhost cgroup]#
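Next, let's look at the write operation on the tasks file, that is, how a process is added to a cgroup.

7.5.4: The write operation on the tasks file
From the cftype above, the write handler for the tasks file is cgroup_tasks_write(). The code is as follows: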
static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
{
int ret;
/* Fail if the cgroup has already been removed */
if (!cgroup_lock_live_group(cgrp))
      return -ENODEV;
/* Associate the task whose PID is pid with the cgroup */
ret = attach_task_by_pid(cgrp, pid);
cgroup_unlock();
return ret;
}
attach_task_by_pid() is as follows:
static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
{
struct task_struct *tsk;
int ret;

/* If pid is non-zero, look up the task with that PID and take a reference on it */
if (pid) {
      rcu_read_lock();
      tsk = find_task_by_vpid(pid);
      if (!tsk || tsk->flags & PF_EXITING) {
         rcu_read_unlock();
         return -ESRCH;
      }
      get_task_struct(tsk);
      rcu_read_unlock();

      if ((current->euid) && (current->euid != tsk->uid)
         && (current->euid != tsk->suid)) {
         put_task_struct(tsk);
         return -EACCES;
      }
}
/* If pid is 0, add the current task to the cgroup */
else {
      tsk = current;
      get_task_struct(tsk);
}
/* Associate the task with the cgroup */
ret = cgroup_attach_task(cgrp, tsk);
/* Done, drop the reference */
put_task_struct(tsk);
return ret;
}
A non-zero value written to the file is taken as a PID. Writing 0 means the current process. The core of the operation is cgroup_attach_task(). The code is as follows:
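From user space, this path is normally reached with a shell echo, but the same write can be done from C. The following is a minimal sketch, assuming a tasks file path supplied by the caller (for example /dev/cgroup/eric/tasks on the test system used in this article). `demo_attach_pid` is a hypothetical helper name; passing 0 as the pid asks the kernel to attach the calling process.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write "<pid>\n" to the given tasks file with a single write(2).
 * Returns 0 on success and -1 on any failure.
 */
int demo_attach_pid(const char *tasks_path, long pid)
{
    char buf[32];
    int len;
    int fd;
    ssize_t written;

    assert(tasks_path != NULL && pid >= 0);
    len = snprintf(buf, sizeof(buf), "%ld\n", pid);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return -1;

    /* The file must already exist; a cgroup creates it for us. */
    fd = open(tasks_path, O_WRONLY);
    if (fd < 0)
        return -1;

    written = write(fd, buf, (size_t)len);
    if (close(fd) != 0)
        return -1;
    return written == (ssize_t)len ? 0 : -1;
}
```

Each pid goes in its own write because the handler parses a single number per call.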
int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
int retval = 0;
struct cgroup_subsys *ss;
struct cgroup *oldcgrp;
struct css_set *cg = tsk->cgroups;
struct css_set *newcg;
struct cgroupfs_root *root = cgrp->root;
int subsys_id;

/* Get the id of the first subsys bound to the cgroup's hierarchy */
get_first_subsys(cgrp, NULL, &subsys_id);

/* Nothing to do if the task is already in that cgroup */
/* Find the cgroup the task currently belongs to */
oldcgrp = task_cgroup(tsk, subsys_id);
/* The task is already in cgrp */
if (cgrp == oldcgrp)
      return 0;

/* Walk the subsystems bound to the hierarchy.
   * If a subsys defines can_attach(), call it
   */
for_each_subsys(root, ss) {
      if (ss->can_attach) {
         retval = ss->can_attach(ss, cgrp, tsk);
         if (retval)
            return retval;
      }
}

/*
   * Locate or allocate a new css_set for this task,
   * based on its final set of cgroups
   */
   /* Find the css_set the task should now use; create one if none exists */
newcg = find_css_set(cg, cgrp);
if (!newcg)
      return -ENOMEM;

task_lock(tsk);

/* The task is exiting */
if (tsk->flags & PF_EXITING) {
      task_unlock(tsk);
      put_css_set(newcg);
      return -ESRCH;
}
/* Point tsk->cgroups at the new css_set */
rcu_assign_pointer(tsk->cgroups, newcg);
task_unlock(tsk);

/* Update the css_set linked lists if we're using them */
/* Move tsk->cg_list to the new css_set's tasks list */
write_lock(&css_set_lock);
if (!list_empty(&tsk->cg_list)) {
      list_del(&tsk->cg_list);
      list_add(&tsk->cg_list, &newcg->tasks);
}
write_unlock(&css_set_lock);

/* Walk the subsystems bound to the hierarchy.
   * If a subsys defines attach(), call it
   */
for_each_subsys(root, ss) {
      if (ss->attach)
         ss->attach(ss, cgrp, oldcgrp, tsk);
}
set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
synchronize_rcu();
/* Drop the reference on the old css_set */
put_css_set(cg);
return 0;
}
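The logic of this function is clear: it points tsk->cgroups at the right css_set and then lets each subsys attach the task. The comments in the code should be enough to follow it, so no more is said here.
The find_css_set() function deserves a closer look, as it is rather interesting. The code is as follows: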
static struct css_set *find_css_set(
struct css_set *oldcg, struct cgroup *cgrp)
{
struct css_set *res;
struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
int i;

struct list_head tmp_cg_links;
struct cg_cgroup_link *link;

struct hlist_head *hhead;

/* First see if we already have a cgroup group that matches
   * the desired set */
read_lock(&css_set_lock);
/* Look for an existing css_set that equals oldcg with cgrp swapped in; NULL if none */
res = find_existing_css_set(oldcg, cgrp, template);
/* If such a css_set exists, take a reference and return it */
if (res)
      get_css_set(res);
read_unlock(&css_set_lock);

if (res)
      return res;
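This part first searches the hash table for the css_set that results from moving oldcg to cgrp. If none exists, find_existing_css_set() returns NULL. If one is already in the hash table, its reference count is incremented and it is returned.
find_existing_css_set() is as follows: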
static struct css_set *find_existing_css_set(
struct css_set *oldcg,
struct cgroup *cgrp,
struct cgroup_subsys_state *template[])
{
int i;
struct cgroupfs_root *root = cgrp->root;
struct hlist_head *hhead;
struct hlist_node *node;
struct css_set *cg;

/* Built the set of subsystem state objects that we want to
   * see in the new css_set */
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
      if (root->subsys_bits & (1UL << i)) {
         /* Subsystem is in this hierarchy. So we want
         * the subsystem state from the new
         * cgroup */
         template[i] = cgrp->subsys[i];
      } else {
         /* Subsystem is not in this hierarchy, so we
         * don't want to change the subsystem state */
         template[i] = oldcg->subsys[i];
      }
}

hhead = css_set_hash(template);
hlist_for_each_entry(cg, node, hhead, hlist) {
      if (!memcmp(template, cg->subsys, sizeof(cg->subsys))) {
         /* All subsystems matched */
         return cg;
      }
}

/* No existing cgroup group matched */
return NULL;
}
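If a subsys is bound to the new cgroup's hierarchy, its template entry points to the matching entry of the new cgroup->subsys[]. Otherwise it points to the matching entry of the old css_set. The reason is that the process may also be attached to other hierarchies, and its state in those hierarchies must be preserved.
Finally, css_set_table[] is searched for an entry equal to template. If one is found it is returned, otherwise NULL is returned.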
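The template step is easy to reproduce outside the kernel. The sketch below builds the wanted per-subsystem pointer array from a bitmask and then looks for an existing set with memcmp(). It is only an illustration: `DEMO_SUBSYS_COUNT`, `struct demo_css_set` and the `demo_*` functions are hypothetical, and a linear scan stands in for the hash-bucket walk.

```c
#include <assert.h>
#include <string.h>

#define DEMO_SUBSYS_COUNT 3   /* hypothetical; the kernel uses CGROUP_SUBSYS_COUNT */

/* Hypothetical stand-in for css_set: one state pointer per subsystem. */
struct demo_css_set {
    void *subsys[DEMO_SUBSYS_COUNT];
};

/*
 * For subsystems in the target hierarchy (bit set), take the state from
 * the new cgroup; for all others keep the task's current state.
 */
void demo_build_template(unsigned long subsys_bits,
                         void *const new_state[],
                         void *const old_state[],
                         void *tmpl[])
{
    int i;

    for (i = 0; i < DEMO_SUBSYS_COUNT; i++)
        tmpl[i] = (subsys_bits & (1UL << i)) ? new_state[i] : old_state[i];
}

/* Index of the set whose subsys[] equals tmpl, or -1 if there is none. */
int demo_find_existing(const struct demo_css_set *sets, int nsets,
                       void *const tmpl[])
{
    int i;

    assert(nsets >= 0);
    for (i = 0; i < nsets; i++)
        if (!memcmp(tmpl, sets[i].subsys, sizeof(sets[i].subsys)))
            return i;
    return -1;
}
```

Comparing the whole pointer array at once works because two css_sets are considered equal exactly when every per-subsystem state pointer matches.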

/* Allocate a css_set */
res = kmalloc(sizeof(*res), GFP_KERNEL);
if (!res)
      return NULL;

/* Allocate all the cg_cgroup_link objects that we'll need */
/* That is, root_count cg_cgroup_link entries */
if (allocate_cg_links(root_count, &tmp_cg_links) < 0) {
      kfree(res);
      return NULL;
}

/* Initialize the newly allocated css_set */
atomic_set(&res->refcount, 1);
INIT_LIST_HEAD(&res->cg_links);
INIT_LIST_HEAD(&res->tasks);
INIT_HLIST_NODE(&res->hlist);

/* Copy the set of subsystem state objects generated in
   * find_existing_css_set() */
   /* Set css_set->subsys */
memcpy(res->subsys, template, sizeof(res->subsys));
Reaching this point means no matching entry was found in css_set_table[], so a css_set is allocated and initialized, and its subsys field is filled in.

write_lock(&css_set_lock);
/* Add reference counts and links from the new css_set. */
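/* Walk every subsys and the matching entry in the css_set's subsys[],
   * creating the link from each cgroup the task is in to the css_set
   */
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
      struct cgroup *cgrp = res->subsys[i]->cgroup;
      struct cgroup_subsys *ss = subsys[i];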
      atomic_inc(&cgrp->count);
      /*
      * We want to add a link once per cgroup, so we
      * only do it for the first subsystem in each
      * hierarchy
      */
      if (ss->root->subsys_list.next == &ss->sibling) {
         BUG_ON(list_empty(&tmp_cg_links));
         link = list_entry(tmp_cg_links.next,
                  struct cg_cgroup_link,
                  cgrp_link_list);
         list_del(&link->cgrp_link_list);
         list_add(&link->cgrp_link_list, &cgrp->css_sets);
         link->cg = res;
         list_add(&link->cg_link_list, &res->cg_links);
      }
}

/* Nothing seems to modify rootnode.subsys_list, so this test is true in most cases */
if (list_empty(&rootnode.subsys_list)) {
      /* Link this css_set to dummytop */
      /* so that a newly created hierarchy can reach every process */
      link = list_entry(tmp_cg_links.next,
               struct cg_cgroup_link,
               cgrp_link_list);
      list_del(&link->cgrp_link_list);
      list_add(&link->cgrp_link_list, &dummytop->css_sets);
      link->cg = res;
      list_add(&link->cg_link_list, &res->cg_links);
}
BUG_ON(!list_empty(&tmp_cg_links));
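The key operations in this part are commented in the code. If the system has several hierarchies, the process is certainly also in a cgroup of each of the other hierarchies. That information has to be kept in the newly allocated css_set, which is done by creating the links from each cgroup to the css_set.
As for the list_empty(&rootnode.subsys_list) test, I could not find anything that modifies rootnode.subsys_list. If rootnode.subsys_list were not empty, however, it would already be handled by the for loop in front of it.
In short, the system has root_count hierarchies, so the linking step above runs root_count times and tmp_cg_links must be empty at the end. If it is not, something has gone wrong.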

/* Increment the css_set count */
css_set_count++;

/* Add this cgroup group to the hash table */
/* That is, the global hash table css_set_table[] */
hhead = css_set_hash(res->subsys);
hlist_add_head(&res->hlist, hhead);

write_unlock(&css_set_lock);

return res;
}
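Finally, the new css_set is added to the hash table css_set_table[].
This completes the analysis of the tasks file operations.

7.6: Operations on the notify_on_release file
The cftype structure for the notify_on_release file is as follows: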
{
      .name = "notify_on_release",
      .read_u64 = cgroup_read_notify_on_release,
      .write_u64 = cgroup_write_notify_on_release,
      .private = FILE_NOTIFY_ON_RELEASE,
}

So the read interface of this file is cgroup_read_notify_on_release(). The code is as follows:
static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,
                     struct cftype *cft)
{
return notify_on_release(cgrp);
}
Following notify_on_release():
static int notify_on_release(const struct cgroup *cgrp)
{
return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
}
So it returns 1 if the current cgroup has the CGRP_NOTIFY_ON_RELEASE flag set, and 0 otherwise.
A test on the running system:
[root@localhost cgroup]# cat notify_on_release
0
[root@localhost cgroup]#
The file contains 0 because CGRP_NOTIFY_ON_RELEASE is not set on top_cgroup.

The write interface of the notify_on_release file is cgroup_write_notify_on_release(). The code is as follows:
static int cgroup_write_notify_on_release(struct cgroup *cgrp,
                  struct cftype *cft,
                  u64 val)
{
clear_bit(CGRP_RELEASABLE, &cgrp->flags);
if (val)
      set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
else
      clear_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
return 0;
}
As the code shows, writing a non-zero value such as 1 sets the CGRP_NOTIFY_ON_RELEASE bit in the cgroup's flags, and writing 0 clears it. A test:
[root@localhost cgroup]# echo 1 > notify_on_release
[root@localhost cgroup]# cat notify_on_release
1
[root@localhost cgroup]# echo 0 > notify_on_release
[root@localhost cgroup]# cat notify_on_release
0
[root@localhost cgroup]#

7.7: Operations on the release_agent file
release_agent exists only in the top-level directory. Its cftype structure is as follows:
static struct cftype cft_release_agent = {
.name = "release_agent",
.read_seq_string = cgroup_release_agent_show,
.write_string = cgroup_release_agent_write,
.max_write_len = PATH_MAX,
.private = FILE_RELEASE_AGENT,
};

So the read interface is cgroup_release_agent_show(). The code is as follows:
static int cgroup_release_agent_show(struct cgroup *cgrp, struct cftype *cft,
                  struct seq_file *seq)
{
if (!cgroup_lock_live_group(cgrp))
      return -ENODEV;
seq_puts(seq, cgrp->root->release_agent_path);
seq_putc(seq, '\n');
cgroup_unlock();
return 0;
}
As the code shows, it prints the root's release_agent_path.

The write interface is cgroup_release_agent_write(), shown below:
static int cgroup_release_agent_write(struct cgroup *cgrp, struct cftype *cft,
                  const char *buffer)
{
BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX);
if (!cgroup_lock_live_group(cgrp))
      return -ENODEV;
strcpy(cgrp->root->release_agent_path, buffer);
cgroup_unlock();
return 0;
}
So writing to this file sets the root's release_agent_path. A test:
[root@localhost cgroup]# cat release_agent

[root@localhost cgroup]# echo /bin/ls > release_agent
[root@localhost cgroup]# cat release_agent
/bin/ls
[root@localhost cgroup]#

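7.8: The files created by debug
Now for the files of the debug subsys. Because we did not mount with noprefix, every file created by debug carries a "debug." prefix. The files created by debug are:
debug.cgroup_refcount  debug.current_css_set_refcount  debug.taskcount debug.current_css_set  debug.releasable
They are analyzed one by one below.
7.8.1: Operations on the cgroup_refcount file
The cftype structure for cgroup_refcount is as follows: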
{
      .name = "cgroup_refcount",
      .read_u64 = cgroup_refcount_read,
},
The file cannot be written, only read. Its read interface is cgroup_refcount_read(). The code is as follows:
static u64 cgroup_refcount_read(struct cgroup *cont, struct cftype *cft)
{
return atomic_read(&cont->count);
}
It shows the reference count of the current cgroup.
A test:
[root@localhost cgroup]# cat debug.cgroup_refcount
0
[root@localhost cgroup]#
The top-level cgroup is cgroupfs_root.top_cgroup, and its reference count is 0.
Next, create a child cgroup below it:
[root@localhost cgroup]# mkdir /dev/cgroup/eric
[root@localhost cgroup]# cat /dev/cgroup/eric/debug.cgroup_refcount
0
[root@localhost cgroup]#
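So creating a child cgroup does not raise the reference count, because the child only forms pointer links with its parent cgroup.
Now associate a process with the child cgroup: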
[root@localhost cgroup]# echo 1673 > /dev/cgroup/eric/tasks
[root@localhost cgroup]# cat /dev/cgroup/eric/debug.cgroup_refcount
1
[root@localhost cgroup]#
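Its count has become 1. The increment happens when the process's css_set is linked to the cgroup it lives in.

7.8.2: Operations on the current_css_set file
The cftype structure for current_css_set is as follows: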
{
      .name = "current_css_set",
      .read_u64 = current_css_set_read,
},
It is read-only too. The read interface is current_css_set_read(). The code is as follows:
static u64 current_css_set_read(struct cgroup *cont, struct cftype *cft)
{
return (u64)(long)current->cgroups;
}
It shows the address of the css_set associated with the current process.
A test:
[root@localhost cgroup]# cat debug.current_css_set
18446744072645980768

7.8.3: Operations on the current_css_set_refcount file
The cftype structure for the current_css_set_refcount file is as follows:
{
      .name = "current_css_set_refcount",
      .read_u64 = current_css_set_refcount_read,
},
As before, it is read-only. The interface is as follows:
static u64 current_css_set_refcount_read(struct cgroup *cont,
                     struct cftype *cft)
{
u64 count;

rcu_read_lock();
count = atomic_read(&current->cgroups->refcount);
rcu_read_unlock();
return count;
}
It shows the reference count of the css_set associated with the current process.
A test:
[root@localhost cgroup]# cat debug.current_css_set_refcount
56
This means 56 processes are associated with this css_set.

7.8.4: Operations on the taskcount file
The cftype structure for the taskcount file is as follows:
{
      .name = "taskcount",
      .read_u64 = taskcount_read,
},
A read-only file. The interface is as follows:
static u64 taskcount_read(struct cgroup *cont, struct cftype *cft)
{
u64 count;

cgroup_lock();
count = cgroup_task_count(cont);
cgroup_unlock();
return count;
}
The helper cgroup_task_count() was analyzed earlier: it counts the processes associated with the cgroup, so it is not covered again. A test:
[root@localhost cgroup]# cat debug.taskcount
56

7.8.5: Operations on the releasable file
The cftype structure for the releasable file is as follows:
{
      .name = "releasable",
      .read_u64 = releasable_read,
},
Read-only. The read interface is as follows:
static u64 releasable_read(struct cgroup *cgrp, struct cftype *cft)
{
return test_bit(CGRP_RELEASABLE, &cgrp->flags);
}
It reports whether the current cgroup has the CGRP_RELEASABLE flag: 1 if it does, 0 otherwise.
A test:
[root@localhost cgroup]# cat debug.releasable
0
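From the analysis above, we know that when an associated process is removed from a cgroup (moved to another cgroup), the CGRP_RELEASABLE flag is set on that cgroup. The following test shows this: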
[root@localhost cgroup]# mkdir eric
[root@localhost cgroup]# cat eric/debug.releasable
0
[root@localhost cgroup]# echo 1650 > eric/tasks
[root@localhost cgroup]# echo 1701 > eric/tasks
[root@localhost cgroup]# cat eric/debug.releasable
0
[root@localhost cgroup]# echo 1650 >tasks
[root@localhost cgroup]# cat eric/debug.releasable
1

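This completes the analysis of the files common to every subsys and of the files in debug. The other subsystems are far more complex than debug and will be covered in separate articles. Stay tuned for updates on this site. *^_^*

9: The notify_on_release operation
Now let's analyze something we have ignored so far: the CGRP_NOTIFY_ON_RELEASE flag and root->release_agent_path[].
Their purpose is this: when the last process leaves a cgroup (either because the process exits or because it is attached to another cgroup in the same hierarchy), or when the last child cgroup is removed, a user-space program is invoked. The path of that program is given in root->release_agent_path[].
Let's trace this through the code.

9.1: Process exit
When analyzing the cgroup relationship between parent and child processes, we skipped one part of __put_css_set(). It is time to look at it.
The skipped fragment of __put_css_set() is listed below: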
static void __put_css_set(struct css_set *cg, int taskexit)
{
......
......
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
      struct cgroup *cgrp = cg->subsys[i]->cgroup;
      if (atomic_dec_and_test(&cgrp->count) &&
         notify_on_release(cgrp)) {
         if (taskexit)
            set_bit(CGRP_RELEASABLE, &cgrp->flags);
         check_for_release(cgrp);
      }
}
......
......
}
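First, when a process exits, __put_css_set() is called with taskexit set to 1, so the CGRP_RELEASABLE bit of the cgroup's flags is set here.
If atomic_dec_and_test(&cgrp->count) returns true, the cgroup the process belonged to has no other process left, so the exiting process is the last one in the cgroup.
notify_on_release(cgrp) is as follows: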
static int notify_on_release(const struct cgroup *cgrp)
{
return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
}
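It tests whether the cgroup has the CGRP_NOTIFY_ON_RELEASE flag set.
Putting this together: if the last process in a cgroup exits and the cgroup has CGRP_NOTIFY_ON_RELEASE set, control reaches check_for_release(). The code is as follows: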
static void check_for_release(struct cgroup *cgrp)
{
/* All of these checks rely on RCU to keep the cgroup
   * structure alive */
if (cgroup_is_releasable(cgrp) && !atomic_read(&cgrp->count)
      && list_empty(&cgrp->children) && !cgroup_has_css_refs(cgrp)) {
      /* Control Group is currently removeable. If it's not
      * already queued for a userspace notification, queue
      * it now */
      int need_schedule_work = 0;
      spin_lock(&release_list_lock);
      if (!cgroup_is_removed(cgrp) &&
         list_empty(&cgrp->release_list)) {
         list_add(&cgrp->release_list, &release_list);
         need_schedule_work = 1;
      }
      spin_unlock(&release_list_lock);
      if (need_schedule_work)
         schedule_work(&release_agent_work);
}
}
First, the following four conditions must all hold before it goes any further:
1: cgroup_is_releasable() returns 1.
The code is as follows:
static int cgroup_is_releasable(const struct cgroup *cgrp)
{
const int bits =
      (1 << CGRP_RELEASABLE) |
      (1 << CGRP_NOTIFY_ON_RELEASE);
return (cgrp->flags & bits) == bits;
}
It checks that the cgroup has both the CGRP_RELEASABLE and CGRP_NOTIFY_ON_RELEASE flags. As analyzed above, CGRP_RELEASABLE is set when a process exits.
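The `(flags & bits) == bits` idiom tests that all of the selected bits are set, not just any one of them. A tiny user-space sketch follows; the bit numbers and `demo_*` names are hypothetical, since the kernel takes the real values from an enum in its cgroup header.

```c
#include <assert.h>

/* Hypothetical bit numbers, for illustration only. */
enum {
    DEMO_RELEASABLE = 1,
    DEMO_NOTIFY_ON_RELEASE = 2
};

/* Return flags with the given bit set (a plain, non-atomic set_bit). */
unsigned long demo_set_bit(unsigned long flags, int bit)
{
    assert(bit >= 0 && bit < (int)(sizeof(flags) * 8));
    return flags | (1UL << bit);
}

/* True only when BOTH flags are present, as in cgroup_is_releasable(). */
int demo_is_releasable(unsigned long flags)
{
    const unsigned long bits =
        (1UL << DEMO_RELEASABLE) |
        (1UL << DEMO_NOTIFY_ON_RELEASE);

    return (flags & bits) == bits;
}
```

Writing `flags & bits` alone would be true as soon as either flag is set, which is why the comparison against `bits` is needed.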

2: The cgroup's reference count is 0.
3: The cgroup has no child cgroups.
4: cgroup_has_css_refs() returns 0. The code is as follows:
static int cgroup_has_css_refs(struct cgroup *cgrp)
{
int i;
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
      struct cgroup_subsys *ss = subsys[i];
      struct cgroup_subsys_state *css;
      /* Skip subsystems not in this hierarchy */
      if (ss->root != cgrp->root)
         continue;
      css = cgrp->subsys[ss->subsys_id];
      if (css && atomic_read(&css->refcnt))
         return 1;
}
return 0;
}
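In other words, the reference count of every cgroup_subsys_state attached to the cgroup must be 0.

Once these conditions hold, the cgroup can be released. It is therefore linked onto release_list and the work item is scheduled. The rest of the job is done in the workqueue.
Let's trace how the workqueue handles the remaining work.
release_agent_work is defined as follows:
static DECLARE_WORK(release_agent_work, cgroup_release_agent);
The handler of this work item is cgroup_release_agent(). The code is as follows: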
static void cgroup_release_agent(struct work_struct *work)
{
BUG_ON(work != &release_agent_work);
mutex_lock(&cgroup_mutex);
spin_lock(&release_list_lock);
/* Walk the list until it is empty */
while (!list_empty(&release_list)) {
      char *argv[3], *envp[3];
      int i;
      char *pathbuf = NULL, *agentbuf = NULL;
      /* Get the cgroup for this list entry */
      struct cgroup *cgrp = list_entry(release_list.next,
                        struct cgroup,
                        release_list);
      /* Unlink the cgroup from release_list */
      list_del_init(&cgrp->release_list);
      spin_unlock(&release_list_lock);
      /* Store the cgroup's path in pathbuf */
      pathbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
      if (!pathbuf)
         goto continue_free;
      if (cgroup_path(cgrp, pathbuf, PAGE_SIZE) < 0)
         goto continue_free;
      /* agentbuf holds a copy of release_agent_path */
      agentbuf = kstrdup(cgrp->root->release_agent_path, GFP_KERNEL);
      if (!agentbuf)
         goto continue_free;
      /* Set up the arguments and the environment */
      i = 0;
      argv[i++] = agentbuf;
      argv[i++] = pathbuf;
      argv[i] = NULL;

      i = 0;
      /* minimal command environment */
      envp[i++] = "HOME=/";
      envp[i++] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin";
      envp[i] = NULL;

      /* Drop the lock while we invoke the usermode helper,
      * since the exec could involve hitting disk and hence
      * be a slow process */
      /* Run the user-space program */
      mutex_unlock(&cgroup_mutex);
      call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
      mutex_lock(&cgroup_mutex);
continue_free:
      kfree(pathbuf);
      kfree(agentbuf);
      spin_lock(&release_list_lock);
}
spin_unlock(&release_list_lock);
mutex_unlock(&cgroup_mutex);
}
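This function walks the cgroups on release_list and, for each one, runs the program named by root->release_agent_path with the cgroup's path as its argument.
Let's run the following experiment.
The experiment needs two small test programs. The code is as follows: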
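The way the agent is invoked can be sketched in user space. The following is only an illustration, not kernel code: build_agent_args() and run_agent() are hypothetical names, and fork()/execve() stand in for call_usermodehelper(). One difference is deliberate and should be kept in mind: run_agent() waits for the agent to exit, while the kernel call above uses UMH_WAIT_EXEC and only waits for the exec.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fill argv/envp the same way cgroup_release_agent()
 * does. Each array must have room for three pointers. */
void build_agent_args(char *agent, char *cgroup_path,
                      char *argv[3], char *envp[3])
{
    int i = 0;

    assert(agent != NULL && cgroup_path != NULL);
    argv[i++] = agent;        /* argv[0]: the agent program itself */
    argv[i++] = cgroup_path;  /* argv[1]: path of the released cgroup */
    argv[i] = NULL;

    i = 0;
    envp[i++] = "HOME=/";
    envp[i++] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin";
    envp[i] = NULL;
}

/* Hypothetical user-space stand-in for call_usermodehelper(): run the
 * agent in a child process and wait for it to exit.
 * Returns the agent's exit status, or -1 on failure. */
int run_agent(char *agent, char *cgroup_path)
{
    char *argv[3], *envp[3];
    int status;
    pid_t pid;

    build_agent_args(agent, cgroup_path, argv, envp);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(argv[0], argv, envp);
        _exit(127);           /* only reached if execve() failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The agent therefore sees exactly two arguments, its own path and the path of the cgroup relative to the hierarchy root, which is what the experiment below records.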
Test.c

#include <stdio.h>
#include <unistd.h>

int main(void)
{
      int i = 30;
      /* sleep one second at a time, 30 times in total */
      while(i){
            i--;
            sleep(1);
      }
      return 0;
}

This process sleeps for 30s and then exits. Compile it as test.

The other program is as follows:
Main.c
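#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[])
{
      char buf[125] = "";
      int i = 0;

      /* start from an empty log file */
      snprintf(buf,sizeof(buf),"rm -f /var/eric_test");
      system(buf);

      /* append every argument, including argv[0], on its own line */
      while(i < argc){
            snprintf(buf,sizeof(buf),"echo %s >> /var/eric_test",argv[i]);
            system(buf);
            i++;
      }

      return 0;
}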
It simply writes its invocation arguments to /var/eric_test.
Now we can start the test. The mount point already contains one child cgroup, as shown below:
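Main.c builds shell commands with a fixed 125-byte buffer, which is fine for this experiment but fragile for long or unusual paths. As an alternative sketch (not the author's program), the same logging can be done with plain stdio and no shell. The function name log_args() and its parameters are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical replacement for the body of Main.c: write each argument on
 * its own line to `path`, truncating any previous contents.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
int log_args(const char *path, int argc, char *argv[])
{
    FILE *fp;
    int i;

    assert(path != NULL && argv != NULL);
    fp = fopen(path, "w");    /* "w" replaces the earlier "rm -f" step */
    if (!fp)
        return -1;
    for (i = 0; i < argc; i++) {
        if (fprintf(fp, "%s\n", argv[i]) < 0) {
            fclose(fp);
            return -1;
        }
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```

A release agent built this way would call log_args("/var/eric_test", argc, argv) from main() and should produce the same file contents as the shell-based version.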
.
|-- debug.cgroup_refcount
|-- debug.current_css_set
|-- debug.current_css_set_refcount
|-- debug.releasable
|-- debug.taskcount
|-- eric
| |-- debug.cgroup_refcount
| |-- debug.current_css_set
| |-- debug.current_css_set_refcount
| |-- debug.releasable
| |-- debug.taskcount
| |-- notify_on_release
| `-- tasks
|-- notify_on_release
|-- release_agent
`-- tasks

Next, set release_agent_path and the CGRP_NOTIFY_ON_RELEASE flag with the following commands:
[root@localhost cgroup]# echo /root/main > release_agent
[root@localhost cgroup]# echo 1 > eric/notify_on_release
Now add a process to the child cgroup:
[root@localhost cgroup]# /root/test &
[1] 4350
[root@localhost cgroup]# echo 4350 > eric/tasks
[root@localhost cgroup]#
[1]+  Done                   /root/test
Once /root/test has finished, the notify_on_release action is triggered. Let's verify:
[root@localhost cgroup]# cat /var/eric_test
/root/main
/eric
Everything matches the analysis above.

9.2: Detaching a process from a cgroup
When the last process in a cgroup is detached from it, the notify_on_release procedure also runs. See the code fragment below:
int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
{
int retval = 0;
struct cgroup_subsys *ss;
struct cgroup *oldcgrp;
struct css_set *cg = tsk->cgroups;
......
......
set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
synchronize_rcu();
put_css_set(cg);
}
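We analyzed this function earlier, but left out the notify_on_release part at that point. We add it back here.
In the code, cg points to the css_set that the process originally referenced.
oldcgrp is the cgroup the process belonged to before the move.
The code sets the CGRP_RELEASABLE flag on oldcgrp and then calls put_css_set(). put_css_set() is the path we analyzed above: if the cgroup has become empty, the notify_on_release action is triggered.
Again, a test:
We continue with the test environment from above. First, look at the contents of the relevant files: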
[root@localhost cgroup]# cat release_agent
/root/main
[root@localhost cgroup]# cat eric/tasks
[root@localhost cgroup]# cat eric/notify_on_release
1
[root@localhost cgroup]# pwd
/dev/cgroup
Now the test begins:
[root@localhost cgroup]# rm -rf /var/eric_test
[root@localhost cgroup]# echo 1701 > eric/tasks
[root@localhost cgroup]# echo 1701 >tasks
[root@localhost cgroup]# cat /var/eric_test
/root/main
/eric
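In this test, /var/eric_test is deleted first so that earlier results do not interfere. Process 1701 is then attached to the cgroup eric, and after that moved back to the top-level cgroup. This leaves eric with no attached processes, so the notify_on_release procedure runs. The test confirms this.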

9.3: Removing a cgroup
When the last child cgroup under a cgroup is removed, notify_on_release also happens.
Here is the relevant fragment of the removal code:
static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
{
......
......
set_bit(CGRP_RELEASABLE, &parent->flags);
check_for_release(parent);
......
}
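In the code, parent is the cgroup one level above the one being removed. When a cgroup is removed, the CGRP_RELEASABLE bit is set in the parent's flags, and control again passes to check_for_release(). So if the parent cgroup is now empty, the notify_on_release action is triggered.
The test is as follows:
We keep using the test environment from above. First, look at the initial state: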
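The conditions that lead to a notification can be summarized in a small user-space model. This is a simplified sketch, not kernel code: struct model_cgroup and both function names are hypothetical, the integer counters stand in for cgrp->count, the children list and the css reference counts, and the locking done by the real check_for_release() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag bit numbers, named after the kernel's CGRP_* flags. */
enum { CGRP_REMOVED, CGRP_RELEASABLE, CGRP_NOTIFY_ON_RELEASE };

/* Hypothetical, simplified stand-in for struct cgroup. */
struct model_cgroup {
    unsigned long flags;
    int users;        /* stands in for cgrp->count */
    int children;     /* stands in for the children list */
    int css_refs;     /* stands in for the css reference counts */
};

/* True when the model cgroup would be queued for the release agent:
 * both flags set, and no users, children or css references left. */
bool model_should_notify(const struct model_cgroup *cg)
{
    unsigned long bits = (1UL << CGRP_RELEASABLE) |
                         (1UL << CGRP_NOTIFY_ON_RELEASE);

    return (cg->flags & bits) == bits &&
           cg->users == 0 && cg->children == 0 && cg->css_refs == 0;
}

/* Models the rmdir path: removing a child marks the parent releasable,
 * then the release check runs on the parent. */
bool model_rmdir_child(struct model_cgroup *parent)
{
    assert(parent->children > 0);
    parent->children--;
    parent->flags |= 1UL << CGRP_RELEASABLE;
    return model_should_notify(parent);
}
```

In this model, a parent with notify_on_release set and a single child triggers a notification as soon as that child is removed, while a parent without the flag never does.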
[root@localhost cgroup]# pwd
/dev/cgroup
[root@localhost cgroup]# cat release_agent
/root/main
[root@localhost cgroup]# cat eric/notify_on_release
1
Add another cgroup level under eric:
[root@localhost cgroup]# mkdir eric/test
[root@localhost cgroup]# tree
.
|-- debug.cgroup_refcount
|-- debug.current_css_set
|-- debug.current_css_set_refcount
|-- debug.releasable
|-- debug.taskcount
|-- eric
| |-- debug.cgroup_refcount
| |-- debug.current_css_set
| |-- debug.current_css_set_refcount
| |-- debug.releasable
| |-- debug.taskcount
| |-- notify_on_release
| |-- tasks
| `-- test
|    |-- debug.cgroup_refcount
|    |-- debug.current_css_set
|    |-- debug.current_css_set_refcount
|    |-- debug.releasable
|    |-- debug.taskcount
|    |-- notify_on_release
|    `-- tasks
|-- notify_on_release
|-- release_agent
`-- tasks

2 directories, 22 files
Then run the following commands:
[root@localhost cgroup]# rm -rf /var/eric_test
[root@localhost cgroup]# rmdir eric/test/
[root@localhost cgroup]# cat /var/eric_test
/root/main
/eric
As shown above, removing the only cgroup under eric triggers the notify_on_release procedure.

10: The cgroup proc nodes
10.1: /proc/cgroups
When we analyzed cgroup initialization earlier, cgroup_init() contained the following fragment:
int __init cgroup_init(void)
{
......
......
proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations);
......
......
}
In other words, a file named cgroups is created in the root of proc, as shown below:
[root@localhost cgroup]# ls /proc/cgroups
/proc/cgroups
Next we analyze the operations on this file.
Its operation table is proc_cgroupstats_operations, defined as follows:
static struct file_operations proc_cgroupstats_operations = {
.open = cgroupstats_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
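};
As the table shows, no write operation is provided, so this file is read-only.
First look at what happens on open. The corresponding handler is cgroupstats_open(). The code is as follows: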
static int cgroupstats_open(struct inode *inode, struct file *file)
{
return single_open(file, proc_cgroupstats_show, NULL);
}
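single_open() is very simple. It is one of the interfaces provided by seq_file, which we already covered above, so we do not analyze it in detail again. It points the show operation of the seq_file at proc_cgroupstats_show().
In the proc_cgroupstats_operations structure above, the read operation is seq_read(), which simply calls the corresponding seq_file operations. Since open has already pointed the show interface at proc_cgroupstats_show(), a read ends up there. The code is as follows: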
static int proc_cgroupstats_show(struct seq_file *m, void *v)
{
int i;

seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
mutex_lock(&cgroup_mutex);
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
      struct cgroup_subsys *ss = subsys[i];
      seq_printf(m, "%s\t%lu\t%d\t%d\n",
            ss->name, ss->root->subsys_bits,
            ss->root->number_of_cgroups, !ss->disabled);
}
mutex_unlock(&cgroup_mutex);
return 0;
}
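As the code shows, for every subsys in the system it prints the subsys name, the subsys bitmask of the hierarchy it belongs to, the number of cgroups in that hierarchy, and whether the subsys is enabled.
The test is as follows: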
[root@localhost cgroup]# cat /proc/cgroups
#subsys_name hierarchy    num_cgroups    enabled
cpuset  0    1    1
debug 2    2    1
ns    0    1    1
cpuacct 0    1    1
memory  0    1    1
devices 0    1    1
freezer 0    1    1
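This shows the state of every subsys and hierarchy. The debug line differs from the other subsys because we are still in the environment used for the notify_on_release tests, as shown below: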
[root@localhost cgroup]# tree ../cgroup/
../cgroup/
|-- debug.cgroup_refcount
|-- debug.current_css_set
|-- debug.current_css_set_refcount
|-- debug.releasable
|-- debug.taskcount
|-- eric
| |-- debug.cgroup_refcount
| |-- debug.current_css_set
| |-- debug.current_css_set_refcount
| |-- debug.releasable
| |-- debug.taskcount
| |-- notify_on_release
| `-- tasks
|-- notify_on_release
|-- release_agent
`-- tasks

1 directory, 15 files
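Because proc_cgroupstats_show() emits one tab-separated line per subsys, the file is easy to consume from a program. The sketch below parses a single line of that output. The struct cgroupstat layout and the function name are assumptions made for this example, and the 31-character limit on the name is an arbitrary choice.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one line of /proc/cgroups. */
struct cgroupstat {
    char name[32];            /* subsys name, e.g. "debug" */
    unsigned long hierarchy;  /* subsys bitmask of the owning hierarchy */
    int num_cgroups;          /* number of cgroups in that hierarchy */
    int enabled;              /* 1 if the subsys is enabled */
};

/* Parse one line as printed by proc_cgroupstats_show().
 * Returns 0 on success, -1 for the '#' header line or malformed input. */
int parse_cgroupstat_line(const char *line, struct cgroupstat *out)
{
    assert(line != NULL && out != NULL);
    if (line[0] == '#')
        return -1;
    if (sscanf(line, "%31s %lu %d %d", out->name, &out->hierarchy,
               &out->num_cgroups, &out->enabled) != 4)
        return -1;
    return 0;
}
```

Fed the debug line from the output above, this yields hierarchy 2 and num_cgroups 2, which matches the mounted hierarchy containing the top-level cgroup and eric.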

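10.2: The cgroup file in each process's proc directory
Besides the cgroups file at the top level of proc, every process directory under proc also contains a file named cgroup, as shown below: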
[root@localhost cgroup]# ls /proc/648/cgroup
/proc/648/cgroup

Let's look at the operations behind this file:
static const struct pid_entry tid_base_stuff[] = {
......
......
#ifdef CONFIG_CGROUPS
REG("cgroup",  S_IRUGO, cgroup),
#endif
......
}

#define REG(NAME, MODE, OTYPE)             \
NOD(NAME, (S_IFREG|(MODE)), NULL,    \
      &proc_##OTYPE##_operations, {})
As shown above, the operations for the cgroup file are &proc_cgroup_operations.
They are defined as follows:
struct file_operations proc_cgroup_operations = {
.open    = cgroup_open,
.read    = seq_read,
.llseek    = seq_lseek,
.release = single_release,
};
The open operation is cgroup_open(), defined as follows:
static int cgroup_open(struct inode *inode, struct file *file)
{
struct pid *pid = PROC_I(inode)->pid;
return single_open(file, proc_cgroup_show, pid);
}
Here is single_open() again. As in the analysis above, a read ends up in proc_cgroup_show(). The code is as follows:
static int proc_cgroup_show(struct seq_file *m, void *v)
{
struct pid *pid;
struct task_struct *tsk;
char *buf;
int retval;
struct cgroupfs_root *root;

retval = -ENOMEM;
buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf)
      goto out;

retval = -ESRCH;
pid = m->private;
tsk = get_pid_task(pid, PIDTYPE_PID);
if (!tsk)
      goto out_free;

retval = 0;

mutex_lock(&cgroup_mutex);

/* Walk every cgroupfs_root */
for_each_root(root) {
      struct cgroup_subsys *ss;
      struct cgroup *cgrp;
      int subsys_id;
      int count = 0;

      /* Skip this hierarchy if it has no active subsystems */
      /* rootnode is one such case: move on to the next hierarchy */
      if (!root->actual_subsys_bits)
         continue;
      /* Print the subsys bitmask of the hierarchy */
      seq_printf(m, "%lu:", root->subsys_bits);
      /* Print the names of the subsys in the hierarchy */
      for_each_subsys(root, ss)
         seq_printf(m, "%s%s", count++ ? "," : "", ss->name);
      seq_putc(m, ':');
      /* Path of the cgroup the process belongs to */
      get_first_subsys(&root->top_cgroup, NULL, &subsys_id);
      cgrp = task_cgroup(tsk, subsys_id);
      retval = cgroup_path(cgrp, buf, PAGE_SIZE);
      if (retval < 0)
         goto out_unlock;
      seq_puts(m, buf);
      seq_putc(m, '\n');
}

out_unlock:
mutex_unlock(&cgroup_mutex);
put_task_struct(tsk);
out_free:
kfree(buf);
out:
return retval;
}
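The core of the function is the for loop, and the comments already explain each step, so we do not analyze it further here.
I rebooted the virtual machine *^_^*, so the environment is no longer the earlier test environment.
Let's test: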
[root@localhost ~]# cat /proc/646/cgroup
[root@localhost ~]#
The empty output shows that the system has no hierarchy yet.
Next, mount one:
[root@localhost ~]# mkdir /dev/cgroup
[root@localhost ~]# mount -t cgroup cgroup -o debug /dev/cgroup/
[root@localhost ~]# cat /proc/646/cgroup
2:debug:/
[root@localhost ~]#
As shown above, the system now has one hierarchy, bound to the debug subsys, and the process sits in its top-level cgroup.
Continuing the test:
[root@localhost ~]# mkdir /dev/cgroup/eric
[root@localhost ~]# echo 646 > /dev/cgroup/eric/tasks
[root@localhost ~]# cat /proc/646/cgroup
2:debug:/eric
[root@localhost ~]#
As we can see, the process is now in the cgroup eric.
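Each line printed by proc_cgroup_show() has the form bitmask:subsys-list:path, so a line such as 2:debug:/eric can be split with a single sscanf() call. The following is a minimal sketch. The struct and function names are assumptions made for this example, and the buffer sizes are arbitrary. It also assumes the subsys list is never empty, which holds here because hierarchies without active subsystems are skipped by the kernel code above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one line of /proc/<pid>/cgroup. */
struct task_cgroup_entry {
    unsigned long hierarchy_bits;  /* root->subsys_bits of the hierarchy */
    char subsystems[64];           /* comma-separated subsys names */
    char path[256];                /* cgroup path inside the hierarchy */
};

/* Parse one "bits:subsys[,subsys...]:path" line.
 * Returns 0 on success, -1 if the line does not have that shape. */
int parse_task_cgroup_line(const char *line, struct task_cgroup_entry *out)
{
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line, "%lu:%63[^:]:%255[^\n]",
               &out->hierarchy_bits, out->subsystems, out->path);
    return n == 3 ? 0 : -1;
}
```

Applied to the two outputs above, the path field changes from / to /eric once the process has been written into eric/tasks, while the bitmask and subsys list stay the same.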

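11: Summary
This part described the whole cgroup framework at length. The framework itself is not complicated; what makes the code dizzying is the number of data structures and global variables involved. Sorting out those structures and variables is therefore the key to reading the cgroup code. The use of RCU and rw_mutex in cgroup also deserves a closer look, but for reasons of space it is not analyzed here. The following installments build on the cgroup framework to analyze several important subsys.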

This article comes from the ChinaUnix blog. To read the original, see: http://blog.chinaunix.net/u1/51562/showart_1736813.html
