kernel 3.10内核源码分析--slab原理及相关代码-星闪夜空-ChinaUnix博客

星闪夜空

首页　| 　博文目录　| 　关于我

星闪夜空

博客访问： 165545
博文数量： 14
博客积分： 10
博客等级：民兵
技术积分： 1113
用户组：普通用户
注册时间： 2012-10-24 14:22

个人简介

喜欢Linux

文章分类

全部博文（14）

文章存档

2015年（5）

2013年（9）

我的朋友

相关博文

kernel 3.10内核源码分析--slab原理及相关代码

分类： LINUX

2015-12-18 12:11:11

原文地址：kernel 3.10内核源码分析--slab原理及相关代码作者：humjb_1983

kernel 3.10内核源码分析--slab原理及相关代码

1、基本原理

我们知道，Linux保护模式下，采用分页机制，内核中物理内存使用buddy system(伙伴系统)进行管理，管理的内存单元大小为一页，也就是说使用buddy system分配内存最少需要分配一页大小。那如果需要分配小于一页的内存该怎么办呢？

另一方面，内核中经常需要大量的数据结构(比如struct task_strcut)，这些数据结构的频繁分配和释放对性能影响较大。

Slab正是用于解决上述的两个问题， Slab 分配器源于 Solaris 2.4 的分配算法，工作于buddy system之上，用于管理特定大小对象的缓存，提高小块内存或特定对象内存分配效率。

Slab的两个用途如前面所述：1、缓存和管理内核中经常使用的数据结构对象，内核中使用slab提供的专用的接口，可以实现数据结构对象的快速分配，大大减少相关开销，提升效率。2、缓存和管理小块内存，也称通用缓存，用于kmalloc的底层实现和支撑。小块内存即小于一页大小的内存，以2^n字节对齐，比如512B、128B、64B、32B等。

2、slab管理结构图

3、缓冲区(kmem_cache)

Slab分配器为每种内核对象建立单独的缓冲区(kmem_cache)，每个缓冲区包含多个 slab ，每个 slab包含一组连续的物理内存页，这些内存页被划分成固定数目的对象。根据对象大小的不同，缺省情况下一个 slab 最多可以由 1024 个物理内存页框构成。充分利用硬件特性，需要对每个对象进行一定的对齐处理(比如cache line对齐)，所以slab 中分配给对象的内存可能大于用户要求的对象实际大小，可能会有一定的内存浪费。

内核使用 kmem_cache 数据结构管理缓冲区。每个缓冲区中，对每个NUMA node都有三个slab链表：

Full 链表，该链表中的所有slab中的对象都已经被完全分配，没有空闲对象。

Partial 链表，该链表中的slab中还有空闲对象，从slab中分配对象时，优先从此链表中分配。当 slab 的最后一个已分配对象被释放时，该 slab 将从 Partial 链表移入 free链表；当 slab 的最后一个空闲对象被分配时，该 slab 将从Partial 链表移入Full 链表里。

Free链表，该链表中的slab中全是空闲对象，当partial链表中slab用尽时，从这里分配，并将相应的slab从free链表移入partial链表。

当缓冲区中空闲对象总数不足时，则按需从伙伴系统中分配更多的page，用于slab；当空闲对象比较富余，free链表中的的部分 slab 可能被定期回收。

由于 kmem_cache 自身也是一种内核对象，所以需要一个专门的缓冲区。所有缓冲区的 kmem_cache 控制结构被组织成以 cache_chain 为队列头的一个双向循环队列，同时 cache_cache 全局变量指向kmem_cache 对象缓冲区的 kmem_cache 对象。
kmem_cache结构定义如下：

点击(此处)折叠或打开

/*用于存放slab的缓存(kmem_cache)描述符*/
struct kmem_cache {
/* 1) Cache tunables. Protected by cache_chain_mutex */
/*当per-CPU缓存列表为空时，从slab中获取对象的数目；或者cache_grow时一次分配的对象数目*/
unsigned int batchcount;
/*per-CPU缓存列表(array?)中保存的最大对象数目*/
unsigned int limit;
/*是否存在共享CPU高速缓存*/
unsigned int shared;
/*slab中管理对象的长度(包括对齐填充字节)*/
unsigned int size;
/*为提升对象索引效率，使用Newton-Raphson方法使用的参数。*/
u32 reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */
/*定义kmem_cache的性质，当前只使用了一个标志:CFLAG_OFF_SLAB，表示slab描述符位于slab数据区之外*/
unsigned int flags; /* constant flags */
/*每个kmem_cache(?)中可容纳的最大对象数目*/
unsigned int num; /* # of objs per slab */
/* 3) cache_grow/shrink */
/* order of pgs per slab (2^n) */
/*cache_grow或shrink时，一次性申请或释放的page order*/
unsigned int gfporder;
/* force GFP flags, e.g. GFP_DMA */
/*从buddy system中分配page时，使用的分配标志*/
gfp_t allocflags;
/*着色使用。最大的颜色数目*/
size_t colour; /* cache colouring range */
/*着色使用。着色偏移*/
unsigned int colour_off; /* colour offset */
/*当slab描述符(头部管理数据)存储在slab外部时，slabp_cache指向用于分配slab描述符的"通用缓存"(即用于分配kmalloc数据的slab缓存)*/
struct kmem_cache *slabp_cache;
/*单个slab头部管理数据的大小，包括slab描述符自身和kmem_bufctl_t数组的大小*/
unsigned int slab_size;
/* constructor func */
/*构造函数*/
void (*ctor)(void *obj);
/* 4) cache creation/removal */
/*kmem_cache名称，在/proc/slabinfo中可以看到*/
const char *name;
/*将所有的kmem_cache链入到全局链表中:cache_chain*/
struct list_head list;
/*引用计数*/
int refcount;
/*缓存中对象的长度，跟前面的size有何区别?*/
int object_size;
/*对齐使用*/
int align;
/* 5) statistics */
/*统计信息，打开slab调试开关时使用*/
#ifdef CONFIG_DEBUG_SLAB
unsigned long num_active;
unsigned long num_allocations;
unsigned long high_mark;
unsigned long grown;
unsigned long reaped;
unsigned long errors;
unsigned long max_freeable;
unsigned long node_allocs;
unsigned long node_frees;
unsigned long node_overflow;
atomic_t allochit;
atomic_t allocmiss;
atomic_t freehit;
atomic_t freemiss;
/*
* If debugging is enabled, then the allocator can add additional
* fields and/or padding to every object. size contains the total
* object size including these internal fields, the following two
* variables contain the offset to the user object and its size.
*/
int obj_offset;
#endif /* CONFIG_DEBUG_SLAB */
#ifdef CONFIG_MEMCG_KMEM
struct memcg_cache_params *memcg_params;
#endif
/* 6) per-cpu/per-node data, touched during every alloc/free */
/*
* We put array[] at the end of kmem_cache, because we want to size
* this array to nr_cpu_ids slots instead of NR_CPUS
* (see kmem_cache_init())
* We still use [NR_CPUS] and not [1] or [0] because cache_cache
* is statically defined, so we reserve the max number of cpus.
*
* We also need to guarantee that the list is able to accomodate a
* pointer for each node since "nodelists" uses the remainder of
* available pointers.
*/
/*用于管理slab链表(full、partial、free)的表头，每个node对应一个*/
struct kmem_cache_node **node;
/*per-CPU缓存列表(结构体末尾entry[]数组用于存放被释放的slab对象)，当slab对象被释放时，先释放到该列表中*/
struct array_cache *array[NR_CPUS + MAX_NUMNODES];
/*
* Do not add fields after array[]
*/
}

4、slab描述符

Slab使用 struct slab 数据结构(slab描述符)来描述其状态及进行相关管理。Slab描述符位于每个slab区域的起始，在为slab分配内存时，会先从buddy system中分配预定义数量的page作为slab区域，为充分利用硬件高速缓存，使用了着色机制，在slab区域的首部保留一定字节(通常根据cache line)长度的区域作为着色偏移，而slab描述符就位于此偏移之后(当slab管理数据位于slab内时)。
slab数据结构定义如下：

点击(此处)折叠或打开

struct slab {
union {
struct {
/*链入kmem_cache中的 slab_full 或者slab_patial 或者 slab_free链表中*/
struct list_head list;
/*着色偏移(包括了slab描述符和kmem_bufctl_t数组的长度)，slab起始地址+colouroff=s_mem(object区起始地址)*/
unsigned long colouroff;
/*
* object区起始地址=slab起始地址+colouroff，即实际slab缓存对象所处的起始地址，所有对象在slab区中
* 连续分布，但有做cacheline对齐处理
*/
void *s_mem; /* including colour offset */
unsigned int inuse; /* num of objs active in slab */
/*
* kmem_bufctl_t数组中第一个空闲object在kmem_bufctl_t数组中的索引号，使用此索引，内核无需使用复杂
* 的结构体(比如SunOS中使用的链表)，即可遍历所有空闲的slab对象。
*/
kmem_bufctl_t free;
/*用于在kmem_cache中索引指定node对应的slab链表*/
unsigned short nodeid;
};
struct slab_rcu __slab_cover_slab_rcu;
}

从内存分布情况看，紧接着slab描述符存放的是一个kmem_bufctl_t对象数组，该数组用于管理该slab中的空闲对象，稍后描述。

Slab描述符和kmem_bufctl_t对象数组一起组成了slab的管理数据。Slab管理数据可以直接存放于slab区域内，也可以单独放置。单独放置即放到用kmalloc分配的不同的slab区域中。内核如何选择，取决于slab对象的长度和已用对象的数量，简单算法如下：如果slab对象不超过 1/8 个物理内存页框的大小，那么slab管理数据直接存放于slab区域首部(着色偏移之后)；否则的话，存放在slab区域外部，位于由 kmalloc 分配的通用对象缓冲区中。

在kmem_bufctl_t对象数组之后，就是存放真正的slab对象(object)的区域，由slab->s_mem指向。slab->coloroff即为slab区域首部到slab->s_mem的偏移。slab->free中存放了该slab中第一个空闲对象的索引。

5、空闲对象管理

slab中的对象有 2 种状态：已用或空闲。Slab中使用了一个静态数组来管理空闲对象，该数组元素为kmem_bufctl_t对象(定义为无符号整数)，每个数组元素对应一个slab对象，数组元素中存放下一个空闲对象的索引，该slab中的第一个空闲对象的索引保存在slab->free中，而该数组最后一个元素中总是几个结束标记：BUFCTL_END，如此，slab中利用此巧妙而简洁的方式有效的管理了slab中的空闲对象。相比SunOS中使用链表来管理slab空闲对象，Linux中的设计更巧妙和简洁、效率更高。
kmem_bufctl_t定义如下：

点击(此处)折叠或打开

typedef unsigned int kmem_bufctl_t;

7、slab着色
为了充分利用硬件高速缓存，Slab分配器允许对象在L1硬件高速缓存中对齐，同时使用着色(color)机制。简单说，就是让同一缓冲区内不同 slab 中相同编号的对象的地址相互错开，以便将数据定位到不同的CPU cache line中，从而避免它们被放入同一个cache line中，导致颠簸，最终导致性能损失。

7、per-CPU slab对象缓存

为更好的利用CPU高速缓存，slab实现中，为每个CPU分配了一个slab对象的per-CPU缓存。当slab对象被释放时，会首先放入该缓存，当该缓存中的对象数目超限时，会将其移入相应的slab链表中；当分配slab时，也会首先冲该缓存中查找，如果找到直接返回，如果未找到，才从partial链表中开始查找。当该缓存为空时，会从slab 中批量分配对象到该缓存。使用该per-CPU缓存，能进一步提升slab的分配和释放效率。

该缓存由kmem_cache->array[]管理，kmem_cache->array定义为array_cache结构(数组缓存)。 array_cache结构定义如下：

点击(此处)折叠或打开

/*
* slab中per-CPU缓存列表，当slab对象释放时会先放入此列表中，新对象分配时也会优先从此列表中分配，
* 如果此列表中没有可用的对象，再从相应的slab链表中分配，分配后会一次性填充batchcount个对象到此列表中
*/
struct array_cache {
unsigned int avail;/*列表中当前可用的对象数目*/
unsigned int limit;/*列表中最大的对象数目，跟kmem_cache->limit一样*/
unsigned int batchcount;/*跟kmem_cache->batchcount一样*/
unsigned int touched;/*表示缓存收缩后是否被访问过，在内存回收时需要检查此标记*/
spinlock_t lock;
/*用于存放被释放的slab对象的数组*/
void *entry[]; /*
* Must have this definition in here for the proper
* alignment of array_cache. Also simplifies accessing
* the entries.
*
* Entries should not be directly dereferenced as
* entries belonging to slabs marked pfmemalloc will
* have the lower bits set SLAB_OBJ_PFMEMALLOC
*/
};

该结构中的entry数组用于存放slab对象指针，用于指向最近释放的slab对象。

8、代码分析
主要分析slab对象的分配流程：kmem_cache_alloc()

点击(此处)折叠或打开

/*分配slab对象(object)*/
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
/*实际入口*/
void *ret = slab_alloc(cachep, flags, _RET_IP_);
/*调试使用*/
trace_kmem_cache_alloc(_RET_IP_, ret,
cachep->object_size, cachep->size, flags);
return ret;
}

kmem_cache_alloc()->slab_alloc()

点击(此处)折叠或打开

/*分配slab对象，内联函数*/
static __always_inline void *
slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
{
unsigned long save_flags;
void *objp;
flags &= gfp_allowed_mask;
/*死锁检测相关*/
lockdep_trace_alloc(flags);
if (slab_should_failslab(cachep, flags))
return NULL;
cachep = memcg_kmem_get_cache(cachep, flags);
/*相关标记检查*/
cache_alloc_debugcheck_before(cachep, flags);
/*实际分配操作执行之前，关中断*/
local_irq_save(save_flags);
/*执行实际的slab对象分配操作*/
objp = __do_cache_alloc(cachep, flags);
/*开中断*/
local_irq_restore(save_flags);
objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags,
flags);
prefetchw(objp);
if (likely(objp))
kmemcheck_slab_alloc(cachep, flags, objp, cachep->object_size);
/*如果设置了__GFP_ZERO标记，则对象数据清零*/
if (unlikely((flags & __GFP_ZERO) && objp))
memset(objp, 0, cachep->object_size);
return objp;
}

kmem_cache_alloc()->slab_alloc()->__do_cache_alloc()

点击(此处)折叠或打开

/*执行实际的slab对象分配操作，此处为非NUMA架构下的入口*/
__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
/*从local node上分配*/
return ____cache_alloc(cachep, flags);
}

kmem_cache_alloc()->slab_alloc()->__do_cache_alloc()->____cache_allc()

点击(此处)折叠或打开

/*slab对象实际分配函数*/
static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
void *objp;
struct array_cache *ac;
bool force_refill = false;
/*确认中断已关*/
check_irq_off();
/*取得当前处理器所在array_cache(即per-CPU缓存列表，简称为AC)*/
ac = cpu_cache_get(cachep);
/*如果ac中有可用的slab对象*/
if (likely(ac->avail)) {
/*设置touched标志，表明最近该ac被访问过，在内存回收时会检查该标记确认相应的缓存是否可回收*/
ac->touched = 1;
/*从ac中获取一个slab对象*/
objp = ac_get_obj(cachep, ac, flags, false);
/*
* Allow for the possibility all avail objects are not allowed
* by the current flags
*/
/*如果成功，更新命中计数，并返回*/
if (objp) {
STATS_INC_ALLOCHIT(cachep);
goto out;
}
force_refill = true;
}
/*如果ac中没有可用的对象，或者从ac中分配对象失败，更新miss计数*/
STATS_INC_ALLOCMISS(cachep);
/*从slab列表中获取对象*/
objp = cache_alloc_refill(cachep, flags, force_refill);
/*
* the 'ac' may be updated by cache_alloc_refill(),
* and kmemleak_erase() requires its correct value.
*/
ac = cpu_cache_get(cachep);
out:
/*
* To avoid a false negative, if an object that is in one of the
* per-CPU caches is leaked, we need to make sure kmemleak doesn't
* treat the array pointers as a reference to the object.
*/
if (objp)
kmemleak_erase(&ac->entry[ac->avail]);
return objp;
}

kmem_cache_alloc()->slab_alloc()->__do_cache_alloc()->____cache_allc()->cache_alloc_refill()

点击(此处)折叠或打开

/*从slab列表中获取slab对象，当现有的slab中的空间不足时，调用cache_grow对齐进行扩展*/
static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
bool force_refill)
{
int batchcount;
struct kmem_cache_node *n;
struct array_cache *ac;
int node;
/*再次确认中断已关*/
check_irq_off();
/*获取当前node id*/
node = numa_mem_id();
/*是否强制grow*/
if (unlikely(force_refill))
goto force_grow;
retry:
/*获取当前ac*/
ac = cpu_cache_get(cachep);
batchcount = ac->batchcount;
/*如果最近该ac不活跃，则需要部分填充*/
if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
/*
* If there was little recent activity on this cache, then
* perform only a partial refill. Otherwise we could generate
* refill bouncing.
*/
batchcount = BATCHREFILL_LIMIT;
}
/*或者针对每个node的slab链表管理结构*/
n = cachep->node[node];
BUG_ON(ac->avail > 0 || !n);
spin_lock(&n->list_lock);
/* See if we can refill from the shared array */
if (n->shared && transfer_objects(ac, n->shared, batchcount)) {
n->shared->touched = 1;
goto alloc_done;
}
while (batchcount > 0) {
struct list_head *entry;
struct slab *slabp;
/* Get slab alloc is to come from. */
/*从partial列表中分配*/
entry = n->slabs_partial.next;
/*当partial列表为空时，从free列表中分配*/
if (entry == &n->slabs_partial) {
n->free_touched = 1;
entry = n->slabs_free.next;
/*free列表为空，则必须grow cache了*/
if (entry == &n->slabs_free)
goto must_grow;
}
slabp = list_entry(entry, struct slab, list);
check_slabp(cachep, slabp);
check_spinlock_acquired(cachep);
/*
* The slab was either on partial or free list so
* there must be at least one object available for
* allocation.
*/
BUG_ON(slabp->inuse >= cachep->num);
/*将新找到的slab中的batchcount个对象放入ac(即per-CPU缓存列表中)*/
while (slabp->inuse < cachep->num && batchcount--) {
STATS_INC_ALLOCED(cachep);
STATS_INC_ACTIVE(cachep);
STATS_SET_HIGH(cachep);
ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp,
node));
}
check_slabp(cachep, slabp);
/* move slabp to correct slabp list: */
/*将从中分配对象的slab(此slab可能在partial或free的链表中)重新加入到合适的列表中*/
list_del(&slabp->list);/*Fixme:先删除再添加，有在同一个列表中删除后添加的可能，会不会效率太低了?*/
/*如果该slab中已经没有空闲对象了slabp->free指向下一个空闲对象，BUFCTL_END表示最后一个空闲对象*/
if (slabp->free == BUFCTL_END)
list_add(&slabp->list, &n->slabs_full);
/*如果该slab中还有空闲对象，则加入partial链表中*/
else
list_add(&slabp->list, &n->slabs_partial);
}
must_grow:
n->free_objects -= ac->avail;
alloc_done:
spin_unlock(&n->list_lock);
if (unlikely(!ac->avail)) {
int x;
/*partial和free链表中都没有可用的slab了，则必须新分配内存对kmem_cache进行扩充*/
force_grow:
x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL);
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
/*Fixme:这里node获取后没有再使用呢?*/
node = numa_mem_id();
/*cache grow失败*/
/* no objects in sight? abort */
if (!x && (ac->avail == 0 || force_refill))
return NULL;
/*
* 第一次grow后，通常ac->avail为0，然后会跳转到retry，重新从链表中选择slab，
* 然后重新将其添加到ac中。
* Fixme:这样效率是否也太低了?直接添加到ac中不行么? 有可能在refill的过程中由于开中断导致ac中有了新的slab?
*/
if (!ac->avail) /* objects refilled by interrupt? */
goto retry;
}
ac->touched = 1;
/*grow的流程中，由于前面已经retry，所以这里能保证ac中一定有需要的对象。另外没有grow的流程也会从这返回，此时ac中也一定是有对象可用的*/
return ac_get_obj(cachep, ac, flags, force_refill);
}

kmem_cache_alloc()->slab_alloc()->__do_cache_alloc()->____cache_allc()->cache_alloc_refill()->cache_grow()

点击(此处)折叠或打开

/*kmem_cache增长，每次新增一个slab，当kmem_cache_alloc()中时，发现原有kmem_cache中没有可用的对象时，进入此流程进行扩展*/
static int cache_grow(struct kmem_cache *cachep,
gfp_t flags, int nodeid, void *objp)
{
struct slab *slabp;
size_t offset;
gfp_t local_flags;
struct kmem_cache_node *n;
/*
* Be lazy and only check for valid flags here, keeping it out of the
* critical path in kmem_cache_alloc().
*/
BUG_ON(flags & GFP_SLAB_BUG_MASK);
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
/* Take the node list lock to change the colour_next on this node */
/*再次检查中断是否关闭*/
check_irq_off();
/*获取指定node对应的slab链表管理对象*/
n = cachep->node[nodeid];
/*加锁*/
spin_lock(&n->list_lock);
/* Get colour for the slab, and cal the next value. */
/*着色处理*/
offset = n->colour_next;
n->colour_next++;
if (n->colour_next >= cachep->colour)
n->colour_next = 0;
spin_unlock(&n->list_lock);
offset *= cachep->colour_off;
/*如果设置了__GFP_WAIT标志，需要开中断，允许高优先级任务先执行*/
if (local_flags & __GFP_WAIT)
local_irq_enable();
/*
* The test for missing atomic flag is performed here, rather than
* the more obvious place, simply to reduce the critical path length
* in kmem_cache_alloc(). If a caller is seriously mis-behaving they
* will eventually be caught here (where it matters).
*/
/*检查内存分配标记*/
kmem_flagcheck(cachep, flags);
/*
* Get mem for the objs. Attempt to allocate a physical page from
* 'nodeid'.
*/
/*从伙伴系统中分配物理内存页，用于slab*/
if (!objp)
objp = kmem_getpages(cachep, local_flags, nodeid);
if (!objp)
goto failed;
/* Get slab management. */
/*Slab头部管理数据创建，分On-Slab 和 Off-Slab*/
slabp = alloc_slabmgmt(cachep, objp, offset,
local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
if (!slabp)
goto opps1;
/*创建slab的各页与slab或缓冲区(kmem_cache)之间的关联*/
slab_map_pages(cachep, slabp, objp);
/*调用各slab对象的构造函数(如果有的话)，初始化新slab中的对象。通常都没有。*/
cache_init_objs(cachep, slabp);
if (local_flags & __GFP_WAIT)
local_irq_disable();
check_irq_off();
spin_lock(&n->list_lock);
/* Make slab active. */
/*将新分配的slab加入到free链表中，之前需要拿锁，并关中断*/
list_add_tail(&slabp->list, &(n->slabs_free));
/*更新相关计数*/
STATS_INC_GROWN(cachep);
n->free_objects += cachep->num;
spin_unlock(&n->list_lock);
return 1;
opps1:
kmem_freepages(cachep, objp);
failed:
if (local_flags & __GFP_WAIT)
local_irq_disable();
return 0;
}

阅读(3033) | 评论(0) | 转发(0) |

上一篇：slab着色与cpu硬件高速缓存

下一篇：TCP无保护消息边界的解决

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6