The previous articles analyzed the initialization of the slub allocator and the creation of the slab caches; this article goes one step further and examines how the slub allocator actually allocates objects.
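Before diving into the implementation, here is a minimal usage sketch of the interface under analysis. The cache name, the object type and the function are made up purely for illustration:

#include <linux/slab.h>

struct my_obj {                          /* hypothetical object type */
    int  id;
    char name[32];
};

static struct kmem_cache *my_cachep;     /* hypothetical cache */

static int example(void)
{
    struct my_obj *obj;

    /* create the cache once, e.g. at module init time */
    my_cachep = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                  0, SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cachep)
        return -ENOMEM;

    /* the entry point analyzed below */
    obj = kmem_cache_alloc(my_cachep, GFP_KERNEL);
    if (!obj)
        return -ENOMEM;

    obj->id = 1;
    kmem_cache_free(my_cachep, obj);
    return 0;
}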
kmem_cache_alloc() is the entry function for allocating a slab object. Its implementation:
【file:/mm/slub.c】
void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
    void *ret = slab_alloc(s, gfpflags, _RET_IP_);

    trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size,
                s->size, gfpflags);

    return ret;
}
This function delegates the actual allocation to slab_alloc(), while trace_kmem_cache_alloc() records the allocation for tracing. Next, the implementation of slab_alloc():
【file:/mm/slub.c】
static __always_inline void *slab_alloc(struct kmem_cache *s,
        gfp_t gfpflags, unsigned long addr)
{
    return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr);
}
It is a thin wrapper around slab_alloc_node():
【file:/mm/slub.c】
/*
 * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
 * have the fastpath folded into their functions. So no function call
 * overhead for requests that can be satisfied on the fastpath.
 *
 * The fastpath works by first checking if the lockless freelist can be used.
 * If not then __slab_alloc is called for slow processing.
 *
 * Otherwise we can simply pick the next object from the lockless free list.
 */
static __always_inline void *slab_alloc_node(struct kmem_cache *s,
        gfp_t gfpflags, int node, unsigned long addr)
{
    void **object;
    struct kmem_cache_cpu *c;
    struct page *page;
    unsigned long tid;

    if (slab_pre_alloc_hook(s, gfpflags))
        return NULL;

    s = memcg_kmem_get_cache(s, gfpflags);
redo:
    /*
     * Must read kmem_cache cpu data via this cpu ptr. Preemption is
     * enabled. We may switch back and forth between cpus while
     * reading from one cpu area. That does not matter as long
     * as we end up on the original cpu again when doing the cmpxchg.
     *
     * Preemption is disabled for the retrieval of the tid because that
     * must occur from the current processor. We cannot allow rescheduling
     * on a different processor between the determination of the pointer
     * and the retrieval of the tid.
     */
    preempt_disable();
    c = __this_cpu_ptr(s->cpu_slab);

    /*
     * The transaction ids are globally unique per cpu and per operation on
     * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
     * occurs on the right processor and that there was no operation on the
     * linked list in between.
     */
    tid = c->tid;
    preempt_enable();

    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node)))
        object = __slab_alloc(s, gfpflags, node, addr, c);

    else {
        void *next_object = get_freepointer_safe(s, object);

        /*
         * The cmpxchg will only match if there was no additional
         * operation and if we are on the right processor.
         *
         * The cmpxchg does the following atomically (without lock
         * semantics!)
         *
         * 1. Relocate first pointer to the current per cpu area.
         * 2. Verify that tid and freelist have not been changed
         * 3. If they were not changed replace tid and freelist
         *
         * Since this is without lock semantics the protection is only
         * against code executing on this cpu *not* from access by
         * other cpus.
         */
        if (unlikely(!this_cpu_cmpxchg_double(
                s->cpu_slab->freelist, s->cpu_slab->tid,
                object, tid,
                next_object, next_tid(tid)))) {

            note_cmpxchg_failure("slab_alloc", s, tid);
            goto redo;
        }
        prefetch_freepointer(s, next_object);
        stat(s, ALLOC_FASTPATH);
    }

    if (unlikely(gfpflags & __GFP_ZERO) && object)
        memset(object, 0, s->object_size);

    slab_post_alloc_hook(s, gfpflags, object);

    return object;
}
If CONFIG_SLUB_DEBUG is enabled, the request first passes through slab_pre_alloc_hook(), which performs the pre-allocation checks and may veto the allocation; then, if CONFIG_MEMCG_KMEM is enabled, memcg_kmem_get_cache() replaces the kmem_cache pointer with the kmem_cache belonging to the current memcg (memory cgroup).
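For reference, slab_pre_alloc_hook() in this kernel version looks roughly like the following (paraphrased from mm/slub.c, not quoted verbatim): it masks the gfp flags, runs the lockdep/might-sleep annotations, and lets fault injection decide whether to fail the request:

static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
{
    flags &= gfp_allowed_mask;
    lockdep_trace_alloc(flags);
    might_sleep_if(flags & __GFP_WAIT);

    /* a non-zero return means: fail this allocation (fault injection) */
    return should_failslab(s->object_size, flags, s->flags);
}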
For the allocation itself, preempt_disable() first disables kernel preemption, __this_cpu_ptr() then fetches the kmem_cache_cpu structure of the current CPU, and once the kmem_cache_cpu tid value has been read, preempt_enable() re-enables preemption.
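The tid is a per-CPU transaction id used to detect both concurrent freelist operations and CPU migration. Roughly paraphrased from this kernel version, it is seeded with the CPU number and advanced in steps of a power of two no smaller than the number of CPUs, so tids of different CPUs never collide:

#ifdef CONFIG_PREEMPT
/* step over the whole cpu-number range so per-cpu tids never collide */
#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
#else
#define TID_STEP  1
#endif

static inline unsigned long next_tid(unsigned long tid)
{
    return tid + TID_STEP;
}

static inline unsigned int tid_to_cpu(unsigned long tid)
{
    return tid % TID_STEP;
}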
The following if (unlikely(!object || !node_match(page, node))) checks whether the current CPU's lockless freelist is empty, or whether the page backing the current CPU slab does not belong to the requested NUMA node. If either condition holds, the allocation falls back to __slab_alloc(); otherwise the else branch performs the fast-path allocation.
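node_match() itself is trivial; paraphrased from this kernel version, it only matters on NUMA builds:

static inline int node_match(struct page *page, int node)
{
#ifdef CONFIG_NUMA
    if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
        return 0;
#endif
    return 1;
}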
__slab_alloc() is analyzed later; for now look at what the else branch does. It first obtains the address of the next free object with get_freepointer_safe(), then tries to claim the current free object with the atomic this_cpu_cmpxchg_double() operation. On success, prefetch_freepointer() prefetches the next free object; on failure, note_cmpxchg_failure() records the event and the code jumps back to the redo label to retry the allocation. The key piece here is the atomic this_cpu_cmpxchg_double(). It does three things: 1) it relocates its first pointer to the per-CPU area of the CPU that is currently running; 2) it verifies that tid and freelist have not been changed; 3) if both are unchanged, i.e. the comparisons match, the allocation is known not to have been disturbed by another operation or by a CPU migration, and the new freelist and tid values are written in as the update.
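As a side note, get_freepointer_safe() simply reads the pointer to the next free object that SLUB stores inside the free object itself, at byte offset s->offset; paraphrased from this kernel version:

static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
    return *(void **)(object + s->offset);
}

static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
{
    void *p;

#ifdef CONFIG_DEBUG_PAGEALLOC
    /* the page may be unmapped under us; use a fault-tolerant read */
    probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
#else
    p = get_freepointer(s, object);
#endif
    return p;
}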
Returning to the cmpxchg: expanded into plain C, the work this_cpu_cmpxchg_double() performs here is roughly the following (in reality it is a single atomic instruction, not separate steps):
if (this_cpu_read(s->cpu_slab->freelist) == object &&
    this_cpu_read(s->cpu_slab->tid) == tid) {
        /* nothing changed and we are still on the same cpu: commit */
        this_cpu_write(s->cpu_slab->freelist, next_object);
        this_cpu_write(s->cpu_slab->tid, next_tid(tid));
        return true;
} else {
        return false;
}
Because all of this is carried out by a single atomic instruction, no lock/unlock pair is needed, and the cost of waiting for a contended lock on a multi-core system (as well as the window in which a CPU migration could intervene) is avoided entirely, which greatly improves efficiency. The same compare-and-swap retry pattern is an excellent tool for removing performance bottlenecks in ordinary application development as well.
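As a rough user-space analogue of that retry pattern, here is a hypothetical lock-free counter built on C11 atomics; unlike SLUB it only needs a single-word compare-and-swap and no transaction id:

#include <stdatomic.h>

static _Atomic long counter;

/* Add delta without a lock: retry whenever another thread won the race. */
static long lockfree_add(long delta)
{
    long old, new;

    do {
        old = atomic_load(&counter);   /* snapshot, like freelist/tid  */
        new = old + delta;             /* compute the intended update  */
    } while (!atomic_compare_exchange_weak(&counter, &old, new)); /* "redo" */

    return new;
}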
Once an object has been obtained, it is zeroed if the request carried the __GFP_ZERO flag, and slab_post_alloc_hook() then performs the post-allocation processing.
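This __GFP_ZERO handling is also what kmem_cache_zalloc() relies on: it is simply kmem_cache_alloc() with __GFP_ZERO or'ed into the flags. Using the hypothetical cache from the earlier sketch, the following two calls are equivalent:

/* both return a zero-filled object from my_cachep */
obj = kmem_cache_alloc(my_cachep, GFP_KERNEL | __GFP_ZERO);
obj = kmem_cache_zalloc(my_cachep, GFP_KERNEL);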
Finally, back to the implementation of __slab_alloc():
【file:/mm/slub.c】
/*
 * Slow path. The lockless freelist is empty or we need to perform
 * debugging duties.
 *
 * Processing is still very fast if new objects have been freed to the
 * regular freelist. In that case we simply take over the regular freelist
 * as the lockless freelist and zap the regular freelist.
 *
 * If that is not working then we fall back to the partial lists. We take the
 * first element of the freelist as the object to allocate now and move the
 * rest of the freelist to the lockless freelist.
 *
 * And if we were unable to get a new slab from the partial slab lists then
 * we need to allocate a new slab. This is the slowest path since it involves
 * a call to the page allocator and the setup of a new slab.
 */
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
          unsigned long addr, struct kmem_cache_cpu *c)
{
    void *freelist;
    struct page *page;
    unsigned long flags;

    local_irq_save(flags);
#ifdef CONFIG_PREEMPT
    /*
     * We may have been preempted and rescheduled on a different
     * cpu before disabling interrupts. Need to reload cpu area
     * pointer.
     */
    c = this_cpu_ptr(s->cpu_slab);
#endif

    page = c->page;
    if (!page)
        goto new_slab;
redo:

    if (unlikely(!node_match(page, node))) {
        stat(s, ALLOC_NODE_MISMATCH);
        deactivate_slab(s, page, c->freelist);
        c->page = NULL;
        c->freelist = NULL;
        goto new_slab;
    }

    /*
     * By rights, we should be searching for a slab page that was
     * PFMEMALLOC but right now, we are losing the pfmemalloc
     * information when the page leaves the per-cpu allocator
     */
    if (unlikely(!pfmemalloc_match(page, gfpflags))) {
        deactivate_slab(s, page, c->freelist);
        c->page = NULL;
        c->freelist = NULL;
        goto new_slab;
    }

    /* must check again c->freelist in case of cpu migration or IRQ */
    freelist = c->freelist;
    if (freelist)
        goto load_freelist;

    stat(s, ALLOC_SLOWPATH);

    freelist = get_freelist(s, page);

    if (!freelist) {
        c->page = NULL;
        stat(s, DEACTIVATE_BYPASS);
        goto new_slab;
    }

    stat(s, ALLOC_REFILL);

load_freelist:
    /*
     * freelist is pointing to the list of objects to be used.
     * page is pointing to the page from which the objects are obtained.
     * That page must be frozen for per cpu allocations to work.
     */
    VM_BUG_ON(!c->page->frozen);
    c->freelist = get_freepointer(s, freelist);
    c->tid = next_tid(c->tid);
    local_irq_restore(flags);
    return freelist;

new_slab:

    if (c->partial) {
        page = c->page = c->partial;
        c->partial = page->next;
        stat(s, CPU_PARTIAL_ALLOC);
        c->freelist = NULL;
        goto redo;
    }

    freelist = new_slab_objects(s, gfpflags, node, &c);

    if (unlikely(!freelist)) {
        if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
            slab_out_of_memory(s, gfpflags, node);

        local_irq_restore(flags);
        return NULL;
    }

    page = c->page;
    if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
        goto load_freelist;

    /* Only entered in the debug case */
    if (kmem_cache_debug(s) &&
            !alloc_debug_processing(s, page, freelist, addr))
        goto new_slab; /* Slab failed checks. Next slab needed */

    deactivate_slab(s, page, get_freepointer(s, freelist));
    c->page = NULL;
    c->freelist = NULL;
    local_irq_restore(flags);
    return freelist;
}
__slab_alloc() is the slow path of slab allocation, entered because the lockless freelist is empty or because debugging work has to be done.
The function first calls local_irq_save() to disable interrupts on the local processor while remembering their previous state. If CONFIG_PREEMPT is configured, the per-CPU pointer is re-read with this_cpu_ptr(), because the task may have been preempted and rescheduled on a different CPU before interrupts were disabled. If c->page is NULL, i.e. there is no CPU-local slab, the new_slab branch is taken to obtain one.
When c->page is not NULL, node_match() checks whether the page belongs to the requested node; on a mismatch the CPU-local slab is deactivated with deactivate_slab(). After that, pfmemalloc_match() checks whether the page's pfmemalloc state is compatible with the request (a page allocated from the emergency reserves may only serve allocations that are themselves entitled to those reserves); if it is not, the slab is deactivated in the same way.
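pfmemalloc_match() is a one-line policy check; paraphrased from this kernel version:

static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
{
    /* a reserve-backed slab page may only serve reserve-entitled requests */
    if (unlikely(PageSlabPfmemalloc(page)))
        return gfp_pfmemalloc_allowed(gfpflags);

    return true;
}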
Next, c->freelist is checked again, because a CPU migration or an interrupt may have happened before local interrupts were disabled, so the local freelist may have become non-empty in the meantime. If it is non-empty, the code jumps to load_freelist, where an object is taken off the freelist, the per-CPU bookkeeping is updated, interrupts are restored and the object's address is returned. If it is still empty, the slow-path statistics are updated and get_freelist() is used to take over the page's regular freelist. if (!freelist) means that this also came back empty, so a new slab is needed; otherwise the refill statistics are updated and the load_freelist path hands out an object and returns.
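get_freelist() atomically detaches the page's regular freelist so that it can become the new lockless per-CPU freelist; a simplified paraphrase of its cmpxchg loop:

static inline void *get_freelist(struct kmem_cache *s, struct page *page)
{
    struct page new;
    unsigned long counters;
    void *freelist;

    do {
        freelist = page->freelist;
        counters = page->counters;
        new.counters = counters;

        new.inuse = page->objects;       /* hand every object to this cpu */
        new.frozen = freelist != NULL;   /* unfreeze if nothing was left  */
    } while (!__cmpxchg_double_slab(s, page,
            freelist, counters,
            NULL, new.counters,
            "get_freelist"));

    return freelist;
}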
Finally, look at the new_slab branch of this function. First, if (c->partial) checks whether the per-CPU partial list is non-empty; if it is not, a slab is taken from it and the code jumps back to redo to retry the allocation. If the partial list is empty, all currently available slabs are fully in use, so a new slab has to be obtained through new_slab_objects(). If that fails, if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) checks that the request did not suppress warnings and that the console rate limit has not been hit, in which case slab_out_of_memory() logs the failure; interrupts are then restored and NULL is returned to signal the failed allocation. Otherwise, if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags))) checks whether debugging is disabled and the page's pfmemalloc state matches the request; if so, the code jumps to load_freelist to hand out the object. Failing that, if (kmem_cache_debug(s) && !alloc_debug_processing(s, page, freelist, addr)) checks whether debugging is enabled and the debug processing of the new slab failed, and in that case jumps back to new_slab to get yet another slab. In the remaining cases (a debug cache whose checks passed, or a page that failed the pfmemalloc match), deactivate_slab() deactivates the page, interrupts are restored, and the object is returned.
Let us dig a little deeper into new_slab_objects():
【file:/mm/slub.c】
static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
        int node, struct kmem_cache_cpu **pc)
{
    void *freelist;
    struct kmem_cache_cpu *c = *pc;
    struct page *page;

    freelist = get_partial(s, flags, node, c);

    if (freelist)
        return freelist;

    page = new_slab(s, flags, node);
    if (page) {
        c = __this_cpu_ptr(s->cpu_slab);
        if (c->page)
            flush_slab(s, c);

        /*
         * No other reference to the page yet so we can
         * muck around with it freely without cmpxchg
         */
        freelist = page->freelist;
        page->freelist = NULL;

        stat(s, ALLOC_SLAB);
        c->page = page;
        *pc = c;
    } else
        freelist = NULL;

    return freelist;
}
Before trying to create a new slab, the function first calls get_partial() to find a slab that still has free objects and, if one exists, returns an object from it. Otherwise new_slab() creates a brand-new slab; once it has been created (flushing the old CPU slab first if one was still attached), the new page's freelist is detached and returned, and the page becomes the CPU slab.
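get_partial() first scans the partial list of the preferred node and, for requests that did not insist on a particular node, falls back to the partial lists of the other nodes; paraphrased from this kernel version:

static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
        struct kmem_cache_cpu *c)
{
    void *object;
    int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

    /* try partially used slabs of the preferred node first */
    object = get_partial_node(s, get_node(s, searchnode), c, flags);
    if (object || node != NUMA_NO_NODE)
        return object;

    /* the caller does not care about the node: scan the other nodes too */
    return get_any_partial(s, flags, c);
}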
And that is the object allocation flow of the slub allocator.