Category: LINUX
2008-09-26 15:14:35
The video codec team found that their standalone codec application performs worse under the 2.6.24 kernel than under the 2.6.19 kernel. To find the root cause quickly, they ran a set of unit test cases as a performance test and reported that one case (the large-read case) differs significantly between the two kernels, by roughly 20%.
Codec performance is measured as the duration (with microsecond
precision) from the start of decoding one frame to its end. This
performance is critical on our embedded multimedia system and also
matters to potential customers. So I got the unit test code and
started to analyze it, trying to find out why there is such a
difference and how to close the gap.
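The per-frame measurement described above can be sketched roughly as follows. This is only an illustration, not the codec team's harness: `now_us`, `time_one_frame_us`, `decode_fn` and `decode_stub` are hypothetical names, and the stub stands in for a real decode call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() in strict ISO C modes */
#endif
#include <stdint.h>
#include <time.h>

/* Current monotonic time in microseconds. */
static int64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Hypothetical signature for a per-frame decode entry point. */
typedef void (*decode_fn)(void *ctx);

/* Run one decode call and return its wall-clock duration in microseconds. */
static int64_t time_one_frame_us(decode_fn decode, void *ctx)
{
    int64_t start = now_us();
    decode(ctx);
    return now_us() - start;
}

/* Hypothetical stand-in for a real decode call: burns a little CPU. */
static void decode_stub(void *ctx)
{
    volatile unsigned long acc = 0;
    unsigned long n = *(unsigned long *)ctx;
    for (unsigned long i = 0; i < n; i++)
        acc += i;
}
```

Calling `time_one_frame_us(decode_stub, &n)` with an `unsigned long n` iteration count returns the elapsed microseconds for one call; a real harness would pass the codec's own decode function and context instead of the stub.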
In the kernel module case, the memset write access to the vma causes the page fault handler to allocate physical pages for the module's memory. So even though the test case code never writes to its static variables, the module loader has already allocated physical pages for them.
In user space, by contrast, the kernel change "remove ZERO_PAGE" changes the behavior. Under the 2.6.19 kernel there is an optimization for read access to not-yet-allocated anonymous pages, called the ZERO_PAGE: when an anonymous page fault is a read access to unallocated virtual memory, the handler creates a page table entry pointing to the zero page (there is only one zero page in the kernel). Because the test program never writes to the static buffer, every read is served from that single zero page, and the L1 cache speeds up the reads. Because the ZERO_PAGE was removed during the 2.6.24 development cycle (the commit quoted below is dated October 2007), the 2.6.24 kernel allocates a cleared physical page on an anonymous page fault for both read and write access. No zero page exists any more, so reads of the buffer hit different physical pages and the L1 cache cannot help.
The module loader code containing the memset, followed by the page fault handler patch snippet, is listed below:
/* Do the allocs. */
ptr = module_alloc(mod->core_size);
if (!ptr) {
    err = -ENOMEM;
    goto free_percpu;
}
memset(ptr, 0, mod->core_size);
mod->module_core = ptr;
ptr = module_alloc(mod->init_size);
if (!ptr && mod->init_size) {
    err = -ENOMEM;
    goto free_core;
}
memset(ptr, 0, mod->init_size);
mod->module_init = ptr;
@@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
     spinlock_t *ptl;
     pte_t entry;
-    if (write_access) {
-        /* Allocate our own private page. */
-        pte_unmap(page_table);
-
-        if (unlikely(anon_vma_prepare(vma)))
-            goto oom;
-        page = alloc_zeroed_user_highpage_movable(vma, address);
-        if (!page)
-            goto oom;
-
-        entry = mk_pte(page, vma->vm_page_prot);
-        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+    /* Allocate our own private page. */
+    pte_unmap(page_table);
-        page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-        if (!pte_none(*page_table))
-            goto release;
-        inc_mm_counter(mm, anon_rss);
-        lru_cache_add_active(page);
-        page_add_new_anon_rmap(page, vma, address);
-    } else {
-        /* Map the ZERO_PAGE - vm_page_prot is readonly */
-        page = ZERO_PAGE(address);
-        page_cache_get(page);
-        entry = mk_pte(page, vma->vm_page_prot);
+    if (unlikely(anon_vma_prepare(vma)))
+        goto oom;
+    page = alloc_zeroed_user_highpage_movable(vma, address);
+    if (!page)
+        goto oom;
-        ptl = pte_lockptr(mm, pmd);
-        spin_lock(ptl);
-        if (!pte_none(*page_table))
-            goto release;
-        inc_mm_counter(mm, file_rss);
-        page_add_file_rmap(page);
-    }
+    entry = mk_pte(page, vma->vm_page_prot);
+    entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+    page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+    if (!pte_none(*page_table))
+        goto release;
+    inc_mm_counter(mm, anon_rss);
+    lru_cache_add_active(page);
+    page_add_new_anon_rmap(page, vma, address);
     set_pte_at(mm, address, page_table, entry);
     /* No need to invalidate - it was non-present before */
From the patch we can see that the do_anonymous_page handler no longer handles read and write access separately; the ZERO_PAGE mapping is gone.
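To make the effect of the two fault policies concrete, here is a toy model in plain C. It is not kernel code: the enum, the function names and the page counts are assumptions made up for illustration. It only counts how many distinct physical pages back a region whose pages were read but never written.

```c
#include <stddef.h>

/* Toy model, not kernel code: the names below are made up for illustration. */
enum anon_fault_policy {
    POLICY_ZERO_PAGE,    /* 2.6.19-style: read faults map the shared zero page */
    POLICY_ALWAYS_ALLOC  /* 2.6.24-style: every fault gets a private zeroed page */
};

/*
 * Number of distinct physical pages backing a region after
 * `read_only_pages` pages were only ever read and `written_pages`
 * pages were written at least once.
 */
static size_t distinct_physical_pages(enum anon_fault_policy policy,
                                      size_t read_only_pages,
                                      size_t written_pages)
{
    if (policy == POLICY_ZERO_PAGE) {
        /* All read-only pages alias the single zero page. */
        return written_pages + (read_only_pages > 0 ? 1 : 0);
    }
    /* No sharing: one private page per faulted virtual page. */
    return written_pages + read_only_pages;
}

/* Bytes of physical memory touched by the read-only part of the region. */
static size_t read_footprint_bytes(enum anon_fault_policy policy,
                                   size_t read_only_pages,
                                   size_t page_size)
{
    return distinct_physical_pages(policy, read_only_pages, 0) * page_size;
}
```

For a 1 MiB read-only buffer with 4 KiB pages (both assumed figures), the model gives 4096 bytes of physical memory touched under the zero-page policy and 1048576 bytes under the always-allocate policy, which is the difference in cache footprint that the explanation above relies on.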
commit 557ed1fa2620dc119adb86b34c614e152a629a80
Author: Nick Piggin <>
Date: Tue Oct 16 01:24:40 2007 -0700
remove ZERO_PAGE
The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note
A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
(and thus mapcounted and count towards shared rss). These writes to
the struct page could cause excessive cacheline bouncing on big
systems. There are a number of ways this could be addressed if it is
an issue.
And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).
There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely
I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.
Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).
As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.
When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.
Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.
The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
much use to them except on benchmarks. All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.
Signed-off-by: Nick Piggin <>
Signed-off-by: Andrew Morton <>
Signed-off-by: Linus "snif" Torvalds <>
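The read-only access pattern of the test case discussed at the start of this post can be approximated in user space with a sketch like the one below. It is not the original unit test: the buffer size, the stride and all function names are assumptions. It sweeps a static, never-written buffer once to fault the pages in, then times further read-only sweeps. The timing depends on the kernel version and the hardware, so no expected numbers are given.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() in strict ISO C modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define BUF_SIZE (8u * 1024u * 1024u) /* assumed; the original test's size is not given */
#define STRIDE   64u                  /* assumed cache-line size in bytes */

/* Zero-initialized static buffer: placed in .bss and never written. */
static unsigned char big_buf[BUF_SIZE];

/* Read one byte per stride; volatile stops the compiler from folding the loads away. */
static unsigned long read_pass(const unsigned char *buf, size_t len)
{
    const volatile unsigned char *p = buf;
    unsigned long sum = 0;
    for (size_t i = 0; i < len; i += STRIDE)
        sum += p[i];
    return sum;
}

/* Time `passes` read-only sweeps over big_buf, in microseconds. */
static int64_t time_read_passes_us(int passes, unsigned long *sum_out)
{
    struct timespec t0, t1;
    /* First sweep takes the page faults so they are not part of the timing. */
    unsigned long sum = read_pass(big_buf, sizeof big_buf);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < passes; i++)
        sum += read_pass(big_buf, sizeof big_buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sum_out = sum;
    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000
         + (t1.tv_nsec - t0.tv_nsec) / 1000;
}
```

Calling `time_read_passes_us(100, &sum)` on each kernel and comparing the returned durations gives a rough view of the gap; `sum` stays 0 because the buffer is never written.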