Category: LINUX
2012-05-27 18:42:43
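First-generation Unix systems implemented process creation in a rather naive way: when a fork() system call was issued, the kernel duplicated the whole parent address space verbatim and assigned the copy to the child process. This was very time consuming, because it required:
- allocating page frames for the child's page tables
- allocating page frames for the child's pages
- initializing the child's page tables
- copying the parent's pages into the corresponding pages of the child
Creating an address space this way involves many memory accesses, uses up many CPU cycles, and completely trashes the contents of the hardware caches. In most cases the work is also pointless, because many child processes begin execution by loading a new program, thereby discarding the inherited address space entirely.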
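Modern Linux kernels use a more efficient approach called Copy On Write (COW). The idea is quite simple: instead of duplicating page frames, the parent and the child share them. As long as they are shared, however, the page frames cannot be modified, that is, they are write-protected. Whenever the parent or the child tries to write to a shared page frame, an exception occurs; the kernel then copies the page into a new page frame and marks the new frame writable. The original page frame remains write-protected: when the other process tries to write to it, the kernel checks whether the writing process is the only owner of the page frame and, if so, marks the page frame writable for that process.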
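The effect described above can be observed from user space. The following is a minimal sketch, assuming a POSIX system; the function name fork_write_is_private() is made up for this illustration. It only demonstrates the visible result (a write made after fork() stays private to the writer), not the kernel mechanism that produces it.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a write made by the child after
 * fork() is invisible to the parent, 0 if not, and -1 on error.
 */
int fork_write_is_private(void)
{
    size_t len = 4096;
    char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 'P', len);          /* the parent touches the buffer first */

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Child: the first write to an inherited page. */
        buf[0] = 'C';
        /* The child sees its own write and the inherited data. */
        _exit((buf[0] == 'C' && buf[1] == 'P') ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        free(buf);
        return -1;
    }
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int parent_unchanged = (buf[0] == 'P');   /* the parent's copy is intact */
    free(buf);
    return child_ok && parent_unchanged;
}
```

Calling fork_write_is_private() from a test program is expected to return 1 on a system with ordinary fork() semantics.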
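The _count field of the page descriptor keeps track of the number of processes sharing the corresponding page frame. Whenever a process releases a page frame or a Copy On Write is performed on it, its _count field is decremented; the page frame is freed only when _count becomes -1, as covered in an earlier post.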
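The counting convention above (-1 means free, 0 means a single user) can be sketched with a toy structure. This is only an illustration: struct toy_page and the toy_* helpers are hypothetical names and are not the kernel's struct page or its get_page()/put_page().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page descriptor, NOT the kernel's struct page. */
struct toy_page {
    int count;   /* -1: frame is free, 0: one user, n: n + 1 users */
};

/* A frame starts out free. */
void toy_page_init(struct toy_page *p)
{
    p->count = -1;
}

/* A new user takes a reference to the frame. */
void toy_get_page(struct toy_page *p)
{
    p->count++;
}

/* A user drops its reference; returns true when the frame becomes free. */
bool toy_put_page(struct toy_page *p)
{
    assert(p->count >= 0);   /* dropping a reference on a free frame is a bug */
    p->count--;
    return p->count == -1;
}

/* Number of current users under this convention. */
int toy_page_users(const struct toy_page *p)
{
    return p->count + 1;
}
```

With two users (say, a parent and a child), the first toy_put_page() leaves the counter at 0 and the second one brings it to -1, at which point the frame would be released.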
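Let us now see how Linux implements Copy On Write. Recall from the previous post what handle_pte_fault() does once it determines that the page fault was caused by an access to a page that is already present in memory: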
    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);
    if (unlikely(!pte_same(*pte, entry)))
        goto unlock;
    if (write_access) {
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address,
                    pte, pmd, ptl, entry);
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    if (!pte_same(old_entry, entry)) {
        ptep_set_access_flags(vma, address, pte, entry, write_access);
        update_mmu_cache(vma, address, entry);
        lazy_mmu_prot_update(entry);
    } else {
        /*
         * This is needed only for protection faults but the arch code
         * is not yet telling us if this is a protection fault or not.
         * This still avoids useless tlb flushes for .text page faults
         * with threads.
         */
        if (write_access)
            flush_tlb_page(vma, address);
    }
unlock:
    pte_unmap_unlock(pte, ptl);
    return VM_FAULT_MINOR;
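The handle_pte_fault() function is architecture-independent: it considers any possible violation of the page access rights. On the 80x86 architecture, however, if the page is present, the fault can only have been caused by a write access (write_access = 1) to a write-protected page frame (see the earlier post on handling a faulty address inside the address space). Therefore, do_wp_page() is always invoked.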
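The decision made by the excerpt above for a page that is present can be condensed into a small sketch. The types and names here (struct toy_fault_pte, enum toy_action, toy_handle_present_pte()) are hypothetical and only mirror the branch structure of the kernel code; they are not kernel interfaces.

```c
#include <stdbool.h>

/* What the fault handler does next, in this toy model. */
enum toy_action {
    TOY_DO_WP_PAGE,        /* write to a write-protected page: go to COW */
    TOY_MARK_DIRTY_YOUNG,  /* write to a writable page: set dirty + accessed */
    TOY_MARK_YOUNG         /* read access: set accessed only */
};

/* Toy page table entry holding only the bit this decision depends on. */
struct toy_fault_pte {
    bool writable;
};

/* Mirrors the if (write_access) { if (!pte_write(entry)) ... } structure. */
enum toy_action toy_handle_present_pte(struct toy_fault_pte pte,
                                       bool write_access)
{
    if (write_access) {
        if (!pte.writable)
            return TOY_DO_WP_PAGE;
        return TOY_MARK_DIRTY_YOUNG;
    }
    return TOY_MARK_YOUNG;
}
```

On 80x86, according to the text above, only the first outcome is reached for a present page, because the hardware raises the fault only when a write hits a write-protected entry.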
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, pte_t *page_table, pmd_t *pmd,
        spinlock_t *ptl, pte_t orig_pte)
{
    struct page *old_page, *new_page;
    pte_t entry;
    int reuse = 0, ret = VM_FAULT_MINOR;
    struct page *dirty_page = NULL;
    int dirty_pte = 0;

    old_page = vm_normal_page(vma, address, orig_pte);
    if (!old_page)
        goto gotten;

    /*
     * Take out anonymous pages first, anonymous shared vmas are
     * not dirty accountable.
     */
    if (PageAnon(old_page)) {
        if (TestSetPageLocked(old_page)) {
            page_cache_get(old_page);
            pte_unmap_unlock(page_table, ptl);
            lock_page(old_page);
            page_table = pte_offset_map_lock(mm, pmd, address,
                             &ptl);
            if (!pte_same(*page_table, orig_pte)) {
                unlock_page(old_page);
                page_cache_release(old_page);
                goto unlock;
            }
            page_cache_release(old_page);
        }
        reuse = can_share_swap_page(old_page);
        unlock_page(old_page);
    } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
                    (VM_WRITE|VM_SHARED))) {
        /*
         * Only catch write-faults on shared writable pages,
         * read-only shared pages can get COWed by
         * get_user_pages(.write=1, .force=1).
         */
        vfs_check_frozen(vma->vm_file->f_dentry->d_inode->i_sb,
                 SB_FREEZE_WRITE);
        if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
            /*
             * Notify the address space that the page is about to
             * become writable so that it can prohibit this or wait
             * for the page to get into an appropriate state.
             *
             * We do this without the lock held, so that it can
             * sleep if it needs to.
             */
            page_cache_get(old_page);
            pte_unmap_unlock(page_table, ptl);

            if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
                goto unwritable_page;

            page_cache_release(old_page);

            /*
             * Since we dropped the lock we need to revalidate
             * the PTE as someone else may have changed it. If
             * they did, we just return, as we can count on the
             * MMU to tell us if they didn't also make it writable.
             */
            page_table = pte_offset_map_lock(mm, pmd, address,
                             &ptl);
            if (!pte_same(*page_table, orig_pte))
                goto unlock;
        }
        dirty_page = old_page;
        get_page(dirty_page);
        reuse = 1;
    }

    if (reuse) {
        flush_cache_page(vma, address, pte_pfn(orig_pte));
        entry = pte_mkyoung(orig_pte);
        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        dirty_pte++;
        ptep_set_access_flags(vma, address, page_table, entry, 1);
        update_mmu_cache(vma, address, entry);
        lazy_mmu_prot_update(entry);
        ret |= VM_FAULT_WRITE;
        goto unlock;
    }

    /*
     * Ok, we need to copy. Oh, well..
     */
    page_cache_get(old_page);
gotten:
    pte_unmap_unlock(page_table, ptl);

    if (unlikely(anon_vma_prepare(vma)))
        goto oom;
    if (old_page == ZERO_PAGE(address)) {
        new_page = alloc_zeroed_user_highpage(vma, address);
        if (!new_page)
            goto oom;
    } else {
        new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
        if (!new_page)
            goto oom;
        cow_user_page(new_page, old_page, address);
    }

    /*
     * Re-check the pte - we dropped the lock
     */
    page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
    if (likely(pte_same(*page_table, orig_pte))) {
        if (old_page) {
            page_remove_rmap(old_page);
            if (!PageAnon(old_page)) {
                dec_mm_counter(mm, file_rss);
                inc_mm_counter(mm, anon_rss);
                trace_mm_filemap_cow(mm, address, new_page);
            }
        } else {
            inc_mm_counter(mm, anon_rss);
            trace_mm_anon_cow(mm, address, new_page);
        }
        flush_cache_page(vma, address, pte_pfn(orig_pte));
        entry = mk_pte(new_page, vma->vm_page_prot);
        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        dirty_pte++;
        lazy_mmu_prot_update(entry);
        /*
         * Clear the pte entry and flush it first, before updating the
         * pte with the new entry. This will avoid a race condition
         * seen in the presence of one thread doing SMC and another
         * thread doing COW.
         */
        ptep_clear_flush_notify(vma, address, page_table);
        set_pte_at(mm, address, page_table, entry);
        update_mmu_cache(vma, address, entry);
        lru_cache_add_active(new_page);
        page_add_new_anon_rmap(new_page, vma, address);

        /* Free the old page.. */
        new_page = old_page;
        ret |= VM_FAULT_WRITE;
    }
    if (new_page)
        page_cache_release(new_page);
    if (old_page)
        page_cache_release(old_page);
unlock:
    pte_unmap_unlock(page_table, ptl);
    if (dirty_page) {
        if (flush_mmap_pages || !dirty_pte)
            set_page_dirty_balance(dirty_page);
        put_page(dirty_page);
    }
    return ret;
oom:
    if (old_page)
        page_cache_release(old_page);
    return VM_FAULT_OOM;

unwritable_page:
    page_cache_release(old_page);
    return VM_FAULT_SIGBUS;
}
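The do_wp_page() function (to keep the explanation simple, we again skip the statements that deal with reverse mapping) first obtains the descriptor of the page frame involved in the page fault, that is, the page frame referenced by the faulting page table entry. The excerpts that follow are simplified and differ in detail from the listing above; for example, they take mm->page_table_lock rather than the ptl lock passed in as a parameter.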
old_page = vm_normal_page(vma, address, orig_pte);
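Next, the function determines whether the page really needs to be copied. If only one process owns the page, Copy On Write does not apply and the process should be free to write to the page. Specifically, the function reads the _count field of the page descriptor: if it is equal to 0 (a single owner), no copy is made.
In practice the check is slightly more complicated, because the _count field is also incremented when the page is inserted into the swap cache and when the PG_private flag of the page descriptor is set. In any case, when Copy On Write is not needed, the page frame is marked writable so that further write attempts do not cause more page faults: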
set_pte(page_table, maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),vma));
flush_tlb_page(vma, address);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
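If two or more processes share the page frame through Copy On Write, the function copies the contents of the old page frame (old_page) into a newly allocated one (new_page). To avoid race conditions, it calls get_page() to increment the usage counter of old_page before starting the copy: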
old_page = pte_page(pte);
pte_unmap(page_table);
get_page(old_page);
spin_unlock(&mm->page_table_lock);
if (old_page == virt_to_page(empty_zero_page)) {
    new_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
} else {
    new_page = alloc_page(GFP_HIGHUSER);
    vfrom = kmap_atomic(old_page, KM_USER0);
    vto = kmap_atomic(new_page, KM_USER1);
    copy_page(vto, vfrom);
    kunmap_atomic(vfrom, KM_USER0);
    kunmap_atomic(vto, KM_USER1);
}
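If the old page is the zero page, the new frame is filled with zeros when it is allocated (the __GFP_ZERO flag). Otherwise, the contents are copied with the copy_page() macro. Special handling of the zero page is not strictly required, but it does improve performance: making fewer address references preserves the microprocessor's hardware cache.
Because allocating a page frame may block the process, the function checks whether the page table entry has been modified since the function started (that is, whether pte and *page_table have different values). If it has, the new page frame is released, the usage counter of old_page is decremented (undoing the earlier increment), and the function terminates.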
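The "drop the lock, allocate, retake the lock, re-check the entry" pattern can be sketched as follows. This is a single-threaded simulation under stated assumptions: struct toy_mm, toy_pte_t, the racer hook, and toy_install_after_alloc() are all hypothetical names, and an integer flag stands in for the real page table lock.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long toy_pte_t;

/* Toy address space with a single page table entry. */
struct toy_mm {
    toy_pte_t pte;   /* the entry the fault is being handled for */
    int locked;      /* stand-in for the page table lock */
};

/* Hook that simulates what another thread does while the lock is dropped. */
typedef void (*toy_racer_fn)(struct toy_mm *mm);

/* Example racer: rewrites the entry behind the fault handler's back. */
void toy_racer_changes_pte(struct toy_mm *mm)
{
    mm->pte = 0x3000UL;
}

/*
 * Returns 1 if new_pte was installed, 0 if the entry changed while the
 * lock was dropped (the caller would then free its new page and return).
 */
int toy_install_after_alloc(struct toy_mm *mm, toy_pte_t orig_pte,
                            toy_pte_t new_pte, toy_racer_fn racer)
{
    assert(mm->locked);      /* entered with the lock held */
    mm->locked = 0;          /* drop it: the allocation may sleep */
    if (racer != NULL)
        racer(mm);           /* anything can happen in this window */
    mm->locked = 1;          /* retake the lock */

    if (mm->pte != orig_pte) /* same role as the pte_same() re-check */
        return 0;
    mm->pte = new_pte;
    return 1;
}
```

Without a racer the new entry is installed; with toy_racer_changes_pte() the function backs off and leaves the other thread's value in place.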
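If everything looks fine, the physical address of the new page frame is finally written into the page table entry, and the corresponding TLB entry is invalidated: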
spin_lock(&mm->page_table_lock);
entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page,
vma->vm_page_prot)),vma);
set_pte(page_table, entry);
flush_tlb_page(vma, address);
lru_cache_add_active(new_page);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
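The lru_cache_add_active() function inserts the new page frame into the swap-related data structures.
Finally, do_wp_page() decrements the usage counter of old_page twice. The first decrement undoes the safety increment made before copying the page frame contents; the second reflects the fact that the current process no longer owns the page frame.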
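To tie the steps together, here is a user-space toy model of the whole write-fault path: reuse the frame when there is a single owner, otherwise copy it and drop the reference to the old frame. It is a sketch under simplifying assumptions, not the kernel's code: struct toy_frame, struct toy_map, and the toy_* functions are hypothetical, there is no locking, and the temporary safety reference taken before the copy is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Toy page frame; count follows the -1 free / 0 single owner convention. */
struct toy_frame {
    int count;
    unsigned char data[TOY_PAGE_SIZE];
};

/* Toy page table entry: which frame is mapped and whether it is writable. */
struct toy_map {
    struct toy_frame *frame;
    int writable;
};

/* fork(): share the parent's frame with the child, write-protect both. */
void toy_fork_share(struct toy_map *parent, struct toy_map *child)
{
    parent->writable = 0;
    child->frame = parent->frame;
    child->writable = 0;
    parent->frame->count++;
}

/* Write fault on a write-protected entry. Returns 0, or -1 if out of memory. */
int toy_do_wp_page(struct toy_map *map)
{
    struct toy_frame *old_frame = map->frame;

    if (old_frame->count == 0) {          /* sole owner: no copy needed */
        map->writable = 1;
        return 0;
    }

    struct toy_frame *copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return -1;
    copy->count = 0;                      /* one owner: the faulting process */
    memcpy(copy->data, old_frame->data, TOY_PAGE_SIZE);

    map->frame = copy;                    /* install the private copy */
    map->writable = 1;
    old_frame->count--;                   /* this process no longer maps it */
    return 0;
}

/* Store one byte, taking the simulated write fault first if needed. */
int toy_write(struct toy_map *map, size_t off, unsigned char value)
{
    assert(off < TOY_PAGE_SIZE);
    if (!map->writable && toy_do_wp_page(map) != 0)
        return -1;
    map->frame->data[off] = value;
    return 0;
}
```

In this model, the first writer after toy_fork_share() gets a private copy, while the remaining process becomes the sole owner and simply has its entry made writable on its next write, which matches the two cases walked through above.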