Found this online; it is about as authoritative a source as you can get :)
From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 20:12:32 UTC
Message-ID:
Original-Message-ID:

On Mon, 1 Aug 2005, Hugh Dickins wrote:
>
> > Aside, that brings up an interesting question - why should readonly
> > mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> > write access while readonly mappings of readonly files not? Or am I
> > horribly confused?
>
> Either you or I. You'll have to spell that out to me in more detail,
> I don't see it that way.

We have always just done a COW if it's read-only - even if it's shared.

The point being that if a process mapped did a read-only mapping, and a
tracer wants to modify memory, the tracer is always allowed to do so, but
it's _not_ going to write anything back to the filesystem. Writing
something back to an executable just because the user happened to mmap it
with MAP_SHARED (but read-only) _and_ the user had the right to write to
that fd is _not_ ok.

So VM_MAYWRITE is totally immaterial. We _will_not_write_ (and must not do
so) to the backing store through ptrace unless it was literally a writable
mapping (in which case VM_WRITE will be set, and the page table should be
marked writable in the first case).

So we have two choices:

 - not allow the write at all in ptrace (which I think we did at some
   point)

   This ends up being really inconvenient, and people seem to really
   expect to be able to write to readonly areas in debuggers. And doing
   "MAP_SHARED, PROT_READ" seems to be a common thing (Linux has supported
   that pretty much since day #1 when mmap was supported - long before
   writable shared mappings were supported, Linux accepted MAP_SHARED +
   PROT_READ not just because we could, but because Unix apps do use it).

or

 - turn a shared read-only page into a private page on ptrace write

   This is what we've been doing. It's strange, and it _does_ change
   semantics (it's not shared any more, so the debugger writing to it
   means that now you don't see changes to that file by others), so it's
   clearly not "correct" either, but it's certainly a million times better
   than writing out breakpoints to shared files..

At some point (for the longest time), when a debugger was used to modify a
read-only page, we also made it writable to the user, which was much
easier from a VM standpoint. Now we have this "maybe_mkwrite()" thing,
which is part of the reason for this particular problem.

Using the dirty flag for a "page is _really_ writable" is admittedly kind
of hacky, but it does have the advantage of working even when the -real-
write bit isn't set due to "maybe_mkwrite()". If it forces the s390 people
to add some more hacks for their strange VM, so be it..

[ Btw, on a totally unrelated note: anybody who is a git user and looks
  for when this maybe_mkwrite() thing happened, just doing

	git-whatchanged -p -Smaybe_mkwrite mm/memory.c

  in the bkcvs conversion pinpoints it immediately. Very useful git trick
  in case you ever have that kind of question. ]

I added Martin Schwidefsky to the Cc: explicitly, so that he can ping
whoever in the s390 team needs to figure out what the right thing is for
s390 and the dirty bit semantic change. Thanks for pointing it out.

		Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 22:00:09 UTC
Message-ID:
Original-Message-ID:

On Mon, 1 Aug 2005, Hugh Dickins wrote:
> >
> > We have always just done a COW if it's read-only - even if it's shared.
> >
> > The point being that if a process mapped did a read-only mapping, and a
> > tracer wants to modify memory, the tracer is always allowed to do so, but
> > it's _not_ going to write anything back to the filesystem. Writing
> > something back to an executable just because the user happened to mmap it
> > with MAP_SHARED (but read-only) _and_ the user had the right to write to
> > that fd is _not_ ok.
>
> I'll need to think that through, but not right now. It's a surprise
> to me, and it's likely to surprise the current kernel too.

Well, even if you did the write-back if VM_MAYWRITE is set, you'd still
have the case of having MAP_SHARED, PROT_READ _without_ VM_MAYWRITE being
set, and I'd expect that to actually be the common one (since you'd
normally use O_RDONLY to open a fd that you only want to map for reading).

And as mentioned, MAP_SHARED+PROT_READ does actually happen in real life.
Just do a google search on "MAP_SHARED PROT_READ -PROT_WRITE" and you'll
get tons of hits. For good reason too - because MAP_PRIVATE isn't actually
coherent on several old UNIXes.

So you'd still have to convert such a case to a COW mapping, so it's not
like you can avoid it.

Of course, if VM_MAYWRITE is not set, you could just convert it silently
to a MAP_PRIVATE at the VM level (that's literally what we used to do,
back when we didn't support writable shared mappings at all, all those
years ago), so at least now the COW behaviour would match the vma_flags.

> I'd prefer to say that if the executable was mapped shared from a writable fd,
> then the tracer will write back to it; but you're clearly against that.

Absolutely. I can just see somebody mapping an executable MAP_SHARED and
PROT_READ, and something as simple as doing a breakpoint while debugging
causing system-wide trouble.

I really don't think that's acceptable.

And I'm not making it up - add PROT_EXEC to the google search around, and
watch it being done exactly that way. Several of the hits mention shared
libraries too.

I strongly suspect that almost all cases will be opened with O_RDONLY, but
still..

		Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 22:10:20 UTC
Message-ID:
Original-Message-ID:

On Mon, 1 Aug 2005, Linus Torvalds wrote:
>
> Of course, if VM_MAYWRITE is not set, you could just convert it silently
> to a MAP_PRIVATE at the VM level (that's literally what we used to do,
> back when we didn't support writable shared mappings at all, all those
> years ago), so at least now the COW behaviour would match the vma_flags.

Heh. I just checked. We still do exactly that:

	if (!(file->f_mode & FMODE_WRITE))
		vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

some code never dies ;)

However, we still set the VM_MAYSHARE bit, and thats' the one that
mm/rmap.c checks for some reason. I don't see quite why - VM_MAYSHARE
doesn't actually ever do anything else than make sure that we try to
allocate a mremap() mapping in a cache-coherent space, I think (ie it's a
total no-op on any sane architecture, and as far as rmap is concerned on
all of them).

		Linus

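Reading between the lines, here is what I took away from the thread, or rather, what I came to understand.

The mapping implementation distinguishes between the read/write permissions of the backing store and the read/write permissions of the mapping itself.

The former is decided by the open() that precedes mmap(): flags such as O_RDONLY are converted and stored in file->f_mode as FMODE_XXX. The latter is the PROT_XXX passed to mmap(). A third category is the mmap() flags, MAP_XXX; the ones that matter for permissions are mainly MAP_SHARED and MAP_PRIVATE.

First, the VM_MAYXXX family:

- readonly mappings of writeable files (with VM_MAYWRITE set)

So a read-only mapping of a writable file is PROT_READ (or VM_READ) | VM_MAYWRITE.

A read-only mapping of a read-only file should then be PROT_READ (or VM_READ) | VM_MAYREAD, without VM_MAYWRITE (at least for MAP_SHARED, going by the kernel code quoted further down).

How should a combination like MAP_SHARED | PROT_READ | VM_MAYWRITE be handled? That seems to be the subject of the thread above. I cannot speak to the other aspects, but according to the thread ptrace is allowed to write, because gdb needs to write to read-only regions, for example to plant a breakpoint in the code segment. The write must not be synced back to disk, though, the reason being that PROT_READ carries no write permission. The implementation is that for this combination MAP_SHARED is ignored and the page becomes a COW page. That in turn contradicts MAP_SHARED: "so it's clearly not "correct" either, but it's certainly a million times better than writing out breakpoints to shared files".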
Relevant code:
	vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
	if (!vma)
		return -ENOMEM;

	vma->vm_mm = mm;
	vma->vm_start = addr;
	vma->vm_end = addr + len;
	vma->vm_flags = vm_flags(prot,flags) | mm->def_flags;

	if (file) {
		VM_ClearReadHint(vma);
		vma->vm_raend = 0;

		if (file->f_mode & FMODE_READ)
			vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
		if (flags & MAP_SHARED) {
			vma->vm_flags |= VM_SHARED | VM_MAYSHARE;

			/* This looks strange, but when we don't have the file open
			 * for writing, we can demote the shared mapping to a simpler
			 * private mapping. That also takes care of a security hole
			 * with ptrace() writing to a shared mapping without write
			 * permissions.
			 *
			 * We leave the VM_MAYSHARE bit on, just to get correct output
			 * from /proc/xxx/maps..
			 */
			if (!(file->f_mode & FMODE_WRITE))
				vma->vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
		}
	} else {
		vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
		if (flags & MAP_SHARED)
			vma->vm_flags |= VM_SHARED | VM_MAYSHARE;
	}
	vma->vm_page_prot = protection_map[vma->vm_flags & 0x0f];
	vma->vm_ops = NULL;
	vma->vm_pgoff = pgoff;
	vma->vm_file = NULL;
	vma->vm_private_data = NULL;
/proc/pid/maps prints a permission field such as rwxp (or rwxs): readable, writable, executable, and then private (p) or shared (s).
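One more point: why does COW prevent the write from reaching the disk? The mechanism is as follows.
When the COW happens, the newly allocated page is not added to the address_space of vma->vm_file, so page->mapping is NULL. When that page is later swapped out, it is therefore written to the swap area as an anonymous page rather than to the file.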
The code is as follows:
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
	unsigned long address, pte_t *page_table, pte_t pte)
{
	struct page *old_page, *new_page;

	old_page = pte_page(pte);
	if (!VALID_PAGE(old_page))
		goto bad_wp_page;

	/*
	 * We can avoid the copy if:
	 * - we're the only user (count == 1)
	 * - the only other user is the swap cache,
	 *   and the only swap cache user is itself,
	 *   in which case we can just continue to
	 *   use the same swap cache (it will be
	 *   marked dirty).
	 */
	switch (page_count(old_page)) {
	case 2:
		/*
		 * Lock the page so that no one can look it up from
		 * the swap cache, grab a reference and start using it.
		 * Can not do lock_page, holding page_table_lock.
		 */
		if (!PageSwapCache(old_page) || TryLockPage(old_page))
			break;
		if (is_page_shared(old_page)) {
			UnlockPage(old_page);
			break;
		}
		UnlockPage(old_page);
		/* FallThrough */
	case 1:
		flush_cache_page(vma, address);
		establish_pte(vma, address, page_table, pte_mkyoung(pte_mkdirty(pte_mkwrite(pte))));
		spin_unlock(&mm->page_table_lock);
		return 1;	/* Minor fault */
	}

	/*
	 * Ok, we need to copy. Oh, well..
	 */
	spin_unlock(&mm->page_table_lock);
	new_page = page_cache_alloc();
	if (!new_page)
		return -1;
	spin_lock(&mm->page_table_lock);

	/*
	 * Re-check the pte - we dropped the lock
	 */
	if (pte_same(*page_table, pte)) {
		if (PageReserved(old_page))
			++mm->rss;
		break_cow(vma, old_page, new_page, address, page_table);

		/* Free the old page.. */
		new_page = old_page;
	}
	spin_unlock(&mm->page_table_lock);
	page_cache_release(new_page);
	return 1;	/* Minor fault */

bad_wp_page:
	spin_unlock(&mm->page_table_lock);
	printk("do_wp_page: bogus page at address %08lx (page 0x%lx)\n",address,(unsigned long)old_page);
	return -1;
}