Category: LINUX

2012-01-16 16:03:44

I found this online, and it is about as authoritative as a source gets :)
From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 20:12:32 UTC
Message-ID:
Original-Message-ID:

On Mon, 1 Aug 2005, Hugh Dickins wrote:
>
> > Aside, that brings up an interesting question - why should readonly
> > mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> > write access while readonly mappings of readonly files not? Or am I
> > horribly confused?
>
> Either you or I. You'll have to spell that out to me in more detail,
> I don't see it that way.

We have always just done a COW if it's read-only - even if it's shared.

The point being that if a process mapped did a read-only mapping, and a
tracer wants to modify memory, the tracer is always allowed to do so, but
it's _not_ going to write anything back to the filesystem. Writing
something back to an executable just because the user happened to mmap it
with MAP_SHARED (but read-only) _and_ the user had the right to write to
that fd is _not_ ok.

So VM_MAYWRITE is totally immaterial. We _will_not_write_ (and must not do
so) to the backing store through ptrace unless it was literally a writable
mapping (in which case VM_WRITE will be set, and the page table should be
marked writable in the first case).

So we have two choices:

 - not allow the write at all in ptrace (which I think we did at some
   point)

   This ends up being really inconvenient, and people seem to really
   expect to be able to write to readonly areas in debuggers. And doing
   "MAP_SHARED, PROT_READ" seems to be a common thing (Linux has supported
   that pretty much since day #1 when mmap was supported - long before
   writable shared mappings were supported, Linux accepted MAP_SHARED +
   PROT_READ not just because we could, but because Unix apps do use it).

or

 - turn a shared read-only page into a private page on ptrace write

   This is what we've been doing. It's strange, and it _does_ change
   semantics (it's not shared any more, so the debugger writing to it
   means that now you don't see changes to that file by others), so it's
   clearly not "correct" either, but it's certainly a million times better
   than writing out breakpoints to shared files..

At some point (for the longest time), when a debugger was used to modify a
read-only page, we also made it writable to the user, which was much
easier from a VM standpoint. Now we have this "maybe_mkwrite()" thing,
which is part of the reason for this particular problem.

Using the dirty flag for a "page is _really_ writable" is admittedly kind
of hacky, but it does have the advantage of working even when the -real-
write bit isn't set due to "maybe_mkwrite()". If it forces the s390 people
to add some more hacks for their strange VM, so be it..

[ Btw, on a totally unrelated note: anybody who is a git user and looks
  for when this maybe_mkwrite() thing happened, just doing

      git-whatchanged -p -Smaybe_mkwrite mm/memory.c

  in the bkcvs conversion pinpoints it immediately. Very useful git trick
  in case you ever have that kind of question. ]

I added Martin Schwidefsky to the Cc: explicitly, so that he can ping
whoever in the s390 team needs to figure out what the right thing is for
s390 and the dirty bit semantic change. Thanks for pointing it out.

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 22:00:09 UTC
Message-ID:
Original-Message-ID:

On Mon, 1 Aug 2005, Hugh Dickins wrote:
> >
> > We have always just done a COW if it's read-only - even if it's shared.
> >
> > The point being that if a process mapped did a read-only mapping, and a
> > tracer wants to modify memory, the tracer is always allowed to do so, but
> > it's _not_ going to write anything back to the filesystem. Writing
> > something back to an executable just because the user happened to mmap it
> > with MAP_SHARED (but read-only) _and_ the user had the right to write to
> > that fd is _not_ ok.
>
> I'll need to think that through, but not right now. It's a surprise
> to me, and it's likely to surprise the current kernel too.

Well, even if you did the write-back if VM_MAYWRITE is set, you'd still
have the case of having MAP_SHARED, PROT_READ _without_ VM_MAYWRITE being
set, and I'd expect that to actually be the common one (since you'd
normally use O_RDONLY to open a fd that you only want to map for reading).

And as mentioned, MAP_SHARED+PROT_READ does actually happen in real life.
Just do a google search on "MAP_SHARED PROT_READ -PROT_WRITE" and you'll
get tons of hits. For good reason too - because MAP_PRIVATE isn't actually
coherent on several old UNIXes.

So you'd still have to convert such a case to a COW mapping, so it's not
like you can avoid it.

Of course, if VM_MAYWRITE is not set, you could just convert it silently
to a MAP_PRIVATE at the VM level (that's literally what we used to do,
back when we didn't support writable shared mappings at all, all those
years ago), so at least now the COW behaviour would match the vma_flags.

> I'd prefer to say that if the executable was mapped shared from a writable fd,
> then the tracer will write back to it; but you're clearly against that.

Absolutely. I can just see somebody mapping an executable MAP_SHARED and
PROT_READ, and something as simple as doing a breakpoint while debugging
causing system-wide trouble.

I really don't think that's acceptable.

And I'm not making it up - add PROT_EXEC to the google search around, and
watch it being done exactly that way. Several of the hits mention shared
libraries too.

I strongly suspect that almost all cases will be opened with O_RDONLY, but
still..

Linus

From: Linus Torvalds
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc4] fix get_user_pages bug
Date: Mon, 01 Aug 2005 22:10:20 UTC
Message-ID:
Original-Message-ID:

On Mon, 1 Aug 2005, Linus Torvalds wrote:
>
> Of course, if VM_MAYWRITE is not set, you could just convert it silently
> to a MAP_PRIVATE at the VM level (that's literally what we used to do,
> back when we didn't support writable shared mappings at all, all those
> years ago), so at least now the COW behaviour would match the vma_flags.

Heh. I just checked. We still do exactly that:

    if (!(file->f_mode & FMODE_WRITE))
        vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

some code never dies ;)

However, we still set the VM_MAYSHARE bit, and thats' the one that
mm/rmap.c checks for some reason. I don't see quite why - VM_MAYSHARE
doesn't actually ever do anything else than make sure that we try to
allocate a mremap() mapping in a cache-coherent space, I think (ie it's a
total no-op on any sane architecture, and as far as rmap is concerned on
all of them).

Linus

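Reading between the lines of the thread, here is what I got out of it:

The mapping implementation distinguishes between the read/write permissions of the backing store and the read/write permissions of the mapping itself.
The former is decided by open() before mmap() is called: O_RDONLY and friends are converted and stored in file->f_mode as FMODE_XXX. The latter is the PROT_XXX passed to mmap(). A third group is the mmap() flags, MAP_XXX; the ones that matter for permissions are mainly MAP_SHARED and MAP_PRIVATE.

First, the VM_MAYXXX family. Take this phrase from the thread:
    readonly mappings of writeable files (with VM_MAYWRITE set)
So a read-only mapping of a writable file is PROT_READ (i.e. VM_READ) | VM_MAYWRITE,
and a read-only mapping of a read-only file should then be PROT_READ (i.e. VM_READ) | VM_MAYREAD.

How should the combination MAP_SHARED | PROT_READ | VM_MAYWRITE be handled? That is the question the thread above debates. I can't speak for other code paths, but according to the thread, ptrace is allowed to write, because gdb needs to write to read-only areas, for example to plant a breakpoint in the text segment. The write must not be synced back to disk, though, because PROT_READ grants no write permission. The implementation handles this combination by ignoring MAP_SHARED and turning the page into a COW page. That in turn contradicts MAP_SHARED, so in Linus's words it's clearly not "correct" either, but it's certainly a million times better than writing out breakpoints to shared files.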
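
To make the rule from the thread concrete for myself, here is a minimal user-space sketch of the decision Linus describes: a ptrace write only reaches the backing file when the mapping is literally writable and shared, and VM_MAYWRITE plays no part. The flag values match the kernel's VM_* bits, but `model_ptrace_poke` and the `POKE_*` names are hypothetical and are not kernel code. It models the 2005 policy as stated in the emails, not necessarily what any particular kernel version does.

```c
/* Same bit values as the kernel's VM_* flags; the rest is a user-space model. */
#define VM_READ     0x01u
#define VM_WRITE    0x02u
#define VM_SHARED   0x08u
#define VM_MAYWRITE 0x20u
#define VM_MAYSHARE 0x80u

/* Hypothetical result type: where a tracer's write ends up. */
enum poke_result {
    POKE_REACHES_FILE,   /* page stays shared, backing store is modified */
    POKE_PRIVATE_COPY    /* page is COWed, backing store is untouched */
};

/*
 * Hypothetical model of the policy in the thread: only a mapping that is
 * both writable (VM_WRITE) and shared (VM_SHARED) lets the write through.
 * VM_MAYWRITE is deliberately not consulted ("totally immaterial").
 */
enum poke_result model_ptrace_poke(unsigned vm_flags)
{
    if ((vm_flags & VM_WRITE) && (vm_flags & VM_SHARED))
        return POKE_REACHES_FILE;
    return POKE_PRIVATE_COPY;
}
```

With this model, a MAP_SHARED + PROT_READ mapping of a writable fd (VM_READ | VM_SHARED | VM_MAYWRITE | VM_MAYSHARE) still gets a private copy, which is exactly the case Hugh and Linus were arguing about.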

Relevant code:
vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
if (!vma)
    return -ENOMEM;

vma->vm_mm = mm;
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_flags = vm_flags(prot,flags) | mm->def_flags;

if (file) {
    VM_ClearReadHint(vma);
    vma->vm_raend = 0;

    if (file->f_mode & FMODE_READ)
        vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
    if (flags & MAP_SHARED) {
        vma->vm_flags |= VM_SHARED | VM_MAYSHARE;

        /* This looks strange, but when we don't have the file open
         * for writing, we can demote the shared mapping to a simpler
         * private mapping. That also takes care of a security hole
         * with ptrace() writing to a shared mapping without write
         * permissions.
         *
         * We leave the VM_MAYSHARE bit on, just to get correct output
         * from /proc/xxx/maps..
         */
        if (!(file->f_mode & FMODE_WRITE))
            vma->vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
    }
} else {
    vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
    if (flags & MAP_SHARED)
        vma->vm_flags |= VM_SHARED | VM_MAYSHARE;
}
vma->vm_page_prot = protection_map[vma->vm_flags & 0x0f];
vma->vm_ops = NULL;
vma->vm_pgoff = pgoff;
vma->vm_file = NULL;
vma->vm_private_data = NULL;
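
To check my reading of the file-backed branch above, I find it helpful to replay it in user space. The sketch below is only a model under my own assumptions: `model_file_vm_flags` is a hypothetical name, the PROT_* to VM_* translation is spelled out by hand instead of calling the kernel's `vm_flags()` helper, and `mm->def_flags` is ignored. The VM_* and FMODE_* values mirror the kernel's bits.

```c
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE, PROT_EXEC, MAP_SHARED, MAP_PRIVATE */

/* Same bit values as the kernel's VM_* and FMODE_* flags. */
#define VM_READ     0x01u
#define VM_WRITE    0x02u
#define VM_EXEC     0x04u
#define VM_SHARED   0x08u
#define VM_MAYREAD  0x10u
#define VM_MAYWRITE 0x20u
#define VM_MAYEXEC  0x40u
#define VM_MAYSHARE 0x80u
#define FMODE_READ  0x1u
#define FMODE_WRITE 0x2u

/* Hypothetical user-space replay of the "if (file)" branch in the excerpt. */
unsigned model_file_vm_flags(int prot, int flags, unsigned f_mode)
{
    unsigned vm = 0;

    /* What the caller asked for via PROT_*. */
    if (prot & PROT_READ)
        vm |= VM_READ;
    if (prot & PROT_WRITE)
        vm |= VM_WRITE;
    if (prot & PROT_EXEC)
        vm |= VM_EXEC;

    /* What the open mode could ever allow. */
    if (f_mode & FMODE_READ)
        vm |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

    if (flags & MAP_SHARED) {
        vm |= VM_SHARED | VM_MAYSHARE;
        /* No write access to the file: demote to a private (COW) mapping,
         * but keep VM_MAYSHARE so /proc/<pid>/maps still reports it as shared. */
        if (!(f_mode & FMODE_WRITE))
            vm &= ~(VM_MAYWRITE | VM_SHARED);
    }
    return vm;
}
```

For example, `model_file_vm_flags(PROT_READ, MAP_SHARED, FMODE_READ)` yields VM_READ | VM_MAYREAD | VM_MAYEXEC | VM_MAYSHARE: no VM_SHARED and no VM_MAYWRITE, which is the silent demotion Linus mentions in his third mail.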

/proc/pid/maps prints a permission field such as rwxp (or rwxs): readable, writable, executable, and private (p) or shared (s).
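
As a small illustration of how that four-character field relates to the flags, here is a sketch that builds it from a vm_flags value. It assumes, as the kernel comment above says, that the last column comes from VM_MAYSHARE rather than VM_SHARED. `model_maps_perms` is a hypothetical helper written for this post, not the kernel's own function.

```c
/* Same bit values as the kernel's VM_* flags. */
#define VM_READ     0x01u
#define VM_WRITE    0x02u
#define VM_EXEC     0x04u
#define VM_SHARED   0x08u
#define VM_MAYSHARE 0x80u

/*
 * Hypothetical helper: fill out[] with the "rwxp"-style field.
 * out must have room for 5 bytes (4 characters plus the terminator).
 */
void model_maps_perms(unsigned vm_flags, char out[5])
{
    out[0] = (vm_flags & VM_READ)  ? 'r' : '-';
    out[1] = (vm_flags & VM_WRITE) ? 'w' : '-';
    out[2] = (vm_flags & VM_EXEC)  ? 'x' : '-';
    /* VM_MAYSHARE, not VM_SHARED: a demoted read-only shared mapping
     * still shows up as 's'. */
    out[3] = (vm_flags & VM_MAYSHARE) ? 's' : 'p';
    out[4] = '\0';
}
```

So a MAP_SHARED + PROT_READ mapping of an O_RDONLY file, which has lost VM_SHARED, would still be shown with an s in the last column under this model.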

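One more point: why does COW keep the write from reaching the disk? The mechanism is as follows.
During COW, the newly allocated page is not added to the address_space of vma->vm_file, so its page->mapping is NULL. When the page is later swapped out, it is therefore written to the swap area as an anonymous page rather than to the file.
The code: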
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
    unsigned long address, pte_t *page_table, pte_t pte)
{
    struct page *old_page, *new_page;

    old_page = pte_page(pte);
    if (!VALID_PAGE(old_page))
        goto bad_wp_page;

    /*
     * We can avoid the copy if:
     * - we're the only user (count == 1)
     * - the only other user is the swap cache,
     *   and the only swap cache user is itself,
     *   in which case we can just continue to
     *   use the same swap cache (it will be
     *   marked dirty).
     */
    switch (page_count(old_page)) {
    case 2:
        /*
         * Lock the page so that no one can look it up from
         * the swap cache, grab a reference and start using it.
         * Can not do lock_page, holding page_table_lock.
         */
        if (!PageSwapCache(old_page) || TryLockPage(old_page))
            break;
        if (is_page_shared(old_page)) {
            UnlockPage(old_page);
            break;
        }
        UnlockPage(old_page);
        /* FallThrough */
    case 1:
        flush_cache_page(vma, address);
        establish_pte(vma, address, page_table, pte_mkyoung(pte_mkdirty(pte_mkwrite(pte))));
        spin_unlock(&mm->page_table_lock);
        return 1;    /* Minor fault */
    }

    /*
     * Ok, we need to copy. Oh, well..
     */
    spin_unlock(&mm->page_table_lock);
    new_page = page_cache_alloc();
    if (!new_page)
        return -1;
    spin_lock(&mm->page_table_lock);

    /*
     * Re-check the pte - we dropped the lock
     */
    if (pte_same(*page_table, pte)) {
        if (PageReserved(old_page))
            ++mm->rss;
        break_cow(vma, old_page, new_page, address, page_table);

        /* Free the old page.. */
        new_page = old_page;
    }
    spin_unlock(&mm->page_table_lock);
    page_cache_release(new_page);
    return 1;    /* Minor fault */

bad_wp_page:
    spin_unlock(&mm->page_table_lock);
    printk("do_wp_page: bogus page at address %08lx (page 0x%lx)\n",address,(unsigned long)old_page);
    return -1;
}
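
The point about page->mapping can be shown with a toy model. This is only a sketch of my explanation above, not of the real kernel structures: `model_page`, `model_address_space`, `model_break_cow` and `model_writeout_target` are all hypothetical names, and the "page" is just 16 bytes.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16

/* Toy stand-in for a file's address_space. */
struct model_address_space {
    const char *name;
};

/* Toy stand-in for struct page: mapping == NULL means anonymous. */
struct model_page {
    struct model_address_space *mapping;
    unsigned char data[MODEL_PAGE_SIZE];
};

/*
 * Copy the old page's contents into the new page. The new page is NOT
 * attached to the file's address_space, so its mapping stays NULL.
 */
void model_break_cow(const struct model_page *old_page, struct model_page *new_page)
{
    memcpy(new_page->data, old_page->data, MODEL_PAGE_SIZE);
    new_page->mapping = NULL;
}

/* Where the page's contents would go if it were evicted from memory. */
const char *model_writeout_target(const struct model_page *page)
{
    return page->mapping ? page->mapping->name : "swap";
}
```

If a debugger then writes a breakpoint byte into the copy, the original file-backed page is unchanged, and the copy's write-out target is "swap" rather than the file, which is the whole reason the breakpoint never lands on disk in the executable.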

