Category: LINUX
2013-05-11 11:29:00
Original author: Gustavo Duarte
Original article: http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
Previously we looked at how the kernel manages virtual memory for a user process, but files and I/O were left out. This post covers the important and often misunderstood relationship between files and memory and its consequences for performance.
Two serious problems must be solved by the OS when it comes to files. The first one is the mind-blowing slowness of hard drives, and disk seeks in particular, relative to memory. The second is the need to load file contents in physical memory once and share the contents among programs. If you use Process Explorer to poke at Windows processes, you'll see there are ~15MB worth of common DLLs loaded in every process. My Windows box right now is running 100 processes, so without sharing I'd be using up to ~1.5 GB of physical RAM just for common DLLs. No good. Likewise, nearly all Linux programs need ld.so and libc, plus other common libraries.
Happily, both problems can be dealt with in one shot: the page cache, where the kernel stores page-sized chunks of files. To illustrate the page cache, I'll conjure a Linux program named render, which opens file scene.dat and reads it 512 bytes at a time, storing the file contents into a heap-allocated block (a code sketch of this loop follows the steps below). The first read goes like this:
1. render asks the kernel to read the first 512 bytes of scene.dat.
2. The kernel searches the page cache for a 4KB chunk of scene.dat that satisfies the request. Assume the data is not cached yet.
3. The kernel allocates a page frame, issues a read for the first 4KB of scene.dat, and fills the newly allocated page frame with it.
4. The kernel copies the requested 512 bytes from that page frame into the user buffer, and the read() system call completes.
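For concreteness, here is a minimal sketch of what a read loop like render's might look like. This is not the author's actual program: only the file name scene.dat and the 512-byte read size come from the article, while the buffer handling and error checks are illustrative.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK 512  /* render reads the file 512 bytes at a time */

int main(void)
{
    int fd = open("scene.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Heap-allocated block that accumulates the file contents. */
    size_t cap = 1 << 20, used = 0;
    char *heap_block = malloc(cap);

    ssize_t n;
    while ((n = read(fd, heap_block + used, CHUNK)) > 0) {
        used += (size_t)n;            /* each read() is served via the page cache */
        if (used + CHUNK > cap)       /* grow the heap block as needed */
            heap_block = realloc(heap_block, cap *= 2);
    }

    printf("read %zu bytes\n", used);
    free(heap_block);
    close(fd);
    return 0;
}
```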
After 12KB have been read, render's heap and the relevant page frames look thus:
This looks innocent enough, but there's a lot going on. First, even though this program uses regular read calls, three 4KB page frames are now in the page cache storing part of scene.dat. (Translator's note: three chunks of data live in the page cache, while copies of the same data occupy render's heap.) People are sometimes surprised by this, but all regular file I/O happens through the page cache. In x86 Linux, the kernel thinks of a file as a sequence of 4KB chunks. If you read a single byte from a file, the whole 4KB chunk containing the byte you asked for is read from disk and placed into the page cache. This makes sense because sustained disk throughput is pretty good and programs normally read more than just a few bytes from a file region. The page cache knows the position of each 4KB chunk within the file, depicted above as #0, #1, etc. Windows uses 256KB views analogous to pages in the Linux page cache.
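To make the 4KB granularity concrete, here is a tiny worked example (the 4096-byte chunk size comes from the article; the offset value is just an illustration): asking read() for one byte at offset 10000 of a file causes chunk #2, covering bytes 8192 through 12287, to be brought into the page cache.

```c
#include <stdio.h>

#define PAGE_SIZE 4096  /* page cache granularity on x86 Linux */

int main(void)
{
    long offset = 10000;                  /* hypothetical byte we ask read() for */
    long chunk  = offset / PAGE_SIZE;     /* -> chunk #2                         */
    long start  = chunk * PAGE_SIZE;      /* -> 8192: first byte actually loaded */
    long end    = start + PAGE_SIZE - 1;  /* -> 12287: last byte actually loaded */

    printf("offset %ld lives in chunk #%ld (bytes %ld-%ld)\n",
           offset, chunk, start, end);
    return 0;
}
```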
Sadly, in a regular file read the kernel must copy the contents of the page cache into a user buffer, which not only takes CPU time and hurts the CPU caches, but also wastes physical memory with duplicate data. As per the diagram above, the scene.dat contents are stored twice, and each instance of the program would store the contents an additional time. We've mitigated the disk latency problem but failed miserably at everything else. Memory-mapped files are the way out of this madness:
When you use file mapping, the kernel maps your program's virtual pages directly onto the page cache. This can deliver a significant performance boost: Windows System Programming reports run time improvements of 30% and up relative to regular file reads, while similar figures are reported for Linux and Solaris in Advanced Programming in the Unix Environment. You might also save large amounts of physical memory, depending on the nature of your application.
As always with performance, measurement is everything, but memory mapping earns its keep in a programmer's toolbox. The API is pretty nice too, it allows you to access a file as bytes in memory and does not require your soul and code readability in exchange for its benefits. Mind your address space and experiment with mmap in Unix-like systems, CreateFileMapping in Windows, or the many wrappers available in high level languages. When you map a file its contents are not brought into memory all at once, but rather on demand via page faults. The fault handler maps your virtual pages onto the page cache after obtaining a page frame with the needed file contents. This involves disk I/O if the contents weren't cached to begin with.
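A minimal sketch of reading scene.dat through a mapping instead of read() might look like this. It is not the article's program; the file name comes from the article, and the checksum loop is only there to touch the pages and trigger the faults described above.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("scene.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file; the virtual pages point at the page cache,
       so there is no per-read copy into a user buffer. */
    const unsigned char *data =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touching a byte faults the page in on first access; the fault handler
       maps the virtual page onto the page cache (reading from disk if needed). */
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];
    printf("checksum: %lu over %lld bytes\n", sum, (long long)st.st_size);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```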
Now for a pop quiz. Imagine that the last instance of our render program exits. Would the pages storing scene.dat in the page cache be freed immediately? People often think so, but that would be a bad idea. When you think about it, it is very common for us to create a file in one program, exit, then use the file in a second program. The page cache must handle that case. When you think more about it, why should the kernel ever get rid of page cache contents? Remember that disk is 5 orders of magnitude slower than RAM, hence a page cache hit is a huge win. So long as there's enough free physical memory, the cache should be kept full. It is therefore not dependent on a particular process, but rather it's a system-wide resource. If you run render a week from now and scene.dat is still cached, bonus! This is why the kernel cache size climbs steadily until it hits a ceiling. It's not because the OS is garbage and hogs your RAM, it's actually good behavior because in a way free physical memory is a waste. Better to use as much of it for caching as possible.
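On Linux you can watch this system-wide cache grow yourself: the "Cached:" field in /proc/meminfo reports (roughly) how much memory the page cache is currently using. A small illustrative snippet to print it:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* /proc/meminfo lists memory statistics; the "Cached:" line is
       roughly the current size of the page cache, in kilobytes. */
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "Cached:", 7) == 0) {
            fputs(line, stdout);
            break;
        }
    }
    fclose(f);
    return 0;
}
```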
Due to the page cache architecture, when a program calls write() bytes are simply copied to the page cache and the page is marked dirty. Disk I/O normally does not happen immediately, thus your program doesn't block waiting for the disk. On the downside, if the computer crashes your writes will never make it, hence critical files like database transaction logs must be fsync()ed (though one must still worry about drive controller caches, oy!). Reads, on the other hand, normally block your program until the data is available. Kernels employ eager loading to mitigate this problem, an example of which is read ahead where the kernel preloads a few pages into the page cache in anticipation of your reads. You can help the kernel tune its eager loading behavior by providing hints on whether you plan to read a file sequentially or randomly (see madvise() and readahead(), or the Windows cache hints). Linux does read-ahead for memory-mapped files, but I'm not sure about Windows. Finally, it's possible to bypass the page cache using O_DIRECT in Linux or FILE_FLAG_NO_BUFFERING in Windows, something database software often does.
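Here is a small sketch of that write path (the file name journal.log is made up): write() only dirties page cache pages, fsync() is what forces them to disk, and posix_fadvise() is one way to pass the sequential-access hint mentioned above.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* "journal.log" is a made-up stand-in for a critical file such as a
       database transaction log. */
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Hint that the file will be accessed sequentially, letting the kernel
       tune its eager loading (read-ahead) for it. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    const char record[] = "commit txn 42\n";

    /* write() copies the bytes into the page cache and marks the page dirty;
       it normally returns before any disk I/O happens. */
    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* fsync() blocks until the dirty pages (and metadata) reach the disk,
       which is what a transaction log needs before acknowledging a commit. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```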
A file mapping may be private or shared. This refers only to updates made to the contents in memory: in a private mapping the updates are not committed to disk or made visible to other processes, whereas in a shared mapping they are. Kernels use the copy on write mechanism, enabled by page table entries, to implement private mappings. In the example below, both render and another program called render3d (am I creative or what?) have mapped scene.dat privately. Render then writes to its virtual memory area that maps the file:
The read-only page table entries shown above do not mean the mapping is read only, they're merely a kernel trick to share physical memory until the last possible moment. You can see how 'private' is a bit of a misnomer until you remember it only applies to updates. A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from. Once copy-on-write is done, changes by others are no longer seen. This behavior is not guaranteed by the kernel, but it's what you get in x86 and makes sense from an API perspective. By contrast, a shared mapping is simply mapped onto the page cache and that's it. Updates are visible to other processes and end up on disk. Finally, if the mapping above were read-only, page faults would trigger a segmentation fault instead of copy on write.
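A minimal sketch contrasting the two mapping types (the file name and written bytes are made up): a write through the MAP_SHARED mapping dirties the page cache page and eventually reaches the file, while a write through the MAP_PRIVATE mapping triggers copy-on-write into a page frame that only this process sees.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("scene.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Two mappings of the same first page of the file. */
    char *shared  = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,  fd, 0);
    char *private = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (shared == MAP_FAILED || private == MAP_FAILED) { perror("mmap"); return 1; }

    private[0] = 'P';  /* copy-on-write: this process gets its own page frame   */
    shared[0]  = 'S';  /* dirties the page cache page; ends up in the file      */

    /* After copy-on-write, the private copy no longer sees updates made
       through the shared mapping. Prints: shared sees 'S', private sees 'P'. */
    printf("shared sees '%c', private sees '%c'\n", shared[0], private[0]);

    munmap(shared, 4096);
    munmap(private, 4096);
    close(fd);
    return 0;
}
```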
Dynamically loaded libraries are brought into your program's address space via file mapping. There's nothing magical about it, it's the same private file mapping available to you via regular APIs. Below is an example showing part of the address spaces from two running instances of the file-mapping render program, along with physical memory, to tie together many of the concepts we've seen.
Melody_lu123: Note that the stack and heap each map to their own page frames, while the shared libraries and the file-mapped scene.dat share the same page frames.
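To observe these library mappings on Linux, you can dump /proc/self/maps, which lists every mapping in a process together with its permissions and backing file; lines naming ld.so or libc are the private file mappings described above. A small illustrative snippet:

```c
#include <stdio.h>

int main(void)
{
    /* /proc/self/maps shows each mapping's address range, permissions,
       offset and backing file for the current process. */
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);

    fclose(f);
    return 0;
}
```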