Category: LINUX
2011-01-22 13:33:02
Performance is significantly improved.
The first step is the introduction of two new lock types, the first of which is called a "local/global lock" (lglock). An lglock is intended to provide very fast access to per-CPU data while making it possible (at a rather higher cost) to get at another CPU's data.
Sometimes it is necessary to globally lock the lglock:
A call to lg_global_lock() will go through the entire array, acquiring the spinlock for every CPU. Needless to say, this will be a very expensive operation.

The first use of lglocks is to protect the list of open files which is attached to each superblock structure. This list is currently protected by the global files_lock, which becomes a bottleneck when a lot of open() and close() calls are being made. In 2.6.36, the list of open files becomes a per-CPU array, with each CPU managing its own list. When a file is opened, a (cheap) call to lg_local_lock() suffices to protect the local list while the new file is added.

When a file is closed, things are just a bit more complicated. There is no guarantee that the file will be on the local CPU's list, so the VFS must be prepared to reach across to another CPU's list to clean things up. That, of course, is what lg_local_lock_cpu() is for.
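A minimal sketch of the pattern described above (not the actual 2.6.36 code; the per-CPU list and function names here are illustrative assumptions, and lg_lock_init() setup is omitted):

#include <linux/lglock.h>
#include <linux/list.h>
#include <linux/percpu.h>

DEFINE_LGLOCK(files_lglock);
static DEFINE_PER_CPU(struct list_head, sb_files);

/* open() path: only this CPU's list is touched, so the cheap local
 * lock suffices (it also pins us to this CPU while held). */
static void sb_file_add(struct list_head *entry)
{
	lg_local_lock(files_lglock);
	list_add(entry, this_cpu_ptr(&sb_files));
	lg_local_unlock(files_lglock);
}

/* close() path: the entry may be on another CPU's list, so that CPU's
 * spinlock is taken explicitly. */
static void sb_file_del(struct list_head *entry, int cpu)
{
	lg_local_lock_cpu(files_lglock, cpu);
	list_del_init(entry);
	lg_local_unlock_cpu(files_lglock, cpu);
}

/* Walking every CPU's list requires the expensive global lock, which
 * acquires every CPU's spinlock in turn. */
static void sb_file_walk_all(void)
{
	int cpu;

	lg_global_lock(files_lglock);
	for_each_possible_cpu(cpu) {
		struct list_head *list = &per_cpu(sb_files, cpu);
		/* ... walk the list ... */
		(void)list;
	}
	lg_global_unlock(files_lglock);
}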
One billion files on Linux
A few conclusions:
mkfs: ext4 generally performs better than ext3. Ext3/4 are much slower than the others at creating filesystems, due to the need to create the static inode tables.
fsck: Everybody except ext3 performs reasonably well when running fsck, but checking a large filesystem requires a lot of memory with ext4 and XFS.
removing: The big loser when it comes to removing those million files is XFS.
The new prlimit() system call is meant to (someday) replace setrlimit(); the differences include the ability to modify limits belonging to other processes and the ability to query and set a limit in a single operation.
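A small userspace sketch of that interface, assuming the glibc prlimit() wrapper is available; the target PID and limit values below are placeholders:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/resource.h>

int main(void)
{
	pid_t target = 1234;	/* placeholder PID of some other process */
	struct rlimit new_limit = { .rlim_cur = 4096, .rlim_max = 4096 };
	struct rlimit old_limit;

	/* Unlike setrlimit(), this acts on another process and returns
	 * the old limit while installing the new one, in a single call. */
	if (prlimit(target, RLIMIT_NOFILE, &new_limit, &old_limit) == -1) {
		perror("prlimit");
		return 1;
	}
	printf("previous soft limit: %llu\n",
	       (unsigned long long)old_limit.rlim_cur);
	return 0;
}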
The current rules for inode eviction are pretty simple:
1) ->drop_inode() is called when we release the last reference to struct inode. It tells us whether the fs wants the inode to be evicted (as opposed to retained in the inode cache). It doesn't do the actual eviction (as it used to), just returns an int. The normal policy is "if it's unhashed or has no links left, evict it now"; generic_drop_inode() does these checks. NULL ->drop_inode means that it'll be used. generic_delete_inode() is "just evict it". Or the fs can set rules of its own; grep and you'll see.
2) ->delete_inode() and ->clear_inode() are gone; ->evict_inode() is called in all cases when an inode (without in-core references to it) is about to be kicked out, no matter why that happens (->drop_inode() saying it shouldn't be kept around, memory pressure, umount, etc.). It will be called exactly once per inode's lifetime. Once it returns, the inode is basically just a piece of memory about to be freed.
3) ->evict_inode() _must_ call end_writeback(inode) at some point. At that point all async access from the VFS (writeback, basically) will be completed and the inode will be the fs's to deal with. That's what the calls of clear_inode() in the original ->delete_inode() should turn into. Don't dirty an inode past that point; it never worked to start with (writeback logic would have refused to trigger ->write_inode() on such inodes) and now it'll be detected and whined about.
4) Kicking the pages out of the page cache (== calling truncate_inode_pages()) is up to the ->evict_inode() instance; that was already the case for ->delete_inode(), but not for ->clear_inode(). Of course, if the fs doesn't use the page cache for that inode, it doesn't have to bother. Other than that, an ->evict_inode() instance is basically a mix of the old ->clear_inode() and ->delete_inode(). Inodes with a NULL ->evict_inode() behave exactly as ones with NULL ->delete_inode() and NULL ->clear_inode() used to.
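Putting rules 2-4 together, an ->evict_inode() implementation under these rules might look roughly like the sketch below; "myfs" and its private cleanup are assumptions, not code from any real filesystem:

#include <linux/fs.h>
#include <linux/mm.h>

static void myfs_evict_inode(struct inode *inode)
{
	/* Rule 4: evicting page-cache pages is this method's job. */
	truncate_inode_pages(&inode->i_data, 0);

	if (!inode->i_nlink && !is_bad_inode(inode)) {
		/* ... free the on-disk inode and its blocks ... */
	}

	/* Rule 3: must be called; after this, writeback is finished and
	 * the inode must not be dirtied again. */
	end_writeback(inode);

	/* ... release any fs-private in-core state ... */
}

static const struct super_operations myfs_super_ops = {
	/* Rule 1: leaving .drop_inode NULL means generic_drop_inode()
	 * decides whether to evict or keep the inode cached. */
	.evict_inode	= myfs_evict_inode,
};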
Testing tools
In the xfstests package, about 70 of the 240 tests are now generic. Xfstests is concerned primarily with regression testing; it is not, generally, a performance-oriented test suite.
Tests must be run under most or all reasonable combinations of mount options to get good coverage. Ric Wheeler also pointed out that different types of storage have very different characteristics. Tests which exercise more of the virtual memory and I/O paths would also be nice; there is one package which covers much of this ground: FIO. Destructive power-failure testing is another useful area which Red Hat (at least) is beginning to do. There has also been some work done using hdparm to corrupt individual sectors on disk to see how filesystems respond. A wishlist item was better latency measurement, with an emphasis on seeing how long I/O requests sit within drivers which do their own queueing.
Memory-management testing
Filesystem freeze/thaw
The filesystem freeze feature enables a system administrator to suspend writes to a filesystem, allowing it to be backed up or snapshotted while in a consistent state. It had its origins in XFS, but has since become part of the Linux VFS layer. Some problems remain, the biggest of which is unmounting: the proper way to handle freeze is to return a file descriptor; as long as that file descriptor is held open, the filesystem remains frozen. This solves the "last process exits" problem because the file descriptor will be closed as the process exits, automatically causing the filesystem to be thawed.

Barriers
Transparent hugepages
mmap_sem
The memory map semaphore (mmap_sem) is a reader-writer semaphore which protects the tree of virtual memory area (VMA) structures describing each address space. It is, Nick Piggin says, one of the last nasty locking issues left in the virtual memory subsystem. Like many busy, global locks, mmap_sem can cause scalability problems through cache line bouncing.
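For context, a typical pattern in which that semaphore shows up is sketched below; this is a hedged illustration of the locking rule, not any specific kernel call site:

#include <linux/mm.h>
#include <linux/sched.h>

/* Look up the VMA covering an address.  Any walk of the VMA tree must
 * hold mmap_sem, here in read (shared) mode; writers such as mmap() and
 * munmap() take it exclusively, which is where the cache-line bouncing
 * and scalability problems come from. */
static unsigned long vma_flags_for(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	unsigned long flags = 0;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (vma && vma->vm_start <= addr)
		flags = vma->vm_flags;
	up_read(&mm->mmap_sem);

	return flags;
}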
Dirty limits
The tuning knob found at /proc/sys/vm/dirty_ratio contains a number representing a percentage of total memory. Any time that the number of dirty pages in the system exceeds that percentage, processes which are actually writing data will be forced to perform some writeback directly. This policy has a couple of useful results: it helps keep memory from becoming entirely filled with dirty pages, and it serves to throttle the processes which are creating dirty pages in the first place.
The default value for dirty_ratio is 20, but that turns out to be too low for a number of applications. For this reason, distributions like RHEL raise this limit to 40% by default.
But 40% is not an ideal number either; it can lead to a lot of wasted memory when the system's workloads are mostly sequential. Lots of dirty pages can also cause fsync() calls to take a very long time, especially with the ext3 filesystem. What's really needed is a way to set this parameter in a more automatic, adaptive way, but exactly how that should be done is not entirely clear.
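To make the percentage concrete, here is a small userspace sketch that reads dirty_ratio and estimates the corresponding byte threshold; using sysconf() physical memory as the base is an approximation (the kernel actually computes the limit against dirtyable memory):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long ratio = 0;
	long long pages = sysconf(_SC_PHYS_PAGES);
	long long page_size = sysconf(_SC_PAGE_SIZE);
	FILE *f = fopen("/proc/sys/vm/dirty_ratio", "r");

	if (!f || fscanf(f, "%ld", &ratio) != 1) {
		perror("/proc/sys/vm/dirty_ratio");
		return 1;
	}
	fclose(f);

	/* Roughly: once more than this many bytes are dirty, processes
	 * writing data are forced to perform writeback themselves. */
	printf("dirty_ratio = %ld%% -> threshold ~ %lld MiB\n",
	       ratio, pages * page_size * ratio / 100 / (1024 * 1024));
	return 0;
}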
Lightning talks
There are two fundamental types of workload at Google. "Shared" workloads work like classic mainframe batch jobs, contending for resources while the system tries to isolate them from each other. "Dedicated workloads" are the ones which actually make money for Google - indexing, searching, and such - and are very sensitive to performance degradation. In general, any new kernel which shows a 1% or higher performance regression is deemed to not be good enough.
The workloads exhibit a lot of big, sequential writes and smaller, random reads. Disk I/O latencies matter a lot for dedicated workloads; 15ms latencies can cause phone calls to the development group. The systems are typically doing direct I/O on not-too-huge files, with logging happening on the side. The disk is shared between jobs, with the I/O bandwidth controller used to arbitrate between them.
Why is direct I/O used? It's a decision which dates back to the 2.2 days, when buffered I/O worked less well than it does now. Things have gotten better, but, meanwhile, Google has moved much of its buffer cache management into user space. It works much like enterprise database systems do, and, chances are, that will not change in the near future.
Google uses the "fake NUMA" feature to partition system memory into 128MB chunks. These chunks are assigned to jobs, which are managed by control groups. The intent is to firmly isolate all of these jobs, but writeback still can cause interference between them.