Category: LINUX
2011-01-22 13:33:02
Performance is significantly improved.
The first step is the introduction of two new lock types, the first of which is called a "local/global lock" (lglock). An lglock is intended to provide very fast access to per-CPU data while making it possible (at a rather higher cost) to get at another CPU's data.
Sometimes it is necessary to globally lock the lglock:
A call to lg_global_lock() will go through the entire array, acquiring the spinlock for every CPU. Needless to say, this will be a very expensive operation.

The first use of lglocks is to protect the list of open files which is attached to each superblock structure. This list is currently protected by the global files_lock, which becomes a bottleneck when a lot of open() and close() calls are being made. In 2.6.36, the list of open files becomes a per-CPU array, with each CPU managing its own list. When a file is opened, a (cheap) call to lg_local_lock() suffices to protect the local list while the new file is added.

When a file is closed, things are just a bit more complicated. There is no guarantee that the file will be on the local CPU's list, so the VFS must be prepared to reach across to another CPU's list to clean things up. That, of course, is what lg_local_lock_cpu() is for.
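A minimal sketch of the pattern described above (not the actual 2.6.36 code; the per-CPU list and function names here are illustrative assumptions, and lg_lock_init() setup is omitted):

#include <linux/lglock.h>
#include <linux/list.h>
#include <linux/percpu.h>

DEFINE_LGLOCK(files_lglock);
static DEFINE_PER_CPU(struct list_head, sb_files);

/* open() path: only this CPU's list is touched, so the cheap local
 * lock suffices (it also pins us to this CPU while held). */
static void sb_file_add(struct list_head *entry)
{
	lg_local_lock(files_lglock);
	list_add(entry, this_cpu_ptr(&sb_files));
	lg_local_unlock(files_lglock);
}

/* close() path: the entry may be on another CPU's list, so that CPU's
 * spinlock is taken explicitly. */
static void sb_file_del(struct list_head *entry, int cpu)
{
	lg_local_lock_cpu(files_lglock, cpu);
	list_del_init(entry);
	lg_local_unlock_cpu(files_lglock, cpu);
}

/* Walking every CPU's list requires the expensive global lock, which
 * acquires every CPU's spinlock in turn. */
static void sb_file_walk_all(void)
{
	int cpu;

	lg_global_lock(files_lglock);
	for_each_possible_cpu(cpu) {
		struct list_head *list = &per_cpu(sb_files, cpu);
		/* ... walk the list ... */
		(void)list;
	}
	lg_global_unlock(files_lglock);
}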
One billion files on Linux
A few conclusions:
mkfs: ext4 generally performs better than ext3. Ext3/4 are much slower than the others at creating filesystems, due to the need to create the static inode tables.
fsck: Everybody except ext3 performs reasonably well when running fsck, but checking a large filesystem requires a lot of memory with ext4 and XFS.
removing: The big loser when it comes to removing those million files is XFS.
The new prlimit() system call is meant to (someday) replace setrlimit(); the differences include the ability to modify limits belonging to other processes and the ability to query and set a limit in a single operation.
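A small userspace sketch of that interface, assuming the glibc prlimit() wrapper is available; the target PID and limit values below are placeholders:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/resource.h>

int main(void)
{
	pid_t target = 1234;	/* placeholder PID of some other process */
	struct rlimit new_limit = { .rlim_cur = 4096, .rlim_max = 4096 };
	struct rlimit old_limit;

	/* Unlike setrlimit(), this acts on another process and returns
	 * the old limit while installing the new one, in a single call. */
	if (prlimit(target, RLIMIT_NOFILE, &new_limit, &old_limit) == -1) {
		perror("prlimit");
		return 1;
	}
	printf("previous soft limit: %llu\n",
	       (unsigned long long)old_limit.rlim_cur);
	return 0;
}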
The current rules for inode eviction are pretty simple:
1) ->drop_inode() is called when we release the last reference to struct inode. It tells us whether the fs wants the inode to be evicted (as opposed to retained in the inode cache). It doesn't do the actual eviction (as it used to), just returns an int. The normal policy is "if it's unhashed or has no links left, evict it now"; generic_drop_inode() does these checks. NULL ->drop_inode means that it'll be used. generic_delete_inode() is "just evict it". Or the fs can set rules of its own; grep and you'll see.
2) ->delete_inode() and ->clear_inode() are gone; ->evict_inode() is called in all cases when an inode (without in-core references to it) is about to be kicked out, no matter why that happens (->drop_inode() saying it shouldn't be kept around, memory pressure, umount, etc.). It will be called exactly once per inode's lifetime. Once it returns, the inode is basically just a piece of memory about to be freed.
3) ->evict_inode() _must_ call end_writeback(inode) at some point. At that point all async access from the VFS (writeback, basically) will be completed and the inode will be the fs's to deal with. That's what the calls of clear_inode() in the original ->delete_inode() should turn into. Don't dirty an inode past that point; it never worked to start with (writeback logic would have refused to trigger ->write_inode() on such inodes) and now it'll be detected and whined about.
4) Kicking the pages out of the page cache (== calling truncate_inode_pages()) is up to the ->evict_inode() instance; that was already the case for ->delete_inode(), but not for ->clear_inode(). Of course, if the fs doesn't use the page cache for that inode, it doesn't have to bother. Other than that, an ->evict_inode() instance is basically a mix of the old ->clear_inode() and ->delete_inode(). Inodes with a NULL ->evict_inode() behave exactly as ones with NULL ->delete_inode() and NULL ->clear_inode() used to.
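Putting rules 2-4 together, an ->evict_inode() implementation under these rules might look roughly like the sketch below; "myfs" and its private cleanup are assumptions, not code from any real filesystem:

#include <linux/fs.h>
#include <linux/mm.h>

static void myfs_evict_inode(struct inode *inode)
{
	/* Rule 4: evicting page-cache pages is this method's job. */
	truncate_inode_pages(&inode->i_data, 0);

	if (!inode->i_nlink && !is_bad_inode(inode)) {
		/* ... free the on-disk inode and its blocks ... */
	}

	/* Rule 3: must be called; after this, writeback is finished and
	 * the inode must not be dirtied again. */
	end_writeback(inode);

	/* ... release any fs-private in-core state ... */
}

static const struct super_operations myfs_super_ops = {
	/* Rule 1: leaving .drop_inode NULL means generic_drop_inode()
	 * decides whether to evict or keep the inode cached. */
	.evict_inode	= myfs_evict_inode,
};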
Testing tools
In the xfstests package, about 70 of the 240 tests are now generic. Xfstests is concerned primarily with regression testing; it is not, generally, a performance-oriented test suite.
Tests must be run under most or all reasonable combinations of mount options to get good coverage. Ric Wheeler also pointed out that different types of storage have very different characteristics. Tests which exercise more of the virtual memory and I/O paths would also be nice; there is one package which covers much of this ground: FIO. Destructive power-failure testing is another useful area which Red Hat (at least) is beginning to do. There has also been some work done using hdparm to corrupt individual sectors on disk to see how filesystems respond. A wishlist item was better latency measurement, with an emphasis on seeing how long I/O requests sit within drivers which do their own queueing.
Memory-management testing
Filesystem freeze/thaw
The filesystem freeze feature enables a system administrator to suspend writes to a filesystem, allowing it to be backed up or snapshotted while in a consistent state. It had its origins in XFS, but has since become part of the Linux VFS layer. Some problems remain, the biggest of which is unmounting: the proper way to handle freeze is to return a file descriptor; as long as that file descriptor is held open, the filesystem remains frozen. This solves the "last process exits" problem because the file descriptor will be closed as the process exits, automatically causing the filesystem to be thawed.

Barriers
Transparent hugepages
mmap_sem
The memory map semaphore (mmap_sem) is a reader-writer semaphore which protects the tree of virtual memory area (VMA) structures describing each address space. It is, Nick Piggin says, one of the last nasty locking issues left in the virtual memory subsystem. Like many busy, global locks, mmap_sem can cause scalability problems through cache line bouncing.
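For context, a typical pattern in which that semaphore shows up is sketched below; this is a hedged illustration of the locking rule, not any specific kernel call site:

#include <linux/mm.h>
#include <linux/sched.h>

/* Look up the VMA covering an address.  Any walk of the VMA tree must
 * hold mmap_sem, here in read (shared) mode; writers such as mmap() and
 * munmap() take it exclusively, which is where the cache-line bouncing
 * and scalability problems come from. */
static unsigned long vma_flags_for(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	unsigned long flags = 0;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (vma && vma->vm_start <= addr)
		flags = vma->vm_flags;
	up_read(&mm->mmap_sem);

	return flags;
}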
Dirty limits
The tuning knob found at /proc/sys/vm/dirty_ratio contains a number representing a percentage of total memory. Any time that the number of dirty pages in the system exceeds that percentage, processes which are actually writing data will be forced to perform some writeback directly. This policy has a couple of useful results: it helps keep memory from becoming entirely filled with dirty pages, and it serves to throttle the processes which are creating dirty pages in the first place.
The default value for dirty_ratio is 20, but that turns out to be too low for a number of applications. For this reason, distributions like RHEL raise this limit to 40% by default.
But 40% is not an ideal number either; it can lead to a lot of wasted memory when the system's workloads are mostly sequential. Lots of dirty pages can also cause fsync() calls to take a very long time, especially with the ext3 filesystem. What's really needed is a way to set this parameter in a more automatic, adaptive way, but exactly how that should be done is not entirely clear.
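To make the percentage concrete, here is a small userspace sketch that reads dirty_ratio and estimates the corresponding byte threshold; using sysconf() physical memory as the base is an approximation (the kernel actually computes the limit against dirtyable memory):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long ratio = 0;
	long long pages = sysconf(_SC_PHYS_PAGES);
	long long page_size = sysconf(_SC_PAGE_SIZE);
	FILE *f = fopen("/proc/sys/vm/dirty_ratio", "r");

	if (!f || fscanf(f, "%ld", &ratio) != 1) {
		perror("/proc/sys/vm/dirty_ratio");
		return 1;
	}
	fclose(f);

	/* Roughly: once more than this many bytes are dirty, processes
	 * writing data are forced to perform writeback themselves. */
	printf("dirty_ratio = %ld%% -> threshold ~ %lld MiB\n",
	       ratio, pages * page_size * ratio / 100 / (1024 * 1024));
	return 0;
}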
Lightning talks
There are two fundamental types of workload at Google. "Shared" workloads work like classic mainframe batch jobs, contending for resources while the system tries to isolate them from each other. "Dedicated workloads" are the ones which actually make money for Google - indexing, searching, and such - and are very sensitive to performance degradation. In general, any new kernel which shows a 1% or higher performance regression is deemed to not be good enough.
The workloads exhibit a lot of big, sequential writes and smaller, random reads. Disk I/O latencies matter a lot for dedicated workloads; 15ms latencies can cause phone calls to the development group. The systems are typically doing direct I/O on not-too-huge files, with logging happening on the side. The disk is shared between jobs, with the I/O bandwidth controller used to arbitrate between them.
Why is direct I/O used? It's a decision which dates back to the 2.2 days, when buffered I/O worked less well than it does now. Things have gotten better, but, meanwhile, Google has moved much of its buffer cache management into user space. It works much like enterprise database systems do, and, chances are, that will not change in the near future.
Google uses the "fake NUMA" feature to partition system memory into 128MB chunks. These chunks are assigned to jobs, which are managed by control groups. The intent is to firmly isolate all of these jobs, but writeback still can cause interference between them.