This issue is quite worth reading.
1 Running work in hardware interrupt context
The intended area of use is apparently code running from non-maskable interrupts which needs to be able to interact with the rest of the system.
2 Jump label
Purpose: remove the test entirely when the feature is disabled, in the same spirit as the earlier alternatives mechanism.
#define JUMP_LABEL(key, label)          \
        if (unlikely(*key))             \
                goto label;
How is even this unlikely() test eliminated?
At compile time, two things happen: (1) the location of the test and the key value are noted in a special table, and (2) a simple no-op instruction is inserted in place of the test.
A call to enable_jump_label() will look up the key in the jump label table, then replace the special no-op instructions with the assembly equivalent of "goto label", enabling the tracepoint. Disabling the jump label will cause the no-op instruction to be restored.
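A minimal sketch of what a call site might look like, assuming the JUMP_LABEL() macro above; trace_foo_enabled and trace_foo() are invented names, and the real tracepoint macros wrap this pattern for you.

/* trace_foo_enabled is the "key" recorded in the jump label table */
static int trace_foo_enabled;

static void trace_foo(void)
{
        /* emit the trace event */
}

void foo(void)
{
        /* compiles down to a no-op until enable_jump_label(&trace_foo_enabled)
         * patches it into a jump to the "trace" label */
        JUMP_LABEL(&trace_foo_enabled, trace);
        return;
trace:
        trace_foo();
}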
3 2.6.37 merge window, part 1
- The x86 architecture now uses separate stacks for interrupt handling when 8K stacks are in use. The option to use 4K stacks has been removed.
- The scheduler now works harder to avoid migrating high-priority realtime tasks. The scheduler also will no longer charge processor time used to handle interrupts to the process which happened to be running at the time.
- The block layer can now throttle I/O bandwidth to specific devices, controlled by the cgroup mechanism.
- Yet another RCU variant has been added: "tiny preempt RCU" is meant for uniprocessor systems. "This implementation uses but a single blocked-tasks list rather than the combinatorial number used per leaf rcu_node by TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies processing. This version also takes advantage of uniprocessor execution to accelerate grace periods in the case where there are no readers."
- A long list of changes to the memblock (formerly LMB) low-level management code has been merged, and the x86 architecture now uses memblock for its early memory management.
- The default handling for lseek() has changed: if a driver does not provide its own llseek() function, the VFS layer will cause all attempts to change the file position to fail with an ESPIPE error. All in-tree drivers which lacked llseek() functions have been changed to use noop_llseek(), which preserves the previous behavior (a minimal sketch appears after this list).
- The patch has been merged, with an associated API change. See the new Documentation/vm/highmem.txt file for details.
- Most of the work needed to from the block layer has been merged. This task will probably be completed before the closing of the merge window.
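As noted in the llseek() item above, a driver that wants to keep the old "seeks silently succeed" behavior must now say so explicitly. A minimal, hypothetical sketch (my_read() stands in for the driver's real read method):

#include <linux/fs.h>
#include <linux/module.h>

static ssize_t my_read(struct file *file, char __user *buf,
                       size_t count, loff_t *ppos)
{
        return 0;       /* placeholder: a real driver would copy data out */
}

static const struct file_operations my_fops = {
        .owner  = THIS_MODULE,
        .read   = my_read,
        .llseek = noop_llseek,  /* keep the pre-2.6.37 behavior; leaving
                                   .llseek unset now yields -ESPIPE */
};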
4 Resolving the inode scalability discussion
Goal: get rid of the global lock.
The global inode_lock is used within the virtual filesystem layer (VFS) to protect several data structures and a wide variety of inode-oriented operations.
Nick's patch set creates separate global locks for some of those resources: wb_inode_list_lock for the list of inodes under writeback, and inode_lru_lock for the list of inodes in the cache. The standalone inodes_stat statistics structure is converted over to atomic types. Then the existing i_lock per-inode spinlock is used to cover everything else in the inode structure.
Al would like to see the writeback locks taken prior to i_lock (because code tends to work from the list first, prior to attacking individual inodes), but he says the LRU lock should be taken after i_lock because code changing the LRU status of an inode will normally already have that inode's lock.
Al has also outlined the way he would like things to proceed.
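A rough sketch of the two lock orderings described above, using the lock names from Nick's patches; the surrounding functions are invented for illustration only.

#include <linux/fs.h>
#include <linux/spinlock.h>

/* Locks introduced by Nick's patch set, assumed to be declared elsewhere. */
extern spinlock_t wb_inode_list_lock;
extern spinlock_t inode_lru_lock;

static void touch_writeback_state(struct inode *inode)
{
        /* the writeback list lock is taken first, then the per-inode lock */
        spin_lock(&wb_inode_list_lock);
        spin_lock(&inode->i_lock);
        /* ... work on the inode's writeback state ... */
        spin_unlock(&inode->i_lock);
        spin_unlock(&wb_inode_list_lock);
}

static void touch_lru_state(struct inode *inode)
{
        /* code changing LRU status normally already holds i_lock,
         * so the LRU lock nests inside it */
        spin_lock(&inode->i_lock);
        spin_lock(&inode_lru_lock);
        /* ... move the inode on or off the LRU list ... */
        spin_unlock(&inode_lru_lock);
        spin_unlock(&inode->i_lock);
}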
5 Linux at NASDAQ OMX
NASDAQ OMX operates exchanges all over the world - and they run on Linux.
Latency, throughput, reliability.
To meet these requirements, NASDAQ OMX runs large clusters of thousands of machines. These clusters can process hundreds of millions of orders per day - up to one million orders per second - with 250µs latency.
The NAPI interrupt mitigation technique for network drivers has, on its own, freed up about 1/3 of the available CPU time for other work. The epoll system call cuts out much of the per-call overhead, taking 33µs off of the latency in one benchmark. Handling clock_gettime() in user space via the VDSO page cuts almost another 60ns. Bob was also quite pleased with how the Linux page cache works; it is effective enough, he says, to eliminate the need to use asynchronous I/O, simplifying the code considerably.
Drawbacks:
On the other hand, there are some things which have not worked out as well for them. These include I/O signals; they are complex to program with and, if things get busy, the signal queue can overflow. The user-space libaio asynchronous I/O (AIO) implementation is thread-based; it scales poorly, he says, and does not integrate well with epoll. Kernel-based asynchronous I/O, instead, lacks proper socket support. He also mentioned the recvmsg() system call, which requires a call into the kernel for every incoming packet.
Desired directions:
The new recvmmsg() system call can receive multiple packets with a single call.
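A minimal user-space sketch of that batched receive path (UDP assumed, error handling trimmed; the buffer sizes are arbitrary):

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define VLEN   16
#define BUFLEN 1500

/* Drain up to VLEN packets from a socket with a single system call. */
int drain_socket(int sock)
{
        struct mmsghdr msgs[VLEN];
        struct iovec iovecs[VLEN];
        char bufs[VLEN][BUFLEN];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < VLEN; i++) {
                iovecs[i].iov_base = bufs[i];
                iovecs[i].iov_len  = BUFLEN;
                msgs[i].msg_hdr.msg_iov    = &iovecs[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
        }
        /* with plain recvmsg(), each packet would cost its own kernel entry */
        return recvmmsg(sock, msgs, VLEN, MSG_DONTWAIT, NULL);
}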
What NASDAQ OMX would really like to see in Linux now is good socket-based AIO. That would make it possible to replace epoll/recvmsg/sendmsg sequences with fewer system calls. Even better would be if the kernel could provide notifications for multiple events at a time.
The 2.6.36 kernel is out, released on October 20.
The first four items are not worth a second look.
1 Dueling inode scalability patches
A contest between two competing approaches.
2 IMA memory hog
A design flaw causes the integrity measurement architecture (IMA) to occupy a large amount of radix tree memory.
3 Shielding driver authors from locking
A clash between two viewpoints: the appeal of taking care of locking for driver authors and letting them concentrate on getting their hardware to do reasonable things is clear, especially if it makes the code review process easier as well. Such efforts may ultimately be successful, but there can be no doubt that they will run into disagreement from those who feel that kernel developers should either understand what is going on or switch to Java development.
4 A netlink-based user-space crypto API
How should user space get access to the kernel cryptography subsystem? A new proposal.
5 trace-cmd: A front-end for Ftrace (highlight)
Previous LWN articles have explained the basic way to use Ftrace directly through the debugfs filesystem. This one introduces a command-line tool that works with Ftrace.
1 Linsched for 2.6.35 released
Linsched is a user-space simulator intended to run the Linux scheduling subsystem; it is intended to help developers working on improving (or understanding) the scheduler.
2 No fanotify for 2.6.36 (blocked for now)
3 ARM's multiply-mapped memory mess
The ioremap() function is used to map I/O memory for CPU use. When ioremap() is used on system memory that already has a normal kernel mapping, the behavior is undefined, but many current ARM drivers depend on doing exactly that and are reluctant to change.
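A sketch of the pattern the article warns about, with invented names; buf_phys is assumed to point into memory that the kernel already maps as ordinary RAM.

#include <linux/io.h>
#include <linux/types.h>

/* This creates a second mapping, with different attributes, of memory that
 * already has a normal kernel mapping; on ARMv6 and later the architecture
 * does not define what happens, yet drivers have long relied on it working. */
void __iomem *map_carved_out_buffer(phys_addr_t buf_phys, size_t buf_size)
{
        return ioremap(buf_phys, buf_size);
}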
4 Synaptics multitouch coming - maybe
Patents and reverse engineering, once again.
5 Statistics for the 2.6.36 development cycle
Because a bunch of defconfig files were deleted:
Perhaps more interesting is this set of numbers: in 2.6.36, the development community added 604,000 lines of code and deleted 651,000 - for a total loss of almost 47,000 lines of code. This is the first time since the beginning of the git era that the size of the kernel source has gone down.
The proportion of code changed was also unusually small:
At 1.6% of the total, 2.6.36 represents a relatively small piece of the total code base - the smallest for a long time. Almost 29% of the kernel code still dates back to the beginning of the git era (2.6.12), down from 31% last February. While much of our kernel code is quite new - 31% of the code comes from 2.6.30 or newer - much of it has also hung around for a long time.
1 Little-endian PowerPC
PowerPC belongs to the big-endian camp, but at least some PowerPC processors can optionally be run in a little-endian mode.
Why is this needed? A number of GPUs, especially those aimed at embedded applications, only work in the little-endian mode. If the PowerPC runs in big-endian mode, the driver and user-space programs have to do a lot of extra work; running the processor in little-endian mode will nicely overcome that obstacle.
The remaining problem: the kernel patches exist, but there are toolchain changes required which are not, yet, generally available.
2 Trusted and encrypted keys
Current status:
While the TPM-using integrity measurement architecture (IMA), which can measure and attest to the integrity of a running Linux system, has been part of the kernel for some time now, the related extended verification module (EVM) has not made it into the mainline. The existing IMA code only solves part of the integrity problem, leaving the detection of offline attacks against disk files (e.g. by mounting the disk under another OS) to EVM.
How to implement EVM better:
Mimi Zohar's patch set could also be used for other purposes such as handling the keys for filesystem encryption.
The basic idea is that these keys would be generated by the kernel, and would never be touched by user space in an unencrypted form. Encrypted "blobs" would be provided to user space by the kernel and would contain the key material. User space could store the keys, for example, but the blobs would be completely opaque to anything outside of the kernel. The patches come with two new flavors of these in-kernel keys: trusted and encrypted.
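A hedged user-space sketch of creating such keys through add_key(2) (via libkeyutils); the key names, lengths, and payload strings follow the pattern described above as I understand it and are purely illustrative.

#include <keyutils.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* Ask the kernel to generate a 32-byte trusted key; user space only
         * ever sees the sealed blob, never the raw key material. */
        key_serial_t kmk = add_key("trusted", "kmk", "new 32",
                                   strlen("new 32"), KEY_SPEC_USER_KEYRING);
        if (kmk < 0) {
                perror("add_key(trusted)");
                return 1;
        }

        /* An encrypted key whose material is protected by the trusted key. */
        key_serial_t evm = add_key("encrypted", "evm-key",
                                   "new trusted:kmk 32",
                                   strlen("new trusted:kmk 32"),
                                   KEY_SPEC_USER_KEYRING);
        if (evm < 0) {
                perror("add_key(encrypted)");
                return 1;
        }

        printf("trusted key %d, encrypted key %d\n", kmk, evm);
        return 0;
}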
3 Two ABI troubles
One issue is that an existing ABI should not be changed; the other is whether tracepoints should become part of the ABI.
4 Solid-state storage devices and the block layer
Lots of technical detail, worth a close look; see the original article for the full discussion.
The arrival of SSDs has exposed bottlenecks in the Linux filesystem and block layers.
Before SSDs: most I/O patterns are dominated by random I/O and relatively small requests. Thus, getting the best results requires being able to perform a large number of I/O operations per second (IOPS). With a high-end rotating drive (running at 15,000 RPM), the maximum rate possible is about 500 IOPS. Most real-world drives, of course, will have significantly slower performance and lower I/O rates.
With SSDs: by eliminating seeks and rotational delays, we have gone from hundreds of IOPS to hundreds of thousands of IOPS in a very short period of time.
Improvements:
- First, identify the device as an SSD; since hardware detection is not always reliable, the /sys/block/<device>/queue/rotational flag is provided (a small user-space sketch for reading it appears after this list).
- Eliminate, or at least weaken, queue plugging for SSDs. On a rotating disk, the first I/O operation to show up in the request queue will cause the queue to be "plugged," meaning that no operations will actually be dispatched to the hardware. The idea behind plugging is that, by allowing a little time for additional I/O requests to arrive, the block layer will be able to merge adjacent requests (reducing the operation count) and sort them into an optimal order, increasing performance. Performance on SSDs tends not to benefit from this treatment, though there is still a little value to merging requests. Dropping (or, at least, reducing) plugging not only eliminates a needless delay; it also reduces the need to take the queue lock in the process.
- Improve request timeouts. The old implementation involved a separate timeout for each outstanding request, but that clearly does not scale when the number of such requests can be huge. The answer was to go to a per-queue timer, reducing the number of running timers considerably.
- Turn off the contribution to the entropy pool (queue/add_random). Rotating disks have inherently unpredictable execution times, but SSDs lack mechanical parts moving around, so their completion times are much more predictable. The reason to turn it off is that add_timer_randomness() has to acquire a global lock, causing unpleasant systemwide contention.
- Reduce lock contention. __make_request() is responsible for getting a request (represented by a BIO structure) onto the queue. Two lock acquisitions are required to do this job - three if the CFQ I/O scheduler is in use. Those two acquisitions are the result of a lock split done to reduce contention in the past; that split, when the system is handling requests at SSD speeds, makes things worse. Eliminating it led to a roughly 3% increase in IOPS with a reduction in CPU time on a 32-core system.
- Drop the I/O request allocation batching - a mechanism added to increase throughput on rotating drives by allowing the simultaneous submission of multiple requests.
- Drop the allocation accounting code, which tracks the number of requests in flight at any given time. Counting outstanding I/O operations requires global counters and the associated contention, but it can be done without most of the time.
- Reduce contention by keeping processing on the same CPU as often as possible: the submission of a specific I/O request and that request's completion should happen on the same CPU wherever possible, cutting down on lock bouncing and on slab objects being allocated on one CPU and freed on another. In the networking subsystem this problem has been addressed with techniques like receive packet steering, but block I/O controllers are not able to direct specific I/O completion interrupts to specific CPUs. The solution took the form of smp_call_function(), which performs fast cross-CPU calls. Using smp_call_function(), the block I/O completion code can direct the completion of specific requests to the CPU where those requests were initially submitted (see the second sketch after this list). The result is a relatively easy performance improvement.
- Interrupt mitigation, modeled on NAPI: the blk-iopoll code turns off completion interrupts when I/O traffic is high and uses polling to pick up completed events instead.
- "context plugging," a rework of
the queue plugging code,正在进行的工作. Currently, queue plugging is done
implicitly on I/O submission, with an explicit unplug required at a
later time. The plan is to make plugging and unplugging fully implicit,
but give I/O submitters a way to inform the block layer that more
requests are coming soon. It makes the code more clear and robust; it
also gets rid of a lot of expensive per-queue state which must be
maintained.
- A possible future technique (waiting on hardware support) which still has many open problems: a multiqueue block layer - an idea which, once again, came from the networking layer. The creation of multiple I/O queues for a given device will allow multiple processors to handle I/O requests simultaneously with less contention. It's currently hard to do, though, because block I/O controllers do not (yet) have multiqueue support.
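Two small sketches for the items above. First, reading the rotational flag mentioned in the first item from user space (the device name, e.g. "sda", is supplied by the caller):

#include <stdio.h>

/* Returns 0 for an SSD, 1 for a rotating disk, -1 on error. */
int is_rotational(const char *dev)
{
        char path[256];
        int val = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", dev);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%d", &val) != 1)
                val = -1;
        fclose(f);
        return val;
}

Second, a rough sketch of the completion-steering idea from the same-CPU item, built on smp_call_function_single(); every name here is invented, and this is not the real block-layer code:

#include <linux/smp.h>

struct my_request {
        int submit_cpu;         /* recorded when the request was queued */
        /* ... device-specific fields ... */
};

static void finish_request(struct my_request *rq)
{
        /* end-of-I/O processing: touches the same cache lines and slab
         * objects that were used at submission time */
}

static void run_completion(void *data)
{
        finish_request(data);
}

/* Completion path: steer the final processing back to the submitting CPU. */
void my_complete_request(struct my_request *rq)
{
        if (rq->submit_cpu == smp_processor_id())
                finish_request(rq);
        else
                smp_call_function_single(rq->submit_cpu,
                                         run_completion, rq, 0);
}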