This issue is quite worth reading.
1 Running work in hardware interrupt context
The intended area of use is apparently code running from non-maskable interrupts which needs to be able to interact with the rest of the system.
2 Jump label
Purpose: remove the test entirely when the feature is disabled, in the same spirit as the earlier alternatives mechanism.
#define JUMP_LABEL(key, label)          \
        if (unlikely(*key))             \
                goto label;
How is even this unlikely() test eliminated?
At compile time, two things happen: (1) the location of the test and the key value are noted in a special table, and (2) a simple no-op instruction is inserted in place of the test.
A call to enable_jump_label() will look up the key in the jump label table, then replace the special no-op instructions with the assembly equivalent of "goto label", enabling the tracepoint. Disabling the jump label will cause the no-op instruction to be restored.
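A minimal sketch of what a call site might look like, assuming the JUMP_LABEL() macro above; trace_foo_enabled and trace_foo() are invented names, and the real tracepoint macros wrap this pattern for you.

/* trace_foo_enabled is the "key" recorded in the jump label table */
static int trace_foo_enabled;

static void trace_foo(void)
{
        /* emit the trace event */
}

void foo(void)
{
        /* compiles down to a no-op until enable_jump_label(&trace_foo_enabled)
         * patches it into a jump to the "trace" label */
        JUMP_LABEL(&trace_foo_enabled, trace);
        return;
trace:
        trace_foo();
}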
3 2.6.37 merge window, part 1
- The x86 architecture now uses separate stacks for interrupt handling when 8K stacks are in use. The option to use 4K stacks has been removed.
- The scheduler now works harder to avoid migrating high-priority realtime tasks. The scheduler also will no longer charge processor time used to handle interrupts to the process which happened to be running at the time.
- The block layer can now throttle I/O bandwidth to specific devices, controlled by the cgroup mechanism.
- Yet another RCU variant has been added: "tiny preempt RCU" is meant for uniprocessor systems. "This implementation uses but a single blocked-tasks list rather than the combinatorial number used per leaf rcu_node by TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies processing. This version also takes advantage of uniprocessor execution to accelerate grace periods in the case where there are no readers."
- A long list of changes to the memblock (formerly LMB) low-level management code has been merged, and the x86 architecture now uses memblock for its early memory management.
- The default handling for lseek() has changed: if a driver does not provide its own llseek() function, the VFS layer will cause all attempts to change the file position to fail with an ESPIPE error. All in-tree drivers which lacked llseek() functions have been changed to use noop_llseek(), which preserves the previous behavior (a minimal sketch appears after this list).
- The patch has been merged, with an associated API change. See the new Documentation/vm/highmem.txt file for details.
- Most of the work needed to from the block layer has been merged. This task will probably be completed before the closing of the merge window.
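As noted in the llseek() item above, a driver that wants to keep the old "seeks silently succeed" behavior must now say so explicitly. A minimal, hypothetical sketch (my_read() stands in for the driver's real read method):

#include <linux/fs.h>
#include <linux/module.h>

static ssize_t my_read(struct file *file, char __user *buf,
                       size_t count, loff_t *ppos)
{
        return 0;       /* placeholder: a real driver would copy data out */
}

static const struct file_operations my_fops = {
        .owner  = THIS_MODULE,
        .read   = my_read,
        .llseek = noop_llseek,  /* keep the pre-2.6.37 behavior; leaving
                                   .llseek unset now yields -ESPIPE */
};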
4 Resolving the inode scalability discussion
Goal: get rid of the global lock.
The global inode_lock is used within the virtual filesystem layer (VFS) to protect several data structures and a wide variety of inode-oriented operations.
Nick's patch set creates separate global locks for some of those resources: wb_inode_list_lock for the list of inodes under writeback, and inode_lru_lock for the list of inodes in the cache. The standalone inodes_stat statistics structure is converted over to atomic types. Then the existing i_lock per-inode spinlock is used to cover everything else in the inode structure.
Al would like to see the writeback locks taken prior to i_lock (because code tends to work from the list first, prior to attacking individual inodes), but he says the LRU lock should be taken after i_lock because code changing the LRU status of an inode will normally already have that inode's lock.
Al has also outlined the way he would like things to proceed.
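A rough sketch of the two lock orderings described above, using the lock names from Nick's patches; the surrounding functions are invented for illustration only.

#include <linux/fs.h>
#include <linux/spinlock.h>

/* Locks introduced by Nick's patch set, assumed to be declared elsewhere. */
extern spinlock_t wb_inode_list_lock;
extern spinlock_t inode_lru_lock;

static void touch_writeback_state(struct inode *inode)
{
        /* the writeback list lock is taken first, then the per-inode lock */
        spin_lock(&wb_inode_list_lock);
        spin_lock(&inode->i_lock);
        /* ... work on the inode's writeback state ... */
        spin_unlock(&inode->i_lock);
        spin_unlock(&wb_inode_list_lock);
}

static void touch_lru_state(struct inode *inode)
{
        /* code changing LRU status normally already holds i_lock,
         * so the LRU lock nests inside it */
        spin_lock(&inode->i_lock);
        spin_lock(&inode_lru_lock);
        /* ... move the inode on or off the LRU list ... */
        spin_unlock(&inode_lru_lock);
        spin_unlock(&inode->i_lock);
}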
5 Linux at NASDAQ OMX
NASDAQ OMX operates exchanges all over the world - and they run on Linux.
Latency, throughput, reliability.
To meet these requirements, NASDAQ OMX runs large clusters of thousands of machines. These clusters can process hundreds of millions of orders per day - up to one million orders per second - with 250µs latency.
The NAPI interrupt mitigation technique for network drivers has, on its own, freed up about 1/3 of the available CPU time for other work. The epoll system call cuts out much of the per-call overhead, taking 33µs off of the latency in one benchmark. Handling clock_gettime() in user space via the VDSO page cuts almost another 60ns. Bob was also quite pleased with how the Linux page cache works; it is effective enough, he says, to eliminate the need to use asynchronous I/O, simplifying the code considerably.
Drawbacks:
On the other hand, there are some things which have not worked out as well for them. These include I/O signals; they are complex to program with and, if things get busy, the signal queue can overflow. The user-space libaio asynchronous I/O (AIO) implementation is thread-based; it scales poorly, he says, and does not integrate well with epoll. Kernel-based asynchronous I/O, instead, lacks proper socket support. He also mentioned the recvmsg() system call, which requires a call into the kernel for every incoming packet.
Desired directions:
The new recvmmsg() system call can receive multiple packets with a single call.
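A minimal user-space sketch of that batched receive path (UDP assumed, error handling trimmed; the buffer sizes are arbitrary):

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define VLEN   16
#define BUFLEN 1500

/* Drain up to VLEN packets from a socket with a single system call. */
int drain_socket(int sock)
{
        struct mmsghdr msgs[VLEN];
        struct iovec iovecs[VLEN];
        char bufs[VLEN][BUFLEN];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < VLEN; i++) {
                iovecs[i].iov_base = bufs[i];
                iovecs[i].iov_len  = BUFLEN;
                msgs[i].msg_hdr.msg_iov    = &iovecs[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
        }
        /* with plain recvmsg(), each packet would cost its own kernel entry */
        return recvmmsg(sock, msgs, VLEN, MSG_DONTWAIT, NULL);
}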
What NASDAQ OMX would really like to see in Linux now is good socket-based AIO. That would make it possible to replace epoll/recvmsg/sendmsg sequences with fewer system calls. Even better would be if the kernel could provide notifications for multiple events at a time.
The 2.6.36 kernel is out, released on October 20.
The first four items are not worth a second look.
1 Dueling inode scalability patches
A contest between two competing approaches.
2 IMA memory hog
A design flaw causes the integrity measurement architecture (IMA) to occupy a large amount of radix tree memory.
3 Shielding driver authors from locking
A clash between two viewpoints: the appeal of taking care of locking for driver authors and letting them concentrate on getting their hardware to do reasonable things is clear, especially if it makes the code review process easier as well. Such efforts may ultimately be successful, but there can be no doubt that they will run into disagreement from those who feel that kernel developers should either understand what is going on or switch to Java development.
4 A netlink-based user-space crypto API
How should user space get access to the kernel cryptography subsystem? A new proposal.
5 trace-cmd: A front-end for Ftrace (highlight)
Previous LWN articles have explained the basic way to use Ftrace directly through the debugfs filesystem. This one introduces a command-line tool that works with Ftrace.
1 Linsched for 2.6.35 released
Linsched is a user-space simulator intended to run the Linux scheduling subsystem; it is intended to help developers working on improving (or understanding) the scheduler.
2 No fanotify for 2.6.36 (blocked for now)
3 ARM's multiply-mapped memory mess
The ioremap() function is used to map I/O memory for CPU use. When ioremap() is used on system memory that already has a normal kernel mapping, the behavior is undefined, but many current ARM drivers depend on doing exactly that and are reluctant to change.
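A sketch of the pattern the article warns about, with invented names; buf_phys is assumed to point into memory that the kernel already maps as ordinary RAM.

#include <linux/io.h>
#include <linux/types.h>

/* This creates a second mapping, with different attributes, of memory that
 * already has a normal kernel mapping; on ARMv6 and later the architecture
 * does not define what happens, yet drivers have long relied on it working. */
void __iomem *map_carved_out_buffer(phys_addr_t buf_phys, size_t buf_size)
{
        return ioremap(buf_phys, buf_size);
}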
4 Synaptics multitouch coming - maybe
Patents and reverse engineering, once again.
5 Statistics for the 2.6.36 development cycle
Because a bunch of defconfig files were deleted:
Perhaps more interesting is this set of numbers: in 2.6.36, the development community added 604,000 lines of code and deleted 651,000 - for a total loss of almost 47,000 lines of code. This is the first time since the beginning of the git era that the size of the kernel source has gone down.
The proportion of code changed was also unusually small:
At 1.6% of the total, 2.6.36 represents a relatively small piece of the total code base - the smallest for a long time. Almost 29% of the kernel code still dates back to the beginning of the git era (2.6.12), down from 31% last February. While much of our kernel code is quite new - 31% of the code comes from 2.6.30 or newer - much of it has also hung around for a long time.
1 Little-endian PowerPC
PowerPC belongs to the big-endian camp, but at least some PowerPC processors can optionally be run in a little-endian mode.
Why is this needed? A number of GPUs, especially those aimed at embedded applications, only work in the little-endian mode. If the PowerPC runs in big-endian mode, the driver and user-space programs have to do a lot of extra work; running the processor in little-endian mode will nicely overcome that obstacle.
The remaining problem: the kernel patches exist, but there are toolchain changes required which are not, yet, generally available.
2 Trusted and encrypted keys
Current status:
While the TPM-using integrity measurement architecture (IMA), which can measure and attest to the integrity of a running Linux system, has been part of the kernel for some time now, the related extended verification module (EVM) has not made it into the mainline. The existing IMA code only solves part of the integrity problem, leaving the detection of offline attacks against disk files (e.g. by mounting the disk under another OS) to EVM.
How to implement EVM better:
Mimi Zohar's patch set could also be used for other purposes such as handling the keys for filesystem encryption.
The basic idea is that these keys would be generated by the kernel, and would never be touched by user space in an unencrypted form. Encrypted "blobs" would be provided to user space by the kernel and would contain the key material. User space could store the keys, for example, but the blobs would be completely opaque to anything outside of the kernel. The patches come with two new flavors of these in-kernel keys: trusted and encrypted.
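A hedged user-space sketch of creating such keys through add_key(2) (via libkeyutils); the key names, lengths, and payload strings follow the pattern described above as I understand it and are purely illustrative.

#include <keyutils.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* Ask the kernel to generate a 32-byte trusted key; user space only
         * ever sees the sealed blob, never the raw key material. */
        key_serial_t kmk = add_key("trusted", "kmk", "new 32",
                                   strlen("new 32"), KEY_SPEC_USER_KEYRING);
        if (kmk < 0) {
                perror("add_key(trusted)");
                return 1;
        }

        /* An encrypted key whose material is protected by the trusted key. */
        key_serial_t evm = add_key("encrypted", "evm-key",
                                   "new trusted:kmk 32",
                                   strlen("new trusted:kmk 32"),
                                   KEY_SPEC_USER_KEYRING);
        if (evm < 0) {
                perror("add_key(encrypted)");
                return 1;
        }

        printf("trusted key %d, encrypted key %d\n", kmk, evm);
        return 0;
}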
3 Two ABI troubles
One issue is that an existing ABI should not be changed; the other is whether tracepoints should become part of the ABI.
4 Solid-state storage devices and the block layer
Lots of technical detail, worth a close look; see the original article for the full discussion.
The arrival of SSDs has exposed bottlenecks in the Linux filesystem and block layers.
Before SSDs: most I/O patterns are dominated by random I/O and relatively small requests. Thus, getting the best results requires being able to perform a large number of I/O operations per second (IOPS). With a high-end rotating drive (running at 15,000 RPM), the maximum rate possible is about 500 IOPS. Most real-world drives, of course, will have significantly slower performance and lower I/O rates.
With SSDs: by eliminating seeks and rotational delays, we have gone from hundreds of IOPS to hundreds of thousands of IOPS in a very short period of time.
Improvements:
- First, identify the device as an SSD; since hardware detection is not always reliable, the /sys/block/<device>/queue/rotational flag is provided (a small user-space sketch for reading it appears after this list).
- Eliminate, or at least weaken, queue plugging for SSDs. On a rotating disk, the first I/O operation to show up in the request queue will cause the queue to be "plugged," meaning that no operations will actually be dispatched to the hardware. The idea behind plugging is that, by allowing a little time for additional I/O requests to arrive, the block layer will be able to merge adjacent requests (reducing the operation count) and sort them into an optimal order, increasing performance. Performance on SSDs tends not to benefit from this treatment, though there is still a little value to merging requests. Dropping (or, at least, reducing) plugging not only eliminates a needless delay; it also reduces the need to take the queue lock in the process.
- Improve request timeouts. The old implementation involved a separate timeout for each outstanding request, but that clearly does not scale when the number of such requests can be huge. The answer was to go to a per-queue timer, reducing the number of running timers considerably.
- Turn off the contribution to the entropy pool (queue/add_random). Rotating disks have inherently unpredictable execution times, but SSDs lack mechanical parts moving around, so their completion times are much more predictable. The reason to turn it off is that add_timer_randomness() has to acquire a global lock, causing unpleasant systemwide contention.
- Reduce lock contention. __make_request() is responsible for getting a request (represented by a BIO structure) onto the queue. Two lock acquisitions are required to do this job - three if the CFQ I/O scheduler is in use. Those two acquisitions are the result of a lock split done to reduce contention in the past; that split, when the system is handling requests at SSD speeds, makes things worse. Eliminating it led to a roughly 3% increase in IOPS with a reduction in CPU time on a 32-core system.
- Drop the I/O request allocation batching - a mechanism added to increase throughput on rotating drives by allowing the simultaneous submission of multiple requests.
- Drop the allocation accounting code, which tracks the number of requests in flight at any given time. Counting outstanding I/O operations requires global counters and the associated contention, but it can be done without most of the time.
- Reduce contention by keeping processing on the same CPU as often as possible: the submission of a specific I/O request and that request's completion should happen on the same CPU wherever possible, cutting down on lock bouncing and on slab objects being allocated on one CPU and freed on another. In the networking subsystem this problem has been addressed with techniques like receive packet steering, but block I/O controllers are not able to direct specific I/O completion interrupts to specific CPUs. The solution took the form of smp_call_function(), which performs fast cross-CPU calls. Using smp_call_function(), the block I/O completion code can direct the completion of specific requests to the CPU where those requests were initially submitted (see the second sketch after this list). The result is a relatively easy performance improvement.
- Interrupt mitigation, modeled on NAPI: the blk-iopoll code turns off completion interrupts when I/O traffic is high and uses polling to pick up completed events instead.
- "context plugging," a rework of
the queue plugging code,正在进行的工作. Currently, queue plugging is done
implicitly on I/O submission, with an explicit unplug required at a
later time. The plan is to make plugging and unplugging fully implicit,
but give I/O submitters a way to inform the block layer that more
requests are coming soon. It makes the code more clear and robust; it
also gets rid of a lot of expensive per-queue state which must be
maintained.
- A possible future technique (waiting on hardware support) which still has many open problems: a multiqueue block layer - an idea which, once again, came from the networking layer. The creation of multiple I/O queues for a given device will allow multiple processors to handle I/O requests simultaneously with less contention. It's currently hard to do, though, because block I/O controllers do not (yet) have multiqueue support.
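Two small sketches for the items above. First, reading the rotational flag mentioned in the first item from user space (the device name, e.g. "sda", is supplied by the caller):

#include <stdio.h>

/* Returns 0 for an SSD, 1 for a rotating disk, -1 on error. */
int is_rotational(const char *dev)
{
        char path[256];
        int val = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", dev);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%d", &val) != 1)
                val = -1;
        fclose(f);
        return val;
}

Second, a rough sketch of the completion-steering idea from the same-CPU item, built on smp_call_function_single(); every name here is invented, and this is not the real block-layer code:

#include <linux/smp.h>

struct my_request {
        int submit_cpu;         /* recorded when the request was queued */
        /* ... device-specific fields ... */
};

static void finish_request(struct my_request *rq)
{
        /* end-of-I/O processing: touches the same cache lines and slab
         * objects that were used at submission time */
}

static void run_completion(void *data)
{
        finish_request(data);
}

/* Completion path: steer the final processing back to the submitting CPU. */
void my_complete_request(struct my_request *rq)
{
        if (rq->submit_cpu == smp_processor_id())
                finish_request(rq);
        else
                smp_call_function_single(rq->submit_cpu,
                                         run_completion, rq, 0);
}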