lwn.net kernel news 2011/5 - baozhao - ChinaUnix blog

Category: Linux

2011-10-19 16:22:07

Wireless networking hacker Luis Rodriguez has put together material aimed at developers writing and supporting wireless drivers.

2
Integrating it with perf/ftrace met strong opposition from Peter Zijlstra and Thomas Gleixner; it is unlikely to reach mainline.

3

There are two new POSIX clock types: CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM; they can be used to set timers that will wake the system from a suspended state.
The has been added to the network stack.
The has been merged; only x86-64 is supported for now.
There is a new networking system call:

int sendmmsg(int fd, struct mmsghdr *mmsg, unsigned int vlen, unsigned int flags);

It is the counterpart to recvmmsg(), allowing a process to send multiple messages with a single system call.
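As a rough illustration of batching, the sketch below sends several short datagrams with one sendmmsg() call through the glibc wrapper on Linux. The helper name send_batch and the MAX_BATCH cap are made up for this example, and the socket is assumed to be already connected.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send `count` text messages on an already-connected datagram socket
 * with a single sendmmsg() call. Returns the number of messages sent,
 * or -1 on error (errno is set). */
static int send_batch(int fd, const char *texts[], unsigned int count)
{
    enum { MAX_BATCH = 16 };            /* arbitrary cap for this sketch */
    struct mmsghdr msgs[MAX_BATCH];
    struct iovec iov[MAX_BATCH];

    assert(count <= MAX_BATCH);
    memset(msgs, 0, sizeof(msgs));
    for (unsigned int i = 0; i < count; i++) {
        /* One iovec per message; each message carries one string. */
        iov[i].iov_base = (void *)texts[i];
        iov[i].iov_len = strlen(texts[i]);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs, count, 0);
}
```

On success, the kernel also fills in msg_len for each entry with the number of bytes sent for that message.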

The ICMP sockets feature has been merged; its main purpose is to allow unprivileged programs to send echo-request datagrams.
Two new sysctl knobs allow the capabilities given to user-mode helpers invoked by the kernel to be restricted.
The tmpfs filesystem has gained support for extended attributes.
The Xen block backend driver (allowing guests to export block devices to other guests) has been merged.
Prefetching is no longer used in linked list and hlist traversal.
There is a new strtobool() function for turning user-supplied strings into boolean values:

int strtobool(const char *s, bool *res);

Anything starting with one of [yY1] is considered to be true, while strings starting with one of [nN0] are false; anything else gets an -EINVAL error.
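The rule is simple enough to model in user space. The following is only a sketch of the behavior described above, not the kernel source; the name strtobool_sketch is invented to avoid suggesting otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model of the described rule: only the first character
 * of the string is examined. */
static int strtobool_sketch(const char *s, bool *res)
{
    assert(s != NULL && res != NULL);
    switch (s[0]) {
    case 'y': case 'Y': case '1':
        *res = true;
        return 0;
    case 'n': case 'N': case '0':
        *res = false;
        return 0;
    default:
        return -EINVAL;                 /* includes the empty string */
    }
}
```

Under this model, "yes" and "1" both yield true, "no" and "0" yield false, and something like "maybe" returns -EINVAL and leaves *res untouched.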

There is a whole series of new functions for converting user-space strings to kernel-space integer values; all follow this pattern:

int kstrtol_from_user(const char __user *s, size_t count, unsigned int base, long *res);

These functions take care of safely copying the string from user space and performing the integer conversion.
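The "copy safely, then convert" pattern can be imitated in user space. In this sketch, parse_long_bounded is a hypothetical analogue: src stands in for the untrusted user buffer, memcpy() stands in for the kernel's copy_from_user(), and the 32-byte local buffer is an arbitrary choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* `src` is `count` bytes long and is not assumed to be NUL-terminated. */
static int parse_long_bounded(const char *src, size_t count,
                              unsigned int base, long *res)
{
    char buf[32];
    char *end;
    long val;

    assert(src != NULL && res != NULL);
    if (count == 0 || count >= sizeof(buf))
        return -EINVAL;                 /* refuse empty or oversized input */
    memcpy(buf, src, count);            /* bounded copy into our own buffer */
    buf[count] = '\0';

    errno = 0;
    val = strtol(buf, &end, (int)base);
    if (end == buf)
        return -EINVAL;                 /* no digits at all */
    if (*end == '\n')
        end++;                          /* tolerate one trailing newline */
    if (*end != '\0')
        return -EINVAL;                 /* trailing garbage */
    if (errno == ERANGE)
        return -ERANGE;                 /* value does not fit in a long */
    *res = val;
    return 0;
}
```

The point of the design is that the conversion only ever sees a private, terminated copy, so the caller cannot change the string under the parser or run it past the end of the buffer.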

The kernel has a new generic binary search function:

void *bsearch(const void *key, const void *base, size_t num, size_t size, int (*cmp)(const void *key, const void *elt));
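For readers who want to see the shape of such a function, here is a user-space sketch that follows the signature above. It is not the kernel's implementation; bsearch_sketch and cmp_int are names chosen for this example, and the array is assumed to be sorted consistently with cmp.

```c
#include <assert.h>
#include <stddef.h>

/* Binary search over a sorted array of `num` elements of `size` bytes. */
static void *bsearch_sketch(const void *key, const void *base, size_t num,
                            size_t size,
                            int (*cmp)(const void *key, const void *elt))
{
    size_t lo = 0, hi = num;            /* search the half-open range [lo, hi) */

    assert(size > 0 && cmp != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const char *elt = (const char *)base + mid * size;
        int r = cmp(key, elt);

        if (r == 0)
            return (void *)elt;         /* found */
        if (r < 0)
            hi = mid;                   /* key sorts before elt */
        else
            lo = mid + 1;               /* key sorts after elt */
    }
    return NULL;                        /* not present */
}

/* Example comparison callback for arrays of int. */
static int cmp_int(const void *key, const void *elt)
{
    int k = *(const int *)key, e = *(const int *)elt;
    return (k > e) - (k < e);
}
```

Note that the callback receives the key first and the array element second, so the two arguments do not have to be of the same type.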

The use of threads for the handling of interrupts on specific lines can be controlled with irq_set_thread() and irq_set_nothread().
The has been merged.
The function tracer can now support multiple users with each tracing a different set of functions.
The alarm timer mechanism - which can set timers that fire even if the system is suspended - has been merged.
An object passed to kfree_rcu() will be handed to kfree() after the next read-copy-update grace period. There are a lot of RCU callbacks which only call kfree(); it should be possible to replace those with kfree_rcu() calls.
The -Os (optimize for size) option is no longer the default for kernel compiles; the associated costs in code quality were deemed to be too high.
The first rounds of ARM architecture cleanup patches have gone in. A number of duplicated functionalities have been consolidated, and support for a number of (probably) never-used platform and board configurations has been removed.
The W= parameter to kernel builds now takes values from 1 to 3.

4
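Despite the opposition it ran into last week, this work goes further in preventing kernel address leaks. Dan's work:
The default values of the kptr_restrict and dmesg_restrict sysctls are set to (1) when this is enabled, since hiding kernel pointers is necessary to preserve the secrecy of the randomized base address. Its prospects are unclear; personally, I think it is worth doing.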

5
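Andi Kleen may have been the first to question this optimization when he proposed removing it last September. Testing by Linus and Ingo showed that Andi was right.

The prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September.

On modern processors, hardware optimization often does better than software optimization. Linked list and hash table traversals, for example, frequently prefetch a NULL pointer (a pointless cost of about 20 cycles). In the other cases, prefetching does not necessarily come out ahead either.

The conclusion is: prefetches are absolutely toxic, even if the NULL ones are excluded.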
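To make the change concrete, the two traversal styles look roughly like this in user space. This is only a sketch: struct node and both function names are invented, and __builtin_prefetch() is the GCC/Clang builtin standing in for the kernel's prefetch().

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Old style: hint the next node before working on the current one.
 * On the last node, n->next is NULL, so that hint is wasted. */
static long sum_with_prefetch(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        __builtin_prefetch(n->next);
        sum += n->value;
    }
    return sum;
}

/* New style: plain traversal, leaving prefetching to the hardware. */
static long sum_plain(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}
```

Both functions return the same result; only the memory hints differ. Short lists, such as typical hash buckets, hit the NULL case most often relative to the useful work done.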

1 launches another new filesystem HAMMER2

2
For safety, a patch was applied to censor kernel addresses appearing in /proc/kallsyms and /proc/modules; the addresses are shown as 0.

Anybody wanting pointer hiding should turn it on by setting the kernel.kptr_restrict knob to 1.

3 (important)

Current state:

Current kernels actually maintain five LRU lists. There are separate active and inactive lists for file-backed pages. There are separate active and inactive lists for anonymous pages - reclaim policy for those pages is different, and, if the system is running without swap, they may not be reclaimable at all. There is also a list for pages which are known not to be reclaimable - pages which have been locked into memory, for example. Oh, and it's only fair to say that one set of those lists exists for each memory zone.

What the memory control group implementation adds:

The memory controller adds a new page_cgroup structure for each page. When memory control groups are active, there is another complete set of LRU lists maintained for each group. The list_head structures needed to maintain these lists are kept in the page_cgroup structure.

Resulting problems: Global reclaim uses the global LRU as always, so it operates in complete ignorance of control groups. It will reclaim pages regardless of whether those pages belong to groups which are over their limits or not. Per-control-group reclaim, instead, can only work with one group at a time; as a result, it tends to hammer certain groups while leaving others untouched. The multiple LRU lists are not just complex, they are also expensive. A list_head structure is 16 bytes on a 64-bit system. If that system has 4GB of memory, it has 1,000,000 pages, so 16 million bytes are dedicated just to the infrastructure for the per-group LRU lists.
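The memory cost can be checked with a few lines of arithmetic. The sketch below assumes 4 KiB pages, which the text does not state; with that assumption 4 GiB holds 1,048,576 pages (rounded to 1,000,000 above), and one two-pointer list head per page comes to 16 MiB on a 64-bit system. The struct and function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's struct list_head: two pointers. */
struct list_head_sketch {
    struct list_head_sketch *next, *prev;
};

/* Bytes spent on one list head per page, for a given amount of RAM. */
static uint64_t lru_link_overhead(uint64_t ram_bytes, uint64_t page_size)
{
    assert(page_size > 0);
    uint64_t pages = ram_bytes / page_size;
    return pages * sizeof(struct list_head_sketch);
}
```

For example, lru_link_overhead(4ULL << 30, 4096) gives the per-group LRU linkage cost for a 4 GiB machine.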

The latest patches:

The patches from Johannes Weiner represent an attempt to create that improvement by better integrating the memory controller with the rest of the virtual memory subsystem. At the core of this work is the elimination of the duplicated LRU lists. In particular, with this patch set, the global LRU no longer exists - all pages exist on exactly one per-group LRU list. Pages which have not been charged to a specific control group go onto the LRU list for the "root" group at the top of the hierarchy. In essence, per-group reclaim takes over the older global reclaim code; even a system with control groups disabled is treated like a system with exactly one control group containing all running processes.

Algorithms for memory reclaim necessarily change in this environment. The core algorithm now performs a depth-first traversal through the control group hierarchy, trying to reclaim some pages from each. There is no global aging of pages; each group has its oldest pages considered for reclaim regardless of what's happening in the other groups. Each group's hard and soft limits are considered, of course, when setting reclaim targets. The end result is that global reclaim naturally spreads the pain across all control groups, implementing each group's policy in the process. The implementation of control group soft limits has been integrated with this mechanism, so now soft limit enforcement is spread more fairly across all control groups in the system.
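A toy model may help picture the depth-first walk. Everything here is invented for illustration (struct group, its fields, and the target policy); it is not taken from the patch set. The assumed policy is that every group gives up per_group pages, or its whole excess over its limit when that is larger, so the pain is spread across all groups while over-limit groups pay more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a memory control group. */
struct group {
    const char *name;
    size_t usage;              /* pages currently charged to this group */
    size_t limit;              /* pages the group is allowed to hold */
    struct group *child;       /* first child in the hierarchy */
    struct group *sibling;     /* next group at the same level */
};

/* Depth-first walk over the hierarchy starting at `g` and its siblings.
 * Returns the total number of pages reclaimed. */
static size_t reclaim_hierarchy(struct group *g, size_t per_group)
{
    size_t total = 0;

    for (; g != NULL; g = g->sibling) {
        size_t excess = g->usage > g->limit ? g->usage - g->limit : 0;
        size_t target = excess > per_group ? excess : per_group;
        size_t take = target < g->usage ? target : g->usage;

        g->usage -= take;      /* stand-in for scanning the group's own LRU */
        total += take;
        total += reclaim_hierarchy(g->child, per_group);   /* descend first */
    }
    return total;
}
```

With a root group (usage 100, limit 1000) holding children a (usage 50, limit 20) and b (usage 5, limit 10), a pass with per_group set to 10 takes 10 pages from root, 30 from a, and 5 from b.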

Johannes's patch improves the situation while shrinking the code by over 400 lines; it also gets rid of the memory cost of the duplicated LRU lists.

4
A high-level overview, without the details.
Code organization

Drivers:
“Not in the arch/arm directory!” Drivers should move to the appropriate subdirectory of the top-level drivers tree.

The following three have become generic code and moved out of the ARM tree:

But what about non-driver code? Where should it live? It is helpful to look at several examples: (1) the struct clk code that Jeremy Kerr, Russell King, Thomas Gleixner, and many others have been working on, (2) the device-tree code that Grant Likely has been leading up, and (3) the generic interrupt chip implementation that Thomas Gleixner has been working on.

The struct clk code is motivated by the fact that many SoCs and boards have elaborate clock trees. These trees are needed, among other things, to allow the tradeoff between performance and energy efficiency to be set as needed for individual devices on that SoC or board. The struct clk code allows these trees to be represented with a common format while providing plugins to accommodate behavior specific to a given SoC or board. The generic interrupt chip implementation has a similar role, but with respect to interrupt distribution rather than clock trees.

Device trees are intended to allow the hardware configuration of a board to be represented via data rather than code, which should ease the task of creating a single Linux kernel binary that boots on a variety of ARM boards.

The struct clk code is already used by both the ARM and SH CPU architectures, so it is not ARM-specific, but rather core Linux kernel code. Similarly, the device-tree code is not ARM-specific; it is also used by the PowerPC, Microblaze, and SPARC architectures. Device tree therefore is also Linux core kernel code. The virtual-interrupt code goes even further, being common across all CPU architectures.

ARM-specific code:
There will of course need to be at least some ARM-specific code, but the end goal is for that code to be limited to ARM core architecture code and ARM SoC core architecture code. Furthermore, the ARM SoC core architecture code should consist primarily of small plugins for core-Linux-kernel frameworks, which should in turn greatly ease the development and maintenance of new ARM boards and SoCs.

Git trees

There are currently two trees:
Nicolas's existing git tree allows people to easily pull the latest and greatest ARM code against the most recent mainline kernel version. In contrast, a second tree contains patches that are to be upstreamed, normally based on a more-recent mainline release candidate.

Concept:

The platform problem comes about when developers view the platform they are developing for as fixed and immutable. These developers feel that the component they are working on specifically (a device driver, say) is the only part that they have any control over. If the kernel somehow makes their job harder, the only alternatives are to avoid or work around it.

With free software, we can change the modules that get in our way instead of adapting to them, but this is hard for driver developers to do.

We can reach across arbitrary module boundaries and fix problems encountered in other parts of the system. We don't have to put up with bugs or inadequate features in the code we use; we can make it work properly instead.

Coreboot (formerly LinuxBIOS) is a free BIOS implementation; AMD will support it on a range of its processors.

2
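The ABI debate again: a field that is no longer useful cannot be removed, because doing so would break powertop.
Steven Rostedt argues that tools should use metadata such as tracing/events/sched/sched_switch/format to locate fields, rather than being dependent on the binary format of the trace data exported from the kernel.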
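The metadata approach can be sketched as follows: instead of hard-coding where a field sits in the binary record, a tool reads the event's format file and looks the field up by name. The helper parse_field_line is hypothetical, and the sample line in the note below (including the offset 24) is illustrative rather than copied from a particular kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up field `name` in one line of a tracing "format" file and
 * report where it lives in the binary record.
 * Returns 0 on success, -1 if the line does not describe `name`. */
static int parse_field_line(const char *line, const char *name,
                            unsigned int *offset, unsigned int *size)
{
    char decl[128];            /* declaration text, e.g. "pid_t prev_pid" */
    unsigned int off, sz;
    const char *last;

    assert(line && name && offset && size);
    if (sscanf(line, " field:%127[^;]; offset:%u; size:%u;",
               decl, &off, &sz) != 3)
        return -1;

    /* The field name is the last word of the declaration. */
    last = strrchr(decl, ' ');
    last = last ? last + 1 : decl;
    if (strcmp(last, name) != 0)
        return -1;

    *offset = off;
    *size = sz;
    return 0;
}
```

Given a line such as "field:pid_t prev_pid; offset:24; size:4; signed:1;", a lookup for prev_pid yields offset 24 and size 4. Array fields, whose declarations end in something like "name[16]", would need extra handling that this sketch leaves out.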

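Steven Rostedt and Ingo had a debate: Ingo thinks ftrace should be dropped in favor of perf alone, while Steven insists on keeping it. Ingo has a knack for picking the ripe fruit.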

3 2.6.39 development statistics
There have been just over 10,000 non-merge changesets merged for 2.6.39; with the sole exception of 2.6.37 (11,446 changesets), that's the highest since 2.6.33. Those changes came from 1,236 developers; only 2.6.37 (with 1,276 developers) has ever exceeded that number. Those developers added 670,000 lines of code while deleting 346,000 lines, for a net growth of 324,000 lines.

4 (important)
Background:
The writeback code, when it gets around to that page, will mark the page read-only, set the "under writeback" page flag, and queue the I/O operation. The write-protection of the page is not there to prevent changes to the page; its purpose is to detect further writes which would require that another writeback be done. Current kernels will, in most situations, allow a process to modify a page while the writeback operation is in progress.

In the worst case, the second write to the page will happen before the first writeback I/O operation begins; in that case, the more recently written data will also be written to disk in the first I/O operation and a second, redundant disk write will be queued later. Either way, the data gets to its backing store, which is the real intent.

How the bug arises: Some devices can perform integrity checking, meaning that the data written to disk is checksummed by the hardware and compared against a pre-write checksum provided by the kernel. If the data changes after the kernel calculates its checksum, that check will fail, causing a spurious write error. RAID has a similar problem.
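The failure mode is easy to reproduce in miniature. In this sketch, toy_checksum is a simple Fletcher-style sum chosen only for brevity (real devices use their own guard formats), and device_accepts is an invented stand-in for the hardware-side comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy checksum over a buffer; not the format any real device uses. */
static uint16_t toy_checksum(const unsigned char *data, size_t len)
{
    uint16_t a = 0, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (uint16_t)((a + data[i]) % 255);
        b = (uint16_t)((b + a) % 255);
    }
    return (uint16_t)((b << 8) | a);
}

/* What the device does when the data arrives: recompute and compare
 * against the checksum the kernel calculated earlier. */
static int device_accepts(const unsigned char *data, size_t len,
                          uint16_t guard)
{
    return toy_checksum(data, len) == guard;
}
```

If the checksum is taken over a page, and a process then changes a single byte of that page before the I/O happens, device_accepts() returns 0 even though nothing is wrong with the disk or the data path.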

Solutions:

An earlier approach appeared in February: in situations where integrity checking was in use, the kernel would make a copy of each page before beginning a writeback operation. The overhead, however, is fairly high.
With the newer approach, any attempt to write to a page which is under writeback will simply wait until the writeback completes. There is no need to copy pages or engage in other tricks; the cost is a 10-15% performance drop for applications which repeatedly write to the same part of a file.

Implementation details:
A page will be marked read-only when it is written back; there is also a page flag which indicates that writeback is in progress. So all of the pieces are there to trap writes to pages under writeback. To make it even easier, the VFS layer already has a callback (page_mkwrite()) to notify filesystems that a read-only page is being made writable.

Limiting dcache usage per control group; not yet in mainline.

Pavel Emelyanov's patches are a first attempt at limiting dentry use. This patch works by organizing dentries into "mobs," being groups of dentries all of which represent names in a specific subtree of the filesystem.

Background:

The performance monitoring unit (PMU) is normally associated with the CPU; each processor has its own PMU for monitoring its own specific events. Some newer processors (such as Intel's Nehalem series) also provide a PMU which is not tied to any CPU; in the Nehalem case it's part of the "uncore" which handles memory access at the package level.

Question: should raw events like the following (magic, CPU- and model-specific incantations) be supported?

perf stat -e r1b7:20ff -a sleep 1

Conclusion: raw events will continue to be supported. “The kernel has no business telling users which perf events are interesting, or limiting them!” Whether "generalized events" should be handled in user space or in the kernel is an interesting question.

Sandboxing processes such that they cannot make "dangerous" system calls is an attractive feature that has already been implemented in a limited way for Linux with seccomp. Two years ago, we covered a proposal to allow more fine-grained control over which system calls would be allowed.

Seccomp (from "secure computing") is enabled via a prctl() call and, once enabled, restricts the process from making any further system calls beyond read(), write(), exit(), and sigreturn()—any other system call will abort the process.

Progress: a patch from Will Drewry aims to make the interface even more flexible by allowing filters to be applied to the system call arguments. Essentially, that would make for three choices for each system call: enabled, disabled, or filtered.
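The three-way choice can be pictured as a small policy table. This is a conceptual sketch only: the types, names, and the use of a C callback as the filter are all invented here and say nothing about how the actual patch expresses its filters. The system call numbers in any table built from it are likewise arbitrary.

```c
#include <assert.h>
#include <stddef.h>

enum policy { SC_DISABLED, SC_ENABLED, SC_FILTERED };

/* Hypothetical rule: a policy plus, for SC_FILTERED, a predicate that
 * inspects the call's first argument. */
struct rule {
    long nr;                            /* system call number */
    enum policy policy;
    int (*arg_ok)(long arg0);           /* only used when filtered */
};

/* Returns 1 if the call may proceed, 0 if it must be refused.
 * Calls without a matching rule are refused. */
static int call_allowed(const struct rule *rules, size_t n,
                        long nr, long arg0)
{
    for (size_t i = 0; i < n; i++) {
        if (rules[i].nr != nr)
            continue;
        switch (rules[i].policy) {
        case SC_ENABLED:
            return 1;
        case SC_FILTERED:
            return rules[i].arg_ok != NULL && rules[i].arg_ok(arg0);
        default:
            return 0;
        }
    }
    return 0;
}

/* Example predicate: only allow descriptors 1 and 2 (stdout, stderr). */
static int fd_is_std_output(long fd)
{
    return fd == 1 || fd == 2;
}
```

A filtered rule for a write-like call using fd_is_std_output would let a sandboxed process print output while refusing writes to any other descriptor.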

The Clang static analyzer is powerful and can help find some bugs.

The vast majority of GCC extensions are supported by Clang. There are some GCC extensions that aren't implemented, however, including explicit register variables. A bigger problem is that Clang lacks support for variable-length arrays in structures (VLAIS).

Code generation and optimization problems

There are several code generation and optimization options for GCC that aren't supported by Clang. One of those is -mregparm that governs the number of registers used to pass integer arguments.

Also, -fcall-saved-reg is not supported by Clang, and that affects the uses of the ALTERNATIVE() macro in the kernel.

Another problem is with -pg, which enables instrumentation code for function calls in GCC, and is used when building Ftrace.

The final problem that Lelbach mentioned is the -fno-optimize-sibling-calls flag that is not supported by Clang. The flag disables tail call elimination, and the kernel introspection code (like Ftrace) assumes specific stack depths in various places.
