lwn.net kernel news 2011/10 - baozhao - ChinaUnix blog

lwn.net kernel news 2011/10

Category: LINUX

2011-11-09 11:16:25

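A large number of patches from the rt-tree look likely to get into 3.2 through 3.4.
Per-CPU data: this differs somewhat from the earlier account. The rt-tree's earliest approach was to replace preemption disabling with locks; later it used this_cpu_read() and this_cpu_write(). Both are fast, but they have problems:
The first problem is that there is no locking mechanism built into this API; it does not even disable preemption. A task can therefore end up operating on per-CPU data across CPUs.
The bigger problem, though, is that the API does not in any way indicate where the critical section involving the per-CPU data lies in the code, which makes it hard to add debugging and testing mechanisms.
The new approach has not yet been validated.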

Software interrupts: the original approach was to split software interrupt ("softirq") handling into a separate thread, making it preemptable like everything else. This was later found to add latency to network processing, so the tree went back to the mainline approach. There is now a proposal for a hybrid: thread the other softirqs, but keep the original handling for networking.
Another possibility: moving to threaded handlers in the storage subsystem as well.

Another technical detail:
Sampling of randomness is disabled in the realtime tree; otherwise contention for the entropy pool causes unwanted latencies. As processors add hardware entropy generators, it's not clear that maintaining the software pool has value.

The future

Deadline scheduling.
CPU isolation - the ability to run one or more processors with no clock tick or other overhead.

History: kernel.org was initially set up when Linus moved to California and started working for Transmeta. It was split off as a standalone nonprofit organization in 2002. Running kernel.org had never been anybody's full-time job; it was an all-volunteer effort until 2008, when the Linux Foundation hired John Hawley as its first full-time administrator.
Kernel.org is being rebuilt from the beginning with a much greater separation of services; it will also be moving fully into the Linux Foundation.
For email, only forwarding will be supported. Most mailing lists that were hosted there will move to vger.kernel.org; only a few, like security@kernel.org, will remain. vger is maintained by Red Hat's IT group.
The old kernel.org would automatically sign tarballs and other files after they were uploaded. Developers will now need to sign files, with their own keys, before uploading them.

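Google's tracing requirements: an adjustable buffer size that can hold more data, so that traces can cover a longer period of time; and taking some unnecessary fields out of the event header, things like the preemption count and lock depth.

However, there is still no consensus on the tracing ABI.

A presentation (only partly delivered) on improving kernel error logging; no consensus seems to have been reached there either.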

Choosing a potential control group maintainer so that the subsystem can be better managed and planned; the hope is for widely-used applications for controllers and better interaction with namespaces.

A list of patches (most of them mentioned in earlier kernel news posts) that are in need of review and rework with an eye to eventually merging them. Notably, Wu Fengguang's patch may get into 3.2.

The kernel has gotten smarter about how it uses the congestion_wait() functionality, which is a big hammer to use when trying to control writeback. There will be no more writeback done from direct reclaim, news that was received with applause.

this_cpu_* was strongly criticized (see above). Linus responded that he has no problem with a per-CPU data lock that disappears in mainline kernels and allows verification of locking with lockdep. But, he said, calling it a new big kernel lock is a bit unfair.

A reworked version of the LinSched scheduling simulator was discussed and proposed to be added to the kernel tools/ directory. It may provide the long-sought ability to more reliably test scheduler changes.

In a wide-ranging discussion, various problems in the area of patch review were covered, without any clear conclusion. Surprisingly, market pressure may cause a large amount of Android code to flow into the kernel.

Another issue raised by Ted is ensuring that problems raised in previous reviews have been addressed in a new revision of a patch set. Tools like Gerrit can be most helpful in this regard. With Gerrit, it is possible to look at the changes in a patch over time and to track the comments that were made.

Linus is happy with how things are going, overall, though the growth of complexity in the kernel is somewhat worrisome.

Day 2

· Reports from a large number of minisummits, more on kernel.org security and the web of trust, and regression tracking.

· Shared libraries, failure handling, the media controller, the kernel build and configuration subsystem, and the future of the event itself.

Writing sane shared libraries

Lennart Poettering and Kay Sievers, it seems, have grown tired of dealing with the mess that results from kernel developers trying to write low-level user-space libraries. So they proposed and ran a session intended to convey some best practices. For the most part, their suggestions were common sense:

"Use automake, always." Nobody wants to deal with the details of writing makefiles. Automake is ugly, but the ugliness can be ignored;
Licensing: they recommended using LGPLv2 with the "or any later version" clause.
Never have any global state. Code should also be thread aware, "but not thread-safe." Thread-level locking can create problems at fork() time; it is best avoided, especially in low-level libraries. GCC constructors should be avoided for the same reason.
Files should be opened with O_CLOEXEC, always. There is no telling when another thread might do something and carry off a library's file descriptors with it.
Basic namespace hygiene: no exporting variables to applications, use prefixes on all names, and use versioned symbols for all drop-in library use. It is also best to use naming conventions that application developers will expect.
No structure definitions in header files; they will only cause trouble when the ABI evolves in the future.
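
The O_CLOEXEC advice can be illustrated with a small sketch. It is written in Python purely for brevity (a real low-level library would pass the same flag to open(2) from C), and the helper names open_cloexec and has_cloexec are invented for this example. It opens a file with the flag set atomically and then reads the flag back.

```python
import fcntl
import os
import tempfile


def open_cloexec(path: str) -> int:
    """Open `path` read-only with the close-on-exec flag set atomically."""
    # The kernel applies O_CLOEXEC as part of open() itself, so there is no
    # window in which another thread could fork() and exec() and carry the
    # descriptor off with it.
    return os.open(path, os.O_RDONLY | os.O_CLOEXEC)


def has_cloexec(fd: int) -> bool:
    """Report whether FD_CLOEXEC is set on `fd`."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        fd = open_cloexec(tmp.name)
        try:
            print("close-on-exec set:", has_cloexec(fd))
        finally:
            os.close(fd)
```

Setting the flag afterwards with a separate fcntl() call would leave exactly the race the advice warns about, which is why the flag is passed at open time. Python itself has marked new descriptors non-inheritable by default since 3.4, so the explicit flag here only mirrors what C code has to do by hand.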

1
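The kernel's DM and MD subsystems each have their own complete RAID implementation. Boaz Harrosh's code claims to be a nice, general-purpose RAID library that the existing implementations could make use of, but its future is uncertain.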

2
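An old topic, with a new approach proposed. Neil Brown believes the work should be done by a user-space daemon, with the processes that care about suspend communicating with that daemon. The prospects for this approach are unclear.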

3 Timer slack for slacker developers
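The patch allows a suitably privileged process to set the timer slack value for every process contained within a control group. For now it is unlikely to get into the kernel.
The strongest objection is that this approach indulges poorly written programs and removes the incentive to fix them. In addition, the patch has trouble handling some special programs.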

4
Łukasz Sowa recently proposed a different mechanism to restrict syscalls: a competitor to seccomp that makes use of control groups. Its prospects are unclear.

A container is a way to isolate a group of processes from the rest of a running Linux system. By using namespaces, that group can have its own private view of the OS (though, crucially, sharing the same kernel with whatever else is running), with its own PID space, filesystems, networking devices, and so on.

Problems to be solved: detecting whether a process is running in a child PID namespace, and a number of different things that are not "virtualized" by namespaces, including sysfs, /proc/sys, SELinux, udev, and more. Solving these would make it possible to run a standard distribution unmodified in a container, but one key piece of functionality is still badly missing.

Ted Ts'o questioned the idea, arguing that VMs are already sufficient and that the approach above would only turn containers into something heavyweight.

Biederman replied that many platforms still do not support KVM, so containers do have their uses.

2 Securely deleting files from ext4 filesystems

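The requirement is that the data be thoroughly erased from the disk, rather than merely having its directory entry removed. chattr supports a "secure delete" flag, but most filesystems do not actually honor that flag; the ext4 secure delete patch set provides the "secure delete" functionality.

Implementation: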

· Data blocks are erased synchronously, skipping holes. Normally the blocks are filled with zeros; alternatively, the patch can use the "secure discard" feature supported by some devices (solid-state disks, primarily). Secure discard handles the deletion internally to the device - perhaps just by marking the relevant blocks unreadable until something else overwrites them - eliminating the need to perform extra I/O from the kernel.

· All metadata is cleared: the directory entry and the associated metadata (extended attributes, access control lists, etc.).

· The associated journal: first a synchronous journal flush is performed to ensure that there are no active transactions; then the (now old and unneeded) data in the journal can be cleaned up. The only problem is that the journal does not maintain any association between its internal blocks and the files they belong to in the filesystem. The patch addresses that problem by adding a new data structure mapping between journal blocks and file blocks; the secure deletion code then traverses that structure in search of blocks in need of overwriting.

Some problems remain, so it cannot get into mainline for now.
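
To make the "fill with zeros" step concrete, here is a rough user-space approximation in Python. It is not the ext4 patch: the function names are made up for this sketch, and it only overwrites the file's data in place before unlinking it. It cannot reach copies of the data held in the journal, the metadata listed above, or blocks the device has remapped internally, which is a large part of why the real work has to happen inside the filesystem.

```python
import os
import tempfile


def overwrite_with_zeros(path: str, chunk_size: int = 1 << 20) -> int:
    """Overwrite the current contents of `path` with zeros; return bytes written."""
    size = os.path.getsize(path)
    written = 0
    with open(path, "r+b") as f:
        while written < size:
            n = min(chunk_size, size - written)
            f.write(b"\0" * n)
            written += n
        f.flush()
        os.fsync(f.fileno())  # push the zeros to the device before unlinking
    return written


def naive_secure_delete(path: str) -> None:
    """Zero the file's data, then remove its directory entry."""
    overwrite_with_zeros(path)
    os.remove(path)


if __name__ == "__main__":
    fd, demo_path = tempfile.mkstemp()
    os.write(fd, b"secret" * 1000)
    os.close(fd)
    naive_secure_delete(demo_path)
    print("still exists:", os.path.exists(demo_path))
```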

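btrfsck has still not been released, which keeps btrfs out of production use and has drawn criticism. Chris's reasoning is that it should be released only once it is more reliable, but that is hard to agree with.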

1
The reply posts give a more detailed introduction.
The PaX project's use of plugins: it seems they have developed a set of plugins to help with the task of creating a more secure kernel, as seen in the grsecurity "test" patch set.

Use of plugins in this way allows significant changes to be made to the kernel without actually having to change the code.

Structures containing only function pointers are made const, regardless of whether they are declared that way. Of course, it turns out that this is the wrong thing to do in a number of cases, so the developers had to create a no_const attribute and use it in some 180 places in their patch.
A histogram of the distribution of sizes passed to kmalloc() is generated; it's not clear (to your editor) what use is made of that information.
Some fairly sophisticated tweaks to the generated assembly are made for AMD processors to improve the prevention of the execution of kernel data.
Instrumentation is inserted to track kernel stack usage.

As for the first item, I think the kernel should simply be given one big patch.

The return of the other trees is waiting for the relevant developers to reestablish their access to the site - a process that involves developers verifying the integrity of their own systems, then generating a new key, integrating it into the web of trust, and forwarding the public key to the kernel.org maintainers.

The site's administrators have already announced that shell accounts will not be returning to the systems where git trees are hosted.

hrtimer users can usually accept a wakeup within a specific range of times, though. To take advantage of that fact, the kernel offers "range hrtimers" with both soft (earliest) and hard (latest) deadlines. With range hrtimers, the kernel can coalesce wakeup events, minimizing the number of interrupts and reducing power usage.
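
A toy model shows why accepting a range helps. The sketch below is not kernel code and does not reflect how the hrtimer subsystem actually picks wakeup times; it simply treats each timer as a (soft, hard) window and greedily chooses the fewest wakeup instants such that every timer fires inside its own window. The name coalesced_wakeups is invented for the example.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (soft_expiry, hard_expiry), with soft <= hard


def coalesced_wakeups(timers: List[Window]) -> List[int]:
    """Return a minimal list of wakeup times covering every timer's window."""
    wakeups: List[int] = []
    # Visit timers in order of their hard deadline, the latest legal moment.
    for soft, hard in sorted(timers, key=lambda t: t[1]):
        if wakeups and soft <= wakeups[-1]:
            # The previous wakeup already falls inside [soft, hard].
            continue
        wakeups.append(hard)
    return wakeups


if __name__ == "__main__":
    ranged = [(10, 20), (15, 25), (18, 30), (40, 50)]
    exact = [(10, 10), (15, 15), (18, 18), (40, 40)]
    print(coalesced_wakeups(ranged))  # [20, 50]
    print(coalesced_wakeups(exact))   # [10, 15, 18, 40]
```

With some slack on each timer, four timers need only two wakeups; with no slack at all, every timer forces its own interrupt.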

The current problem: One would think that, once the hrtimer code starts running in response to a timer interrupt, it would make sense to run every timer event whose soft expiration time has passed. But that is not what current kernels do. At present only the hard expiration is guaranteed; soft-expired timers are merely coalesced where possible, because the red-black tree is organized by hard expiration and there is no quick way to find every timer whose soft expiration has passed. A patch from Venkatesh Pallipadi changes that behavior.

The solution: running timers as early as possible is not necessarily a good thing (under heavy load, for example). Venkatesh's patch avoids that issue by only performing the greedy hrtimer walk if the CPU is idle when the timer interrupt happens. If work is being done, soft-expired timers that are not immediately accessible are left in the tree, but, if the CPU has nothing better to do, it performs the full search.

The greedy hrtimer walk patch turns the hrtimer tree into an augmented red-black tree; each node then stores the earliest soft expiration time to be found at that level of the tree or below.
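
The augmentation can be sketched as follows. This is an illustration under simplifying assumptions, not the patch itself: it uses a plain unbalanced binary search tree instead of a red-black tree, and the names Node, insert and soft_expired are invented. The tree is ordered by hard expiration, each node caches the earliest soft expiration in its subtree, and the walk skips any subtree whose cached value is still in the future.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    soft: int                      # soft (earliest) expiration
    hard: int                      # hard (latest) expiration, the tree key
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    min_soft: int = 0              # earliest soft expiration in this subtree

    def __post_init__(self) -> None:
        self.min_soft = self.soft


def insert(root: Optional[Node], node: Node) -> Node:
    """Insert `node` keyed by hard expiration, keeping min_soft up to date."""
    if root is None:
        return node
    if node.hard < root.hard:
        root.left = insert(root.left, node)
    else:
        root.right = insert(root.right, node)
    root.min_soft = min(root.min_soft, node.min_soft)
    return root


def soft_expired(root: Optional[Node], now: int) -> List[str]:
    """Collect timers with soft <= now, pruning subtrees that cannot match."""
    if root is None or root.min_soft > now:
        return []
    found = soft_expired(root.left, now)
    if root.soft <= now:
        found.append(root.name)
    found.extend(soft_expired(root.right, now))
    return found


if __name__ == "__main__":
    root = None
    for name, soft, hard in [("a", 50, 100), ("b", 10, 120), ("c", 90, 95), ("d", 30, 200)]:
        root = insert(root, Node(soft, hard, name))
    print(soft_expired(root, 40))  # ['b', 'd']
```

A real red-black tree would also have to repair min_soft during rotations and removals, which this sketch leaves out.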

4 (important)

A patch set from Google's Michel Lespinasse; it is very likely to be merged.

The current problem: The memory controller can put limits on how much memory each group of processes can use, but it is unable to automatically vary those limits in response to the actual need shown by those groups.

Goal:

Google would like to get a better handle on how much memory each group actually needs so that the limits can be adjusted on the fly - responding to changes in load. Michel's patch set tries to track the number of idle pages in each control group.

Platform: 64-bit only (because page flags are limited)

Implementation:

· A new kernel thread running under the name kstaled. Its job is to scan through all of memory (once every two minutes by default) and count the number of pages that have not been referenced since the previous scan. Such pages are deemed to be idle; each one is traced back to its owning control group and that group's statistics are adjusted.

· The patch introduces the notion of "stale" pages: a page is stale if it is clean and if it has been idle for more than a given (administrator-defined) number of scan cycles. The presence of stale pages indicates that a control group is not under serious memory pressure. If that control group's memory needs suddenly increase, though, the kernel will start reclaiming those stale pages. So a sudden drop in the number of stale pages is a good indication that something has changed.
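
The bookkeeping described in these two items can be modeled in a few lines. This is a toy simulation, not the kstaled code: the Page class, its fields, and the scan function are all assumptions made for illustration, and real pages carry this state in page flags rather than Python attributes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    cgroup: str               # owning control group
    dirty: bool = False
    referenced: bool = False  # set when the page is touched, cleared by a scan
    idle_scans: int = 0       # consecutive scans without a reference


def scan(pages: List[Page], stale_after: int) -> Dict[str, Counter]:
    """Run one scan pass and return per-cgroup idle and stale page counts."""
    stats = {"idle": Counter(), "stale": Counter()}
    for page in pages:
        if page.referenced:
            # Touched since the previous scan: not idle, restart its count.
            page.referenced = False
            page.idle_scans = 0
            continue
        page.idle_scans += 1
        stats["idle"][page.cgroup] += 1
        # Stale = clean and idle for more than `stale_after` scan cycles.
        if not page.dirty and page.idle_scans > stale_after:
            stats["stale"][page.cgroup] += 1
    return stats


if __name__ == "__main__":
    pages = [Page("web"), Page("web", dirty=True), Page("batch", referenced=True)]
    for cycle in range(2):
        result = scan(pages, stale_after=1)
        print(cycle, dict(result["idle"]), dict(result["stale"]))
```

Watching the per-group stale count from one scan to the next is what gives the "sudden drop" signal mentioned above.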
