lwn.net kernel news 2012/5 - baozhao - ChinaUnix blog


lwn.net kernel news 2012/5

Category: LINUX

2012-06-22 12:52:28

· A patch set implementing Android-style opportunistic suspend (with a different API) has been merged. Associated with this work is a new epoll flag (EPOLLWAKEUP) which causes a wakeup event to be activated, preventing suspend while an event is available for processing.

· Newly merged work gets the kernel closer to being able to safely run processes as root within a container.

· The tmpfs filesystem now supports hole punching and the SEEK_DATA and SEEK_HOLE lseek() options.

· The removal of old code continues; victims include Microchannel bus support, legacy CRIS RTC drivers, the imxmmc driver, the [...] code, and the [...] mechanism.
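The SEEK_DATA and SEEK_HOLE options mentioned in the tmpfs item can be tried from user space. The following is a minimal sketch, assuming a Unix system whose Python exposes os.SEEK_DATA and os.SEEK_HOLE; the helper names data_extents and make_sparse_file are made up for this illustration. On a filesystem without hole support the whole file is simply reported as one data extent.

```python
import errno
import os
import tempfile


def data_extents(fd):
    """Return the (start, end) byte ranges that the filesystem reports as data."""
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            # Jump to the next byte at or after offset that holds data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # only a hole remains up to end of file
                break
            raise
        # Jump to the next hole after that data (end of file counts as a hole).
        end = os.lseek(fd, start, os.SEEK_HOLE)
        if end <= start:  # defensive: never loop forever on odd results
            break
        extents.append((start, end))
        offset = end
    return extents


def make_sparse_file(directory, hole_bytes=1 << 20, payload=b"x" * 4096):
    """Create a file that starts with a hole of hole_bytes, followed by payload."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.pwrite(fd, payload, hole_bytes)  # writing past the start leaves a hole
    return fd, path
```

Pointing make_sparse_file at a directory on tmpfs (for example /dev/shm on many distributions) and then calling data_extents on the returned descriptor shows whether the leading hole is reported.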

Changes visible to kernel developers include:

The kernel's exception table can now be sorted at build time, speeding the boot process somewhat.
The patch set, designed to make life easier on systems where large chunks of physically-contiguous memory are needed on occasion, has been merged at last.

See Documentation/trace/uprobetracer.txt for details

The perf tool has been enhanced to make working with dynamic user-space tracepoints easy.

Uses of atime: administrators use it to find rarely read mail that can be deleted to free up space. The mutt email client uses atime to determine whether a mailbox contains unread mail. Programs that clean up temporary directories (tmpreaper or tmpwatch, for example) rely on it as well.

Root of the problem: atime is broken. It turns reads into writes and is generally just nasty.

The old problem and its mitigation: writing the last-accessed time ("atime") takes up a lot of I/O bandwidth when lots of files are being read. The worst of the atime-related problems have long since been mitigated by moving to the "relatime" mount option by default; relatime only updates atime a maximum of once per day for unchanging files. But now it seems that atime recording can be especially problematic with the btrfs filesystem, and relatime may not help much.

The new problem: on a filesystem with snapshots (such as btrfs), take a snapshot starting from the root and then grep through the whole filesystem. Every inode has to be updated, and because of copy-on-write (COW) a large amount of space gets used.

Solution: users should explicitly mount their filesystems with the "noatime" option.
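The relatime rule described above can be written down as a small decision function. This is a sketch of the rule as I understand it, not kernel code: the function name and the use of plain Unix timestamps in seconds are assumptions made for illustration.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_needs_update(atime, mtime, ctime, now):
    """Decide whether a read should write a new atime under relatime."""
    # The file was modified since it was last read: record the access.
    if mtime >= atime:
        return True
    # The inode changed since it was last read: record the access.
    if ctime >= atime:
        return True
    # Unchanged file: refresh atime at most once per day.
    return now - atime >= DAY
```

Under this rule a grep over an unchanged tree still dirties every inode whose atime is more than a day old, which is why relatime does not help much in the snapshot case, while noatime avoids the writes entirely.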

Over 2,500 changesets were pulled into the mainline on the first day, and 4,600 have been merged as of this writing. It looks like it will be an interesting cycle with a lot of new stuff coming in and the removal of a bunch of old cruft. As of this writing, user-visible changes pulled for 3.5 include:

A facility useful for the implementation of checkpoint/restart functionality has been merged.
The networking stack has gained support for RFC 5827 early retransmit, a mechanism aimed at speeding recovery from packet loss.
The CoDel queue management algorithm, which, hopefully, will be an important component in the solution to the bufferbloat problem, has been merged.
The seccomp filter mechanism has been merged; it allows processes to reduce the set of available system calls through the use of a mechanism based on the Berkeley packet filter. See Documentation/prctl/seccomp_filter.txt for details.
The Yama security module has two increasingly restrictive modes for controlling access to the PTRACE_ATTACH functionality.
The [...] has been merged.
The NUMA scheduler has been rewritten with the result that it will make different, hopefully better scheduling decisions.
A lot of code has been removed in this development cycle, including the ixp2000 Ethernet driver, support for the sun4c SPARC CPU, the ip_queue netfilter module (superseded by nfnetlink_queue), all support for token ring networking, drivers for all MCA-based network cards, support for the [...] protocol, support for ARMv3 processors, support for Intel IXP2xxx (XScale) processors, support for ST-Ericsson U5500 development boards, the Motorola 68360 serial port driver, and the workqueue tracer.

Nonvolatile memory (NVM) promises bandwidth and latency numbers similar to those offered by dynamic RAM and, being cheaper than DRAM, it is likely to be offered in larger sizes than DRAM is. In addition, its contents would persist across a reboot or a power-down.

Linux may extend its existing memory interfaces to support NVM.

How can NVM be used? As a home for various caches: bcache, the page cache, the inode cache, journals. Vyacheslav Dubeyko wrote about how NVM could eliminate system bootstrap entirely and make the concept of filesystems obsolete; instead, everything would just live in a persistent object environment.

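A historical wart in perf: 4 bytes are wasted in order to preserve the ABI. This is expected to change in 3.6.

Token ring has no users left and will be removed from the kernel.

A solution for uid/gid mismatches between the local host and removable media carrying an ext filesystem (still rare; such media usually use vfat): when a filesystem is mounted using these options, files retain their ownership on disk, but they appear to be owned by the specified user and group. Existing files cannot have their ownership changed, but new files will be created with the user and group given at mount time.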

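First, printk has been converted to records, while the API is stream-oriented, which makes continuation lines hard to handle. One approach is to track where each message comes from (which process) and merge continuation lines from the same process, but that still cannot handle race conditions between threads.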
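The merge-by-source idea for printk continuation lines can be sketched as follows. This is only an illustration in Python, not how the kernel does it: the tuple layout (source, text, is_continuation) and the function name are assumptions.

```python
def merge_continuations(fragments):
    """Merge continuation fragments into the latest record from the same source.

    fragments: iterable of (source, text, is_continuation) tuples, in arrival order.
    Returns a list of (source, merged_text) records in first-seen order.
    """
    records = []
    latest = {}  # source -> index in records of that source's most recent record
    for source, text, is_continuation in fragments:
        if is_continuation and source in latest:
            idx = latest[source]
            records[idx] = (source, records[idx][1] + text)
        else:
            # A new message, or a continuation with nothing to attach to.
            latest[source] = len(records)
            records.append((source, text))
    return records
```

Two processes whose fragments interleave come out as two intact records. Two threads that share one source key would still have their fragments glued together, which is the race mentioned above.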

Second, printk adds timestamps:

[May12 11:27] foo

[May12 11:28] bar

[ +5.077527] zoot

[ +10.235225] foo

[ +0.002971] bar

[May12 11:29] zoot

[ +0.003081] foo

In other words, events that are relatively far apart in time would be marked with the absolute time with one-minute precision. When things happen more closely in time, the elapsed time between successive events would be printed instead.
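A sketch of this timestamp format in Python. When to switch between absolute and relative stamps is not spelled out here, so the sketch assumes one simple rule: print the absolute time whenever an event falls in a different minute than the previous one, and the elapsed time otherwise. The function name and the input format are made up for illustration.

```python
from datetime import datetime, timedelta


def format_log(events):
    """Format (datetime, text) pairs with mixed absolute and relative stamps."""
    lines = []
    prev = None
    for when, text in events:
        minute = when.replace(second=0, microsecond=0)
        if prev is None or minute != prev.replace(second=0, microsecond=0):
            # First event, or a new minute: absolute time, one-minute precision.
            stamp = when.strftime("%b%d %H:%M")
        else:
            # Same minute as the previous event: seconds elapsed since it.
            stamp = " +%.6f" % (when - prev).total_seconds()
        lines.append("[%s] %s" % (stamp, text))
        prev = when
    return lines
```

The month abbreviation comes from strftime and therefore follows the process locale.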

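Bcache is an SSD-based cache that sits between the page cache and the hard disk. It can greatly improve read performance, but caching writes introduces a lot of complexity. In write-through mode bcache brings no corresponding benefit for writes. In write-back mode, a power failure partway through means that, after reboot, data on the SSD that has not yet been written back must be flushed to disk, which requires a lot of complex code. Barriers are also not supported, so journaling filesystems cannot use this feature. Finally, direct I/O would lead to inconsistent data, so the two are mutually exclusive.

The feature is fairly complex, and it may still take some time to get into the mainline.

Well worth a read.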

Kathleen Nichols and Van Jacobson have published a paper describing a new network queue management algorithm that, it is hoped, will play a significant role in the solution to the bufferbloat problem.

One of the key insights in the design of CoDel is that there is only one parameter that really matters: how long it takes a packet to make its way through the queue and be sent on toward its destination. And, in particular, CoDel is interested in the minimum delay time over a time interval of interest. If that minimum is too high, it indicates a standing backlog of packets in the queue that is never being cleared, and that, in turn, indicates that too much buffering is going on. So CoDel works by adding a timestamp to each packet as it is received and queued. When the packet reaches the head of the queue, the time spent in the queue is calculated; it is a simple calculation of a single value, with no locking required, so it will be fast.

Less time spent in queues is always better, but that time cannot always be zero. Built into CoDel is a maximum acceptable queue time, called target; if a packet's time in the queue exceeds this value, then the queue is deemed to be too long. But an overly-long queue is not, in itself, a problem, as long as the queue empties out again. CoDel defines a period (called interval) during which the time spent by packets in the queue should fall below target at least once; if that does not happen, CoDel will start dropping packets. Dropped packets are, of course, a signal to the sender that it needs to slow down, so, by dropping them, CoDel should cause a reduction in the rate of incoming packets, allowing the queue to drain. If the queue time remains above target, CoDel will drop progressively more packets. And that should be all it takes to keep queue lengths at reasonable values on a CoDel-managed node.

The target and interval parameters may seem out of place in an algorithm that is advertised as having no knobs in need of tweaking. What the authors have found, though, is that a target of 5ms and an interval of 100ms work well in just about any setting.
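The mechanism described above can be sketched as a toy queue in Python. This is a simplified illustration, not the kernel's or the authors' implementation: the class and attribute names are made up, time is passed in explicitly as seconds, and the spacing of successive drops (interval divided by the square root of the drop count) is taken from the CoDel paper rather than from the text above.

```python
from collections import deque
from math import sqrt


class CoDelQueue:
    def __init__(self, target=0.005, interval=0.100):
        self.target = target        # acceptable time in queue (5 ms)
        self.interval = interval    # window in which delay must dip below target
        self.queue = deque()        # entries are (enqueue_time, packet)
        self.first_above = None     # deadline for the delay to fall below target
        self.dropping = False
        self.drop_next = 0.0
        self.count = 0              # drops in the current dropping episode
        self.dropped = []

    def enqueue(self, packet, now):
        # Timestamp each packet as it is queued; packets must not be None.
        self.queue.append((now, packet))

    def _pop(self, now):
        """Pop the head packet; return (packet, ok_to_drop)."""
        if not self.queue:
            self.first_above = None
            return None, False
        enqueued, packet = self.queue.popleft()
        if now - enqueued < self.target:
            self.first_above = None          # delay dipped below target
            return packet, False
        if self.first_above is None:
            self.first_above = now + self.interval
            return packet, False
        return packet, now >= self.first_above

    def dequeue(self, now):
        packet, ok_to_drop = self._pop(now)
        if packet is None:
            self.dropping = False
            return None
        if self.dropping:
            if not ok_to_drop:
                self.dropping = False
            while self.dropping and now >= self.drop_next:
                self.dropped.append(packet)
                self.count += 1
                packet, ok_to_drop = self._pop(now)
                if not ok_to_drop:
                    self.dropping = False
                else:
                    # Drop progressively more often while delay stays high.
                    self.drop_next += self.interval / sqrt(self.count)
        elif ok_to_drop:
            # Delay stayed above target for a whole interval: start dropping.
            self.dropped.append(packet)
            packet, _ = self._pop(now)
            self.dropping = True
            self.count = 1
            self.drop_next = now + self.interval
        return packet
```

With a queue that drains faster than target nothing is ever dropped. With a standing backlog the first drop comes about one interval after the delay first exceeds target, and later drops follow at shrinking intervals.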

There is now [...] available.

2 Statistics from the 3.4 development cycle

As of this writing, Linus has merged just over 10,700 changes for 3.4; those changes were contributed from 1,259 developers. The total growth of the kernel source this time around is 215,000 lines.

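The goal is a single kernel that boots on all ARM platforms. To get there, device trees must replace all of the old board files. A lot of progress has been made, but platforms that do not support device trees still have to be kept working.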

board files have a number of tasks:

Define any system-specific functions and setup code.
Create a description of the available peripherals, usually through the definition of a number of platform devices.
Create a special machine description structure that includes a magic number defined for that particular system. That number must be passed to the kernel by the bootloader; the kernel uses it to find the machine description for the specific system being booted.
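The machine-number lookup in the last item can be pictured with a toy model in Python. This is only an illustration of the idea: the machine numbers, board names and device lists below are invented, and the real kernel uses C structures rather than anything like this.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class MachineDesc:
    nr: int                             # magic number passed by the bootloader
    name: str
    init_machine: Callable[[], List[str]]  # registers the board's platform devices


def _init_board_a():
    return ["serial0", "mmc0"]          # hypothetical peripherals


def _init_board_b():
    return ["serial0", "ethernet0", "nand0"]


# One entry per board file compiled into the kernel (all values hypothetical).
MACHINES = {
    desc.nr: desc
    for desc in (
        MachineDesc(9001, "Example Board A", _init_board_a),
        MachineDesc(9002, "Example Board B", _init_board_b),
    )
}


def lookup_machine(nr):
    """Find the machine description matching the bootloader's number."""
    try:
        return MACHINES[nr]
    except KeyError:
        raise LookupError("unrecognized machine number %d" % nr) from None
```

A device tree replaces both the number and the hard-coded device list with a hardware description handed over by the bootloader, which is what makes a single kernel image for many boards possible.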

1 Some useful perf documentation

posted by Google

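Current state: the way that balancing is done in current kernels is relatively straightforward: the active list is not allowed to grow larger than the inactive list. The inactive > active rule is only enforced during reclaim; we don't mind the list sizes on idle systems.

The patch's prospects are unclear: the kernel's radix tree implementation already has a concept of "exceptional entries" that is used to track tmpfs pages while they are swapped out. The patch uses exceptional entries to record when a page was evicted; when a fault occurs, the kernel knows how long ago the page was evicted and can use that time to adjust the sizes of the active and inactive lists.

Network connections can still be restored when [...]. Most of the work is done in user space; a small part needs kernel support.

First seen in [...]

An effort by the kernel to keep user-space programs working. In one sentence: if the ABI is not done right at the start, supporting compatibility later on is painful.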
