
Category: LINUX

2011-11-08 21:24:18

Nothing worth reading in this issue.

1

John Stultz has just proposed an alternative to suspend blockers.

Suspend blockers are a way for either the kernel or user space to tell the system that it is not a good time to suspend. To work properly, suspend blockers must be supported by any device that can wake up the system. Drivers for such devices will, when a wakeup event occurs, acquire a suspend blocker and wake any user-space process waiting on the event; once that process reads the event, the suspend blocker will be released. The key is that said user-space process, if it is sufficiently privileged, can acquire a suspend blocker of its own before reading the event that woke it up.
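As an illustration only, the driver side might look like the sketch below. It assumes the Android-style wakelock API (wake_lock_init(), wake_lock(), wake_unlock()) from the out-of-tree Android patches, not anything in the mainline kernel; the names are hypothetical.

    #include <linux/wait.h>

    /* Assumes the out-of-tree Android wakelock API; illustrative only. */
    static struct wake_lock evdev_lock;
    static wait_queue_head_t read_waitq;    /* readers sleep here */

    static void driver_init(void)
    {
        init_waitqueue_head(&read_waitq);
        wake_lock_init(&evdev_lock, WAKE_LOCK_SUSPEND, "evdev");
    }

    /* Interrupt path: block suspend until user space reads the event. */
    static void on_wakeup_event(void)
    {
        wake_lock(&evdev_lock);
        wake_up_interruptible(&read_waitq);
    }

    /* Called once the event has been consumed by user space. */
    static void on_event_read(void)
    {
        wake_unlock(&evdev_lock);
    }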

 

The original scheme is implicit; the API in this patch is explicit.

His patch adds a new scheduler option:

    sched_setscheduler(0, SCHED_STAYAWAKE, &param);

Any time that a process has been marked in this way, the kernel simply will not suspend the system.
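A minimal user-space sketch of the proposed interface follows; note that SCHED_STAYAWAKE comes from this never-merged patch and will not compile against mainline headers.

    #include <sched.h>

    /* Mark the calling process as needing the system to stay awake.
     * SCHED_STAYAWAKE is from John Stultz's patch, not mainline. */
    int mark_stay_awake(void)
    {
        struct sched_param param = { .sched_priority = 0 };

        return sched_setscheduler(0, SCHED_STAYAWAKE, &param);
    }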

 

1

Control groups are gaining users. For example, control groups help in the implementation of containers by isolating groups of processes from each other and by allowing the imposition of resource limits on each group.

Google uses its own form of containers. Containers let Google place limits on the CPU usage, memory usage, I/O bandwidth consumption, etc. of each group of processes on the system.
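For instance, a group can be created and given a memory limit through the cgroup filesystem. The sketch below assumes the v1 memory controller is mounted at /sys/fs/cgroup/memory; the mount point and group name are examples, not anything Google-specific.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f;

        /* create the group; it appears as a directory in the cgroup fs */
        mkdir("/sys/fs/cgroup/memory/mygroup", 0755);

        /* cap the group's memory usage at 512MB */
        f = fopen("/sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes", "w");
        if (!f)
            return 1;
        fprintf(f, "%llu", 512ULL << 20);
        fclose(f);

        /* move the current process into the group */
        f = fopen("/sys/fs/cgroup/memory/mygroup/tasks", "w");
        if (!f)
            return 1;
        fprintf(f, "%d", getpid());
        fclose(f);
        return 0;
    }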

 

Problems with the memory controller:

• Support for nested control groups. At the moment, Google is using the "fake NUMA" feature to partition system memory and parcel it out as needed. Fake NUMA is a hack, though, with resource costs of its own. They are moving to the kernel's memory controller, but it is not yet suitable for their needs because it cannot work with nested control groups. They had similar problems with the disk bandwidth controller, but that problem was recently solved.

• Accounting. Currently, shared pages are billed to the control group that touches them first. An alternative would be to split the cost proportionally among the sharing groups, but Google would rather manually arrange for pages backed by certain files to be billed to specific groups. Then one could set up a system group to be billed for, say, the C library.

• Missing features: a way to query the size of the working set for each control group (not currently available); per-control-group reclaim, to focus the memory management code on the control groups that are currently exceeding their limits; and, if a container goes so far over its limits that the out-of-memory killer gets involved, a way to kill a whole control group at once instead of having to do it one process at a time.

2

Problems posed by OMAP:

Tomi Valkeinen raised the problems posed by the display system found on OMAP processors. Instead of having a "video card," the OMAP has, on one side, an acceleration engine that can render pixels into main memory and, on the other, a "display subsystem" connecting that memory to the video display. That subsystem consists of a series of overlay processors, each of which can render a window from memory.

OMAP graphics depends on a set of interconnected components. Filling video memory can be done via the framebuffer interface, via the direct rendering (DRM) interface, or, for video captured from a camera, via the Video4Linux2 overlay interface. Video memory must be managed for those interfaces, then handed to the display processors which, in turn, must communicate with the panel controller.

 

Future directions:

• Most developers seem to believe that, over time, DRM should become the interface for mode setting and memory management, while the older framebuffer interface becomes a compatibility layer over DRM until it fades away entirely.

• The Video4Linux2 overlay interface will be rewritten.

• The complexity of video acquisition devices has reached a point where treating them as a single device no longer works well; thus the media controller, which allows user space to query and change the connections between a pipeline of devices. The media controller could be useful for controlling display pipelines as well.

 

3

The purpose of dm-verity is to implement a device mapper target capable of validating the data blocks contained in a filesystem against a list of cryptographic hash values. If the hash for a specific block does not come out as expected, the module assumes that the device has been tampered with and causes the access attempt to fail. At the core of this new facility is a module called dm-bht, which works with a list of block numbers and their associated hash values. This list is organized into a simple tree for quick access to the hashes for arbitrary blocks.
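Conceptually, the per-block check is just a hash comparison. The sketch below uses OpenSSL's SHA-1 purely for illustration; the real module uses the kernel crypto API with a configurable algorithm, and the function name is hypothetical.

    #include <string.h>
    #include <openssl/sha.h>

    /* Return nonzero if the block's hash matches the value stored in
     * the hash tree; a mismatch means the device has been tampered
     * with and the access attempt should fail. */
    int block_is_valid(const unsigned char *block, size_t len,
                       const unsigned char expected[SHA_DIGEST_LENGTH])
    {
        unsigned char digest[SHA_DIGEST_LENGTH];

        SHA1(block, len, digest);
        return memcmp(digest, expected, SHA_DIGEST_LENGTH) == 0;
    }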

 

How it differs from EVM:

EVM requires and uses a trusted platform module (TPM) on the system to be verified; as long as the initial boot step can be secured, dm-verity is able to work without a TPM. It also seems likely that dm-verity will be faster, since it does on-demand verification of blocks; there is no need to verify entire files before the first block can be accessed.

 

Use case: dm-verity will make it easier to create locked-down Linux-based systems that will enforce whatever DRM requirements the movie studios may see fit to impose.

 

1 Kernel development without kernel.org

The kernel.org compromise has disrupted development. The loss of kernel.org has slowed things enough to make it clear that the process has a single point of failure built into it. Whether that is worth fixing is not entirely clear.

2 (Important)

2.1 Proportional rate reduction

Under RFC 3517, the congestion window is cut in half immediately after an occasional packet loss, sharply reducing the amount of data that can be in flight and causing unnecessary latency.

 

The current approach: Linux does not use strict RFC 3517 now; it uses, instead, an enhancement called "rate halving." With this algorithm, the congestion window is not halved immediately. Once the connection goes into loss recovery, each incoming ACK (which will typically acknowledge the receipt of two packets at the other end) will cause the congestion window to be reduced by a single packet. Over the course of one full set of in-flight packets, the window will be cut in half, but the sending system will continue to transmit (at a lower rate) while that reduction is happening. The result is a smoother flow and reduced latency.

 

The improved approach:

The proportional rate reduction algorithm takes a different approach. The first step is to calculate an estimate for the amount of data still in flight, followed by a calculation of what, according to the congestion control algorithm in use, the congestion window should now be. If the amount of data in the pipeline is less than the target congestion window, the system just goes directly into the TCP slow start algorithm to bring the congestion window back up. Thus, when the connection experiences a burst of losses, it will start trying to rebuild the congestion window right away instead of creeping along with a small window for an extended period.

 

If, instead, the amount of data in flight is at least as large as the new congestion window, an algorithm similar to rate halving is used. The actual reduction is calculated relative to the new congestion window, though, rather than being a strict one-half cut. For both large and small losses, the emphasis on using estimates of the amount of in-flight data instead of counting ACKs is said to make recovery go more smoothly and to avoid needless reductions in the congestion window.
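A hedged sketch of the per-ACK calculation, loosely following the algorithm as later published in RFC 6937; the variable names (prr_delivered, prr_out, recover_fs) are illustrative, not the kernel's.

    /* How much data may be sent in response to this ACK during recovery. */
    int prr_sndcnt(int pipe,          /* estimated data in flight */
                   int ssthresh,      /* target congestion window */
                   int prr_delivered, /* data delivered during recovery */
                   int prr_out,       /* data sent during recovery */
                   int recover_fs)    /* in-flight data at recovery start */
    {
        int sndcnt;

        if (pipe > ssthresh) {
            /* proportional reduction: pace sending against deliveries so
             * the window shrinks smoothly toward ssthresh */
            sndcnt = (prr_delivered * ssthresh + recover_fs - 1)
                         / recover_fs - prr_out;
        } else {
            /* pipeline already below the target: slow start back up */
            int grow = prr_delivered - prr_out;
            sndcnt = (ssthresh - pipe < grow) ? ssthresh - pipe : grow;
        }
        return sndcnt > 0 ? sndcnt : 0;
    }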

 

More information can be found in .

 

2.2 TCP fast open

Sending data along with the handshake packets can greatly reduce latency for simple transactions.

The proposed solution:

• Creation of a per-server secret which is hashed with information from each client to create a per-client cookie. That cookie is sent to the client as a special option on an ordinary SYN-ACK packet; the client can keep it and use it for fast opens in the future. The requirement to get a cookie first is a low bar for the prevention of SYN flood attacks.

• API changes. On the client side, the sendto() system call with the new MSG_FAST_OPEN flag is used to request a fast-open connection. On the server side, a setsockopt() call with the TCP_FAST_OPEN option will enable fast opens. (A sketch follows.)
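A sketch of both sides in user space. The proposal's names were MSG_FAST_OPEN and TCP_FAST_OPEN; the code below uses the spellings that were eventually merged into the mainline (MSG_FASTOPEN, TCP_FASTOPEN), and the function names are illustrative.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Client: connect and carry data in the SYN with a single call. */
    ssize_t fast_open_send(int sock, const void *buf, size_t len,
                           const struct sockaddr_in *srv)
    {
        return sendto(sock, buf, len, MSG_FASTOPEN,
                      (const struct sockaddr *)srv, sizeof(*srv));
    }

    /* Server: enable fast opens on a listening socket; qlen bounds the
     * number of pending fast-open requests. */
    int enable_fast_open(int listener)
    {
        int qlen = 16;
        return setsockopt(listener, IPPROTO_TCP, TCP_FASTOPEN,
                          &qlen, sizeof(qlen));
    }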

2.3 Briefly: user-space network queues

This work speeds up local packet processing: a variant of the network channels concept, where packet processing is pushed as close to the application as possible. (Smart NIC features make it possible to bypass the software IRQ path entirely.)

Those who are interested can find the patches on .

3

The ARM problems once again; nothing new here.

4

"Bufferbloat" is the problem of excessive buffering used at all layers of the network, from applications down to the hardware itself. Large buffers can create obvious latency problems.

 

Excessive buffering wrecks the control loop that enables implementations to maximize throughput without causing excessive congestion on the net.

 

The root of the problem:

The initial source of the problem, Jim said, was the myth that dropping packets is a bad thing to do combined with the fact that it is no longer possible to buy memory in small amounts. The truth of the matter is that the timely dropping of packets is essential; that is how the network signals to transmitters that they are sending too much data.

 

How to deal with it: every layer of the network must be addressed.

A real solution to bufferbloat will have to be deployed across the entire net.

There are projects working in this area, and they incorporate a good number of new networking features.

 

1

The conflict between hardware vendors and the kernel community.

In summary: trying to maintain a single driver for multiple operating systems may look like a good idea on the surface. But it is only sustainable in a world where the vendor keeps complete control over the code. Even then, it leads to worse code, duplicated effort, long-term maintenance issues, and more work overall. Linux works best when its drivers are written for Linux and can be fully integrated with the rest of the kernel. The community's developers understand this well; that is why multi-platform drivers have a hard time getting into the mainline.

2

the (simple, with no MMU) processor from Texas Instruments and the (able to rival general-purpose chips) processor from Qualcomm

3 (Important)

Application buffer ==> library buffer ==> kernel buffer ==> disk volatile cache ==> disk stable storage

 

fwrite() and friends write into the library buffer.

fflush() moves the data into the kernel buffers layer.

fsync() saves the data to the stable storage layer.
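Putting the three layers together in code:

    #include <stdio.h>
    #include <unistd.h>

    /* Push buf through all three layers: the library buffer (fwrite),
     * the kernel buffers (fflush), and stable storage (fsync). */
    int write_durably(FILE *fp, const void *buf, size_t len)
    {
        if (fwrite(buf, 1, len, fp) != len)
            return -1;
        if (fflush(fp) != 0)        /* library buffer -> kernel */
            return -1;
        if (fsync(fileno(fp)) != 0) /* kernel -> stable storage */
            return -1;
        return 0;
    }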

There are two flags that can be specified when opening a file to change its caching behavior: O_SYNC (and the related O_DSYNC), and O_DIRECT. I/O operations performed against files opened with O_DIRECT bypass the kernel's page cache, writing directly to the storage. Recall that the storage may itself store the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage. The O_DIRECT flag is only relevant for the system I/O API (the calls that operate on file descriptors, as opposed to the stdio FILE * functions).

 

Raw devices (/dev/raw/rawN) are a special case of O_DIRECT I/O. These devices can be opened without specifying O_DIRECT, but still provide direct I/O semantics. As such, all of the same rules apply to raw devices that apply to files (or devices) opened with O_DIRECT.

 

O_SYNC: File data and all file metadata are written synchronously to disk.

O_DSYNC: Only file data and metadata needed to access the file data are written synchronously to disk.
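For example, a log file can be opened so that each write() returns only once the data itself is on disk; the file name here is an example.

    #include <fcntl.h>

    /* O_DSYNC: the data (and the metadata needed to read it back) is
     * on disk before each write() returns. */
    int open_sync_log(void)
    {
        return open("app.log", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    }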

 

Note: make sure fflush() has been called before calling fsync().

 

When Should You Fsync?

Creating a new file

A newly created file may require an fsync() of not just the file itself, but also of the directory in which it was created (since this is where the file system looks to find your file). This behavior is actually file-system (and mount-option) dependent; calling fsync() on both is the safe approach.

 

Updating an existing file

To keep a failure partway through the update from destroying the data, the following steps are required (a sketch in C follows the list):

  1. create a new temp file (on the same file system!)
  2. write data to the temp file
  3. fsync() the temp file
  4. rename the temp file to the appropriate name
  5. fsync() the containing directory
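A hedged sketch of those five steps; error handling is abbreviated, and the path names are the caller's. Note that errors from earlier buffered writes may only surface at the fsync() or close() calls, which is why their return values are checked.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *path, const char *tmppath,
                     const char *dirpath, const void *buf, size_t len)
    {
        int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644); /* 1 */
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len ||                  /* 2 */
            fsync(fd) != 0 ||                                       /* 3 */
            close(fd) != 0)
            return -1;
        if (rename(tmppath, path) != 0)                             /* 4 */
            return -1;

        int dirfd = open(dirpath, O_RDONLY);                        /* 5 */
        if (dirfd < 0 || fsync(dirfd) != 0)
            return -1;
        return close(dirfd);
    }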

By this measure, a journaling file system still looks like the more dependable option.

 

Checking For Errors

When performing write I/O that is buffered by the library or the kernel, errors may not be reported at the time of the write() or the fflush() call, since the data may only be written to the page cache. Errors from writes are instead often reported during calls to fsync(), msync() or close(). Therefore, it is very important to check the return values of these calls.

 

Write-Back Caches in the Disk

Such a cache is lost upon power failure. However, most storage devices can be configured to run in either a cache-less mode or a write-through caching mode; neither mode will return success for a write request until the request is on stable storage. External storage arrays often have a non-volatile, or battery-backed, write cache.

Some file systems provide mount options to control cache flushing behavior. For ext3, ext4, xfs and btrfs as of kernel version 2.6.35, the mount option is "-o barrier" to turn barriers (write-back cache flushes) on (the default), or "-o nobarrier" to turn barriers off. Previous versions of the kernel may require different options ("-o barrier=0,1"), depending on the file system. Again, the application writer should not need to take these options into account. When barriers are disabled for a file system, it means that fsync calls will not result in the flushing of disk caches. It is expected that the administrator knows that the cache flushes are not required before she specifies this mount option.

 

 

1 The x32 system call ABI

On 64-bit systems, running in full 64-bit mode expands pointers and data values to 64 bits, which leads to expanded memory use and a larger cache footprint, while the existing 32-bit compatibility mode gives up the benefits of the 64-bit architecture.

That best-of-both-worlds situation is exactly what the x32 ABI is trying to provide. A program compiled to this ABI will run in native 64-bit mode, but with 32-bit pointers and data values. The full register set will be available, as will other advantages of the 64-bit architecture like the faster SYSCALL64 instruction. If all goes according to plan, this ABI should be the fastest mode available on 64-bit machines for a wide range of programs; it is easy to see x32 widely displacing the 32-bit compatibility mode.

The x32 ABI is still under discussion; types such as time_t, struct timespec, and struct timeval may be widened to 64 bits to avoid year-2038 problems.

2 No-I/O dirty throttling

 

The current writeback approach: the process responsible for the dirty pages is made to do the "direct reclaim" itself.

One aspect to getting a handle on writeback, clearly, is slowing down processes that are creating more dirty pages than the system can handle. In current kernels, that is done through a call to balance_dirty_pages(), which sets the offending process to work writing pages back to disk. This "direct reclaim" has the effect of cleaning some pages; it also keeps the process from dirtying more pages while the writeback is happening. Unfortunately, direct reclaim also tends to create terrible I/O patterns, reducing the bandwidth of data going to disk and making the problem worse than it was before.

 

The work is still having a hard time getting into the mainline.

The idea is to create a control loop capable of determining how many pages each process should be allowed to dirty at any given time. Processes exceeding their limit are simply put to sleep for a while to allow the writeback system to catch up with them.

 

3 Broadcom's wireless drivers, one year later

Broadcom's developers took a blow: after a year of development, their brcmsmac driver still does not functionally surpass b43, the reverse-engineered driver already in the mainline.
