lwn.net kernel news 2010/4 - baozhao - ChinaUnix blog

lwn.net kernel news 2010/4

Category: LINUX

2011-02-21 23:36:42

1 CPUS*PIDS = mess
2048 CPUs * 16 processes per CPU = 32K, the default limit on process IDs.
The default kernel threads alone consume a large share of the available PIDs.
There is currently no effective solution.
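
The arithmetic behind the headline can be checked in a few lines of C. This is only an illustration: the helper name pids_per_cpu is made up here, and 32768 is the "32K" default limit mentioned in the note.

```c
/* Default PID limit referred to in the note above (32K). */
#define DEFAULT_PID_LIMIT 32768u

/*
 * Hypothetical helper: number of PIDs each CPU would get if the PID
 * space were divided evenly among all CPUs.
 */
unsigned int pids_per_cpu(unsigned int pid_limit, unsigned int ncpus)
{
    return ncpus ? pid_limit / ncpus : 0;
}
```

On a 2048-CPU machine, pids_per_cpu(DEFAULT_PID_LIMIT, 2048) leaves 16 PIDs per CPU, and per-CPU kernel threads come out of that same budget.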

2 Suspend block
An API from Android.
It allows the system to automatically suspend itself when nothing is going on, and allows code to say "something is going on" at both the kernel and user-space levels.

3 Kernel Hacker's Bookshelf: Generating Realistic Impressions for File-System Benchmarking
"This paper describes Impressions, a tool for generating realistic, reproducible file system images which can serve as the base of new file system benchmarks."

4 Might 2.6.35 be BKL-free?

5 The cpuidle subsystem
On your editor's laptop, there are three idle states with the following characteristics:

                          C1      C2      C3
Exit latency (µs)          1       1      57
Power consumption (mW)  1000     500     100

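Each CPU has different characteristics, so the cpuidle code abstracts them into a driver layer that is specific to a given architecture. The decision about which idle state to enter, however, is platform-independent and is handled by the cpuidle "governors".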
Every processor has different idle-state characteristics and different actions are required to enter and leave those states. The cpuidle code abstracts that complexity into a separate driver layer; the drivers themselves are often found in architecture-specific or ACPI code. On the other hand, the decision as to which idle state makes sense in a given situation is very much a policy issue. The cpuidle "governors" interface allows the implementation of different policies for different needs.

A cpuidle_driver must be registered first. Once the driver exists, it can register a cpuidle "device" for each CPU in the system.
Each cpuidle "device" may offer several states to choose from:
struct cpuidle_state {
    /* ... */
    unsigned int flags;
    unsigned int exit_latency;     /* in US */
    unsigned int power_usage;      /* in mW */
    unsigned int target_residency; /* in US */
    /* ... */
    int (*enter) (struct cpuidle_device *dev, struct cpuidle_state *state);
};
The enter member performs the state transition and is triggered by the governor.

A call to enter() is a request from the current governor to put the CPU associated with dev into the given state. Note that enter() is free to choose a different state if there is a good reason to do so, but it should store the actual state used in the device's last_state field.

When a CPU has nothing to do, the cpuidle governor is called to make the selection.

Governors implement the policy side of cpuidle. The kernel allows the existence of multiple governors at any given time, though only one will be in control of a given CPU at any time. Governor code begins by filling in a cpuidle_governor structure:

struct cpuidle_governor {
    char name[CPUIDLE_NAME_LEN];
    unsigned int rating;

    int (*enable) (struct cpuidle_device *dev);
    void (*disable) (struct cpuidle_device *dev);
    int (*select) (struct cpuidle_device *dev);
    void (*reflect) (struct cpuidle_device *dev);

    struct module *owner;
    /* ... */
};
The select() function is called whenever the CPU has nothing to do and wishes the governor to pick the optimal way of getting that nothing done.
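
To make the split between driver data and governor policy more concrete, here is a minimal user-space sketch of a select()-style decision. It is not the kernel's governor code: struct idle_state, pick_idle_state() and the latency-limit policy are assumptions made for illustration, and the table values are the laptop figures quoted above.

```c
#include <stddef.h>

/* Simplified stand-in for struct cpuidle_state (illustration only). */
struct idle_state {
    const char *name;
    unsigned int exit_latency; /* in microseconds */
    unsigned int power_usage;  /* in mW */
};

/* The three idle states from the laptop example above. */
const struct idle_state laptop_states[] = {
    { "C1", 1, 1000 },
    { "C2", 1, 500 },
    { "C3", 57, 100 },
};

/*
 * Hypothetical policy: among the states whose exit latency does not
 * exceed latency_limit_us, return the index of the one with the lowest
 * power usage. Returns -1 if no state qualifies.
 */
int pick_idle_state(const struct idle_state *states, size_t count,
                    unsigned int latency_limit_us)
{
    int best = -1;
    size_t i;

    for (i = 0; i < count; i++) {
        if (states[i].exit_latency > latency_limit_us)
            continue; /* waking up from this state would take too long */
        if (best < 0 || states[i].power_usage < states[best].power_usage)
            best = (int)i;
    }
    return best;
}
```

With a 10 µs latency limit this sketch picks C2 (index 1); with a 100 µs limit it picks C3 (index 2). A real governor would also weigh things like target residency and the expected length of the idle period.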

1 Ceph: The Distributed File System Creature from the Object Lagoon (Linux Mag)
Linux Magazine has an article on Ceph, which was merged for 2.6.34.

2 Fixing the ondemand governor
The "cpufreq" subsystem has three governors:
The "performance" governor prioritizes throughput above all else, while the "powersave" governor tries to keep power consumption to a minimum. The most commonly-used governor, though, is "ondemand," which attempts to perform a balancing act between power usage and throughput.

How it works: lower the frequency when the system is idle, raise it otherwise.
ondemand works like this: every so often the governor wakes up and looks at how busy the CPU is. If the idle time falls below a threshold, the CPU frequency will be bumped up; if, instead, there is too much idle time, the frequency will be reduced. By default, on a system with high-resolution timers, the minimum idle percentage is 5%; CPU frequency will be reduced if idle time goes above 15%. The minimum percentage can be adjusted in sysfs (under /sys/devices/system/cpu/cpuN/cpufreq/); the maximum is wired at 10% above the minimum.
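
The threshold logic described above can be sketched as a small decision function. This is a simplified model rather than the governor's actual code; the enum and the name ondemand_decide() are invented for the example, and only the 5% default and the fixed 10-point gap come from the text.

```c
/* Possible outcomes of one sampling pass. */
enum freq_action { FREQ_KEEP, FREQ_RAISE, FREQ_LOWER };

/*
 * Hypothetical decision helper. min_idle_pct is the tunable minimum
 * idle percentage (5 by default, per the text); the upper threshold is
 * fixed at 10 points above it.
 */
enum freq_action ondemand_decide(unsigned int idle_pct,
                                 unsigned int min_idle_pct)
{
    unsigned int max_idle_pct = min_idle_pct + 10;

    if (idle_pct < min_idle_pct)
        return FREQ_RAISE; /* too little idle time: the CPU is busy */
    if (idle_pct > max_idle_pct)
        return FREQ_LOWER; /* too much idle time: save power */
    return FREQ_KEEP;      /* inside the band: leave the frequency alone */
}
```

With the default minimum of 5, an idle percentage of 3 raises the frequency, 20 lowers it, and anything from 5 to 15 leaves it unchanged.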

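Shortcoming of the current ondemand governor: workloads that switch frequently between I/O-intensive and CPU-intensive phases cause the CPU frequency to change frequently, which hurts performance. The interim fix: the accounting of "idle time" is changed so that time spent waiting for disk I/O no longer counts.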
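
The effect of that accounting change on what the governor sees can be illustrated as follows. The struct cpu_sample layout and the idle_percent() helper are assumptions for this sketch, not kernel interfaces.

```c
/* Time spent in each category during one sampling interval (any unit). */
struct cpu_sample {
    unsigned long busy;
    unsigned long idle;
    unsigned long iowait; /* time spent waiting for disk I/O */
};

/*
 * Hypothetical helper: idle percentage as seen by the governor.
 * With count_iowait_as_idle != 0, I/O wait time is treated as idle;
 * with 0, it is not counted as idle (the changed accounting).
 */
unsigned int idle_percent(const struct cpu_sample *s, int count_iowait_as_idle)
{
    unsigned long total = s->busy + s->idle + s->iowait;
    unsigned long idle = s->idle;

    if (count_iowait_as_idle)
        idle += s->iowait;
    return total ? (unsigned int)(idle * 100 / total) : 0;
}
```

For an interval that is 40% busy, 10% idle and 50% waiting on I/O, counting I/O wait as idle reports 60% idle, which looks like a lightly loaded CPU. Leaving it out reports 10% idle, so an I/O-heavy phase no longer looks like a reason to drop the frequency.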

3 DM and MD come a little closer
What the Linux kernel has, instead, is three different RAID implementations: in the multiple device (MD) subsystem, in the device mapper (DM) code, and in the Btrfs filesystem. It has often been said that unifying these implementations would be a good thing but, thus far, it has not happened.

4 ELC: Using LTTng
A brief discussion of LTTng and its future.

5 When writeback goes wrong
A very interesting topic.
There are two distinct ways in which writeback is done in contemporary kernels. A series of kernel threads handles writeback to specific block devices, attempting to keep each device busy as much of the time as possible. But writeback also happens in the form of "direct reclaim," and that, it seems, is where much of the trouble is. Direct reclaim happens when the core memory allocator is short of memory; rather than cause memory allocations to fail, the memory management subsystem will go casting around for pages to free.

Direct reclaim can easily lead to stack overflows. Dave's answer was a patch which disables the use of writeback in direct reclaim. Instead, the direct reclaim path must content itself with kicking off the flusher threads and grabbing any clean pages which it may find. There is another advantage to avoiding writeback in direct reclaim: the per-device flusher threads can accumulate adjacent disk blocks and attempt to write data in a way which minimizes seeks, thus maximizing I/O throughput.

Direct reclaim is also where lumpy reclaim is done. The lumpy reclaim algorithm attempts to free pages in physically-contiguous (in RAM) chunks, minimizing memory fragmentation and increasing the reliability of larger allocations. There is, unfortunately, a tradeoff to be made here: the nature of virtual memory is such that pages which are physically contiguous in RAM are likely to be widely dispersed on the backing storage device. So lumpy reclaim, by its nature, is likely to create seeky I/O patterns, but skipping lumpy reclaim increases the likelihood of higher-order allocation failures.

Quite readable.
1 Idle cycle injection
Idle cycle injection is the forced idling of the CPU to avoid overheating; essentially, it is Google's way of running processors to the very edge of their capability without going past that edge and allowing the smoke to escape.

Salman Qazi's recently posted patch shows the current form of this work.

The core idea is simple: through some new control files under /proc/sys/kernel/kidled, the system administrator can set, on a per-CPU basis, the percentage of time that the CPU should be idle and an interval over which that percentage is calculated. If the end of an interval draws near and the CPU has not been naturally idle for the requisite time, kidled will force the processor to go idle for a while.
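
The bookkeeping behind that idea can be sketched as follows. This is only a model of the arithmetic: the function name, the millisecond units and the parameters are assumptions for illustration and do not reflect the actual kidled control files or code.

```c
/*
 * Hypothetical helper: how much forced idle time (in ms) is still
 * needed in the current interval so that the CPU ends up idle for at
 * least idle_pct percent of interval_ms. natural_idle_ms is the time
 * the CPU has already been idle on its own.
 */
unsigned long forced_idle_needed_ms(unsigned long interval_ms,
                                    unsigned int idle_pct,
                                    unsigned long natural_idle_ms)
{
    unsigned long required_ms = interval_ms * idle_pct / 100;

    if (natural_idle_ms >= required_ms)
        return 0; /* the CPU was idle enough without any help */
    return required_ms - natural_idle_ms;
}
```

For a 1000 ms interval with a 20% idle target, a CPU that was naturally idle for only 50 ms would need another 150 ms of injected idle time before the interval ends.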

2 ELC: Status of embedded Linux

Promising patches:
kbuild CROSS_COMPILE option, which will make it easier to build for multiple architectures.
Arnd Bergmann's patches that are geared towards making it easier to add new architectures to the kernel—without propagating the bugs and quirks from existing ones.

Boot speed
The Moblin effort really kickstarted that work.
Several new kernel features are available to help reduce boot time, including some which allow parts of device initialization to run in parallel. There is also scripts/bootgraph.pl to help visualize where boot time is being spent.
[...] was also noted as a way to decrease boot times.
Kernel size
To help embedded developers make better use of limited memory, there is work that was funded by CELF. Various compression methods have been added to compress the kernel image in different ways. LZMA can be up to 30% better than gzip, and LZO is not as good at compression, but is much faster. The ramzswap device (also known as compcache) allows in-memory compressed swap.

For a summary of some of these features, see [...]

3 The case of the overly anonymous anon_vma
See my write-up: http://blog.chinaunix.net/space.php?uid=1858380&do=blog&id=93302

This issue is well worth reading.

1 A "live mode" for perf

Next up would seem to be a "live mode", where perf no longer requires two steps: record the data, then analyze. Live mode will allow perf trace record and perf trace report to operate via a pipe, which allows instantaneous, as well as continuously updating (a la top), output.

2 Bootmem may be phased out

Yinghai has a change on the horizon for 2.6.35: replacing the early_res code with the "logical memory block" allocator currently used by some other architectures. That change looks even more disruptive than the bootmem elimination was.

3 Memory management for virtualization

Three stopgap technologies: high memory, huge pages, balloon drivers.

Like high memory and transparent huge pages, balloon drivers may eventually be consigned to the pile of failed technologies. Until something better comes along, though, we'll still need them.

4 Receive flow steering

Packets are steered to the CPU where the receiving application is running; the details need further study.

The receive packet steering (RPS) patches provide a way to steer packets to particular CPUs based on a hash of the packet's protocol data. Those patches were applied to the network subsystem tree and are bound for 2.6.35, but now Herbert is back with an enhancement to RPS that will attempt to steer packets to the CPU on which the receiving application is running: receive flow steering (RFS).
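
The difference between the two approaches can be sketched in user-space C. None of this is the kernel's implementation: struct flow_key, the toy hash, the table size and all function names are assumptions chosen to show the idea that RPS picks a CPU from the hash alone, while RFS first consults a table recording where the receiving application last ran.

```c
#include <stdint.h>

#define FLOW_TABLE_SIZE 256
#define NO_CPU (-1)

/* Simplified flow identity (illustration only). */
struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Toy mixing hash; a stand-in for whatever hash the real code uses. */
uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t words[3] = { k->saddr, k->daddr,
                          ((uint32_t)k->sport << 16) | k->dport };
    uint32_t h = 2166136261u;
    int i;

    for (i = 0; i < 3; i++) {
        h ^= words[i];
        h *= 16777619u;
    }
    return h;
}

/* RPS-style steering: the CPU is chosen purely from the flow hash. */
int rps_cpu(const struct flow_key *k, int ncpus)
{
    return (int)(flow_hash(k) % (uint32_t)ncpus);
}

/* RFS-style table: per hash bucket, the CPU the application last ran on. */
struct flow_table {
    int cpu[FLOW_TABLE_SIZE];
};

void flow_table_init(struct flow_table *t)
{
    int i;

    for (i = 0; i < FLOW_TABLE_SIZE; i++)
        t->cpu[i] = NO_CPU;
}

/* Record that the application receiving this flow ran on the given CPU. */
void rfs_record(struct flow_table *t, const struct flow_key *k, int cpu)
{
    t->cpu[flow_hash(k) % FLOW_TABLE_SIZE] = cpu;
}

/* RFS-style steering: prefer the recorded CPU, else fall back to RPS. */
int rfs_cpu(const struct flow_table *t, const struct flow_key *k, int ncpus)
{
    int cpu = t->cpu[flow_hash(k) % FLOW_TABLE_SIZE];

    return cpu != NO_CPU ? cpu : rps_cpu(k, ncpus);
}
```

Before any rfs_record() call, rfs_cpu() behaves exactly like rps_cpu(); once the application's CPU has been recorded for a flow, packets for that flow follow the application.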

5 The padata parallel execution mechanism

In short: padata is a mechanism by which the kernel can farm work out to be done in parallel on multiple CPUs while retaining the ordering of tasks. It was developed for use with the IPsec code.
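
The "retaining the ordering" part can be modeled with a small reorder buffer. This is a single-threaded sketch of the ordering idea only, not the padata API: struct reorder_queue and its functions are invented for the example, and the parallel workers are simulated by calling reorder_complete() out of order.

```c
#include <stddef.h>

#define MAX_JOBS 16

/* Results are parked here until all earlier jobs have finished. */
struct reorder_queue {
    int done[MAX_JOBS];   /* done[seq] != 0 once job seq has finished */
    int result[MAX_JOBS]; /* result of job seq, valid when done[seq] */
    size_t next;          /* next sequence number to deliver in order */
};

void reorder_init(struct reorder_queue *q)
{
    size_t i;

    for (i = 0; i < MAX_JOBS; i++) {
        q->done[i] = 0;
        q->result[i] = 0;
    }
    q->next = 0;
}

/* Called when the parallel part of job seq finishes, in any order. */
void reorder_complete(struct reorder_queue *q, size_t seq, int result)
{
    if (seq < MAX_JOBS) {
        q->result[seq] = result;
        q->done[seq] = 1;
    }
}

/*
 * Deliver finished jobs strictly in submission order. Copies up to
 * out_len results into out and returns how many were delivered; stops
 * at the first job that has not finished yet.
 */
size_t reorder_drain(struct reorder_queue *q, int *out, size_t out_len)
{
    size_t n = 0;

    while (n < out_len && q->next < MAX_JOBS && q->done[q->next]) {
        out[n++] = q->result[q->next];
        q->next++;
    }
    return n;
}
```

If jobs 2 and 1 finish before job 0, reorder_drain() delivers nothing until job 0 completes, and then hands back all three results in their original order.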
