lwn.net kernel news 2012/2 - baozhao - ChinaUnix blog

原上草baozhao.blog.chinaunix.net


lwn.net kernel news 2012/2

Category: LINUX

2012-04-14 22:46:07

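A reimplementation, called static keys. Oddly, the documentation does not seem to mention the earlier jump labels work at all (it states "Currently, tracepoints are implemented using a conditional branch.", which is not how I understand things).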

as described in this document. A static key is defined with one of:

struct static_key key = STATIC_KEY_INIT_FALSE;

struct static_key key = STATIC_KEY_INIT_TRUE;

The initial value should match the normal operational setting - the one that should be fast. Testing of static keys is done with:

if (static_key_false(&key)) {
	/* Unlikely code here */
}

Tejun Heo started the discussion by describing his complaints about the control group mechanism and some thoughts on how things could be fixed.

The main points include:

- Removing multiple hierarchies.

The idea behind multiple hierarchies is that they allow different policies to be applied using different criteria. The objection is that they have almost no real-world use and complicate the control group implementation significantly.

- Different controllers treat the control group hierarchy differently. In particular, a number of controllers seem to have gone through an evolutionary path where the initial implementation does not recognize nested control groups but, instead, simply flattens the hierarchy.

An old topic, but a fairly broad one. A face-to-face meeting of the , which is a community interested in upstreaming the Android patch set into the Linux kernel, was organized by Tim Bird (Sony/CEWG) and hosted by Linaro Connect on February 10th.

The meeting was held to give better visibility into the various upstreaming efforts going on, making sure any potential collaboration between interested parties was able to happen. It covered a lot of ground, mostly discussing Android features that recently were re-added to the staging directory and how these features might be reworked so that they can be moved into mainline officially.

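Embedded developers should read this. Embedded developers are joining forces to avoid duplicated work and to raise code quality for upstreaming. There will be a kernel release aimed at embedded use, and each vendor will then rework on top of that release.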

Back in October 2011, the Long-term Support Initiative (LTSI) was at the LinuxCon Europe event by Linux Foundation board member (and NEC manager) Tsugikazu Shibata.

Shibata-san again presented about LTSI at ELC 2012, giving more details about how the project would be run, and how to get involved. The full slides for the presentation [PDF] are available online.

The LTSI project's goals were summarized as:

Create a longterm community-based Linux kernel to cover the embedded life cycle. (Pick one kernel and maintain it long-term.)
Create an industry-managed kernel as a common ground for the embedded industry. (Add the patches that most embedded vendors need.) This LTSI kernel tree is now public, and can be browsed at .
Set up a mechanism to support upstream activities for embedded engineers.

see for information on how to get involved.

The poll() system call has three parameters, one of which is a timeout value specifying an upper bound (in milliseconds) for how long the process will wait. The manual page indicates that the type of this value is int. For reasons lost in history, though, the kernel's internal implementation of poll() has always expected the timeout value to be a long integer. And that has created a source of occasional bugs. The kernel has since been changed to use int. For an interesting bug, see .
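As a small user-space illustration of the interface described above, the sketch below calls poll() with an int timeout in milliseconds. The helper name poll_pipe and the use of a pipe are assumptions made for this example only; they do not come from the article.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a pipe, optionally write one byte to it,
 * then poll() the read end for up to timeout_ms milliseconds.
 * Returns the poll() result: 0 on timeout, 1 if the read end is ready,
 * -1 on error.
 */
int poll_pipe(int write_first, int timeout_ms)
{
    int fds[2];
    struct pollfd pfd;
    int ready;

    /* A negative timeout means "wait forever"; refuse it in this sketch. */
    assert(timeout_ms >= 0);

    if (pipe(fds) != 0)
        return -1;

    if (write_first && write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* The third parameter is declared as int in the user-space prototype. */
    ready = poll(&pfd, 1, timeout_ms);

    close(fds[0]);
    close(fds[1]);
    return ready;
}
```

Calling poll_pipe(0, 50) waits about 50 ms on an empty pipe and reports a timeout, while poll_pipe(1, 50) returns as soon as the byte is readable.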

The standard does not define the behavior of sleep(0). Some developers put such calls in as a way to relinquish the CPU for a short period of time. The idea is to be nice and allow other processes to run briefly before continuing execution.

Linux kernel behavior: sleep(0) would always put the calling process to sleep for at least one clock tick. When high-resolution timers were added to the kernel, the behavior changed: if a process asked to sleep on an already-expired timer (which is the case for a zero-second sleep), the call simply returned directly back to the calling process. Now, under the timer slack mechanism, the call is sometimes delayed by several seconds. The current decision is to let the user-space programs break and to leave the kernel unchanged.
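The intent described above, giving up the CPU briefly, can be stated explicitly with the POSIX sched_yield() call instead of relying on sleep(0). This alternative is not discussed in the notes above; it is offered only as a sketch, and yield_cpu is a hypothetical helper name.

```c
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: offer the CPU to other runnable tasks `rounds`
 * times. Returns how many sched_yield() calls reported success.
 */
int yield_cpu(int rounds)
{
    int ok = 0;
    int i;

    assert(rounds >= 0);

    for (i = 0; i < rounds; i++) {
        /* sched_yield() moves the caller to the end of its run queue. */
        if (sched_yield() == 0)
            ok++;
    }
    return ok;
}
```

Unlike sleep(0), this call involves no timer, so it does not depend on how the kernel treats already-expired timers or timer slack.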

An article well worth reading.

Rough notes from the session can be found .

The main goals of the mini-summit were as follows:

Take first step towards planning any Linux-kernel scheduler changes that might be needed for ARM's upcoming systems to work well (see also ).
Create a power-aware infrastructure for scheduling and related Linux kernel subsystems. For example, integrate dyntick-idle, cpufreq, cpuidle, , timers, thermal framework, pm_qos, and the scheduler.
Provide a usable mechanism that reliably allows all work (present and future) to be moved off of a CPU so that said CPU can be powered off and back on under user-application control. CPU hotplug is used for this today, but has some serious side effects (it is too slow), so it would be good to either fix CPU hotplug or come up with a better mechanism, or, in the best Linux-kernel tradition, both.

Decisions on what CPUs to use should include a number of considerations. First, if a LITTLE CPU is able to provide sufficient performance, it provides better energy efficiency, at least in cases where race to idle is inappropriate. Second, because mobile platforms have no fans and are sometimes sealed, some devices might not be able to run all the big CPUs at maximum clock rate for very long without overheating. Of course, such devices might also need to limit the heat produced by analog electronics and GPUs as well (see Carroll's and Heiser's 2010 and for a power-consumption analysis of a ca. 2008 smartphone). Third, some workloads can adapt themselves to lower performance. For example, some media applications can reduce performance requirements by dropping frames and reducing resolution. Fourth, there is more to performance than CPU clock speed: For example, it is possible that a workload with high cache-miss rates can run just as fast on a LITTLE CPU as it can on a big CPU. Finally, many workloads will have preferred ways of using the CPUs, for example, some mobile workloads might use the LITTLE CPUs most of the time, but bring the big CPUs online for short bursts of intense processing.

A very technical article; I did not fully understand it.

A bug in the suspend-to-RAM state.

What is suspend?

"suspend" is an OS concept more than a hardware concept - at least it is on ARM. On X86 it is also a firmware concept imposed by the BIOS (once the APM bios, now the ACPI bios).

To Linux "suspend" means:

- freeze all processes
- give extra encouragement to devices to go to low power states
- disable some interrupts

and wait for a non-disabled interrupt.

irq_chips may form a tree structure

Linux has an abstraction called an "irq_chip". An irq_chip represents a set of interrupt sources each of which is assigned an interrupt number (as listed in /proc/interrupts). It provides functions to enable or disable each interrupt, to allow the interrupt to wake the device from suspend, to set the trigger type (edge or level), and various other functions. It also arranges that the interrupt handler for each interrupt will be called when appropriate.

Some details of suspend and interrupts

One of the many things that happens on the way to suspend is that each individual interrupt gets disabled - unless it is flagged as IRQF_NO_SUSPEND in which case it is left alone. However this doesn't mean exactly what it sounds like it means. Being "disabled" just means that the handler routine will not be run, it doesn't mean that the interrupt will not be generated. We have a different word for that, which is "masking". When an interrupt is masked the originating source of the interrupt is told to never post that interrupt.

Linux uses a lazy scheme for disabling interrupts. When the disable request is made, the fact is recorded in internal data structures, but that is all. If the interrupt is subsequently delivered, only then might the interrupt be masked. This can be a useful optimization as masking an interrupt can take a lot longer than just setting a flag in memory.

So, on the way to suspend, interrupts are disabled but not masked. If the interrupt does actually arrive before we reach full suspend, the fact is recorded. If it was an interrupt that should wake from suspend, this is detected in and suspend aborts. If the interrupt doesn't arrive before full suspend, then it is still unmasked and will successfully wake up the device, which will resume and then handle the interrupt. This might all seem a bit complex, but once it is fully understood it is actually quite neat and it works well ... except for my RTC alarm.
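The lazy disable scheme described above can be modeled in a few lines of user-space C. This is a toy model, not kernel code: the struct and the toy_irq_* functions are names invented for this sketch and do not correspond to the kernel's real irq_chip interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one interrupt line; all field names are illustrative. */
struct toy_irq {
    bool disabled;  /* software flag: the handler must not run */
    bool masked;    /* hardware state: the source may not post interrupts */
    bool pending;   /* an interrupt arrived while the line was disabled */
    int handled;    /* number of times the handler has run */
};

/* Lazy disable: record the request and leave the hardware untouched. */
void toy_irq_disable(struct toy_irq *irq)
{
    irq->disabled = true;
}

/*
 * The hardware tries to deliver an interrupt.
 * Returns false if the source is masked and so cannot post at all.
 */
bool toy_irq_deliver(struct toy_irq *irq)
{
    if (irq->masked)
        return false;
    if (irq->disabled) {
        irq->masked = true;   /* only now pay the cost of masking */
        irq->pending = true;  /* record the fact for later */
        return true;
    }
    irq->handled++;           /* enabled: run the handler */
    return true;
}

/* Enable: unmask the line and replay anything recorded while disabled. */
void toy_irq_enable(struct toy_irq *irq)
{
    irq->disabled = false;
    irq->masked = false;
    if (irq->pending) {
        irq->pending = false;
        irq->handled++;
    }
}

/* Walk through disable, a late interrupt, then enable. */
int toy_lazy_disable_demo(void)
{
    struct toy_irq irq = { false, false, false, 0 };

    toy_irq_disable(&irq);
    assert(!irq.masked);      /* still unmasked, so it could wake the system */
    toy_irq_deliver(&irq);
    assert(irq.masked && irq.pending);
    toy_irq_enable(&irq);
    return irq.handled;       /* handler runs once, after re-enabling */
}
```

The point of the model is the asymmetry: disabling costs one flag write, and the more expensive masking happens only if an interrupt actually shows up while the line is disabled.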

Conclusions

- There are complex interactions between distinct hardware. Here the USB interface and the battery charger have important interdependencies that need to be reflected in the drivers. Some interdependencies already are but there are subtleties that are easy to miss.

- The requirements of an "irq_chip" are not really documented anywhere, and given the current rate of development there is a good chance that such documentation would be out of date.

- The IRQF_NO_SUSPEND flag is clearly important, but not easy to understand.

An old topic. The upshot seems to be that mainline kernel support for Android is moving along reasonably well. It won't be too long before it will be possible to run Android on a mainline kernel while still maintaining some reasonable battery life.

This comes from an SOSP paper I read some time ago. If hardware errors are not checked for, the system may crash.

What would be nice would be a way for the computer to tell developers when they are being overly trusting of the hardware; then it might be possible to skip the "tracking down the weird problem" experience. As it happens, such a way exists in the form of a static analysis tool called , developed by Asim Kadav, Matthew J. Renzelmann and Michael M. Swift. Those wanting a lot of information about this tool can find it in , , or in .

3 (recommended)

A very interesting article.

ARM Ltd recently announced the . This architecture is MP rather than SMP: a cluster of cores, a cluster of cores, and ensuring cache coherency between them. The advantage of such an arrangement is that it allows for significant power saving when processes that don't require the full performance of the Cortex-A15 are executed on the Cortex-A7 instead. This way, non-interactive background operation, or streaming multimedia decoding, can be run on the A7 cluster for power efficiency, while sudden screen refreshes and similar bursty operations can be run on the A15 cluster to improve responsiveness and interactivity.

Linaro started an during the most recent to investigate this problem.

ARM Ltd BSD licensed . It uses virtualization to switch between the two clusters and does not make full use of the 8 processors. One possible interim solution: "We can implement this switcher by modeling its functionality as a CPU speed change, and therefore expose it via a cpufreq driver."

Here is a posting on the Intel software network describing the "transactional synchronization extensions" feature to be found in the future "Haswell" processor. Worth a look.

A new approach to Android "opportunistic suspend".

How to force unused banks of memory into partial-array self-refresh (PASR) mode. The new works at a lower level beneath the buddy allocator. The advantage is that it requires few changes to the existing system, but it is not yet complete: there is no mechanism by which a memory section becomes entirely free and eligible for PASR. The other thing that is missing at this point is any kind of measurement of how much power is actually saved using PASR.

ION is a generalized memory manager that Google introduced in the Android 4.0 ICS (Ice Cream Sandwich) release to address the issue of fragmented memory management interfaces across different Android devices. This article takes a look at ION, summarizing its interfaces to user space and to kernel-space drivers. Besides being a memory pool manager, ION also enables its clients to share buffers, hence it treads the same ground as (DMABUF). This article will end with a comparison of the two buffer sharing schemes.

SSD throughput dropped considerably in 2.6.39. The reason: the 2.6.39 kernel saw , with the result that the plugging and unplugging of queues is now explicitly managed in the I/O submission code. In addition, the function that handles basic buffered file I/O (generic_file_aio_read()) also now does its own plugging. As a result, I/O issued from generic_file_aio_read() is plugged and cannot be submitted, which defeats readahead.

The is to simply remove the top-level plugging in generic_file_aio_read() so that readahead-originated requests can get through to the hardware.

There are two candidate approaches.

CRIU

Cyrill Gorcunov has been working to fill in some of the gaps with for user-space checkpointing/restore with the "CRIU" tool set. There are a number of small additions to the kernel ABI to be found here:

A new children entry in a thread's /proc directory provides a list of that thread's immediate children. This information allows a user-space checkpoint utility to find those child processes without needing to walk through the entire process tree.
/proc/pid/stat is extended to provide the bounds of the process's argument and environment arrays, along with the exit code. That allows this information to be reliably captured at checkpoint time.
A number of new prctl() options allow the argument and environment arrays to be restored in a way matching what was there at checkpoint time. The desired end result is that ps shows the same information about a process after a checkpoint/restore cycle as it did before.
The addition of a new system call, which checks how resources are shared between processes:

long kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2);
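As a hedged sketch of how a user-space tool might use this call, the example below asks whether two file descriptors in the current process refer to the same open file. It assumes a kernel that provides kcmp() and headers that define SYS_kcmp and KCMP_FILE; the call is made through syscall() rather than a C library wrapper, and same_open_file is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if fd_a and fd_b in the calling process
 * refer to the same open file, 0 if they do not, and -1 if kcmp() is
 * unavailable or not permitted in this environment.
 */
int same_open_file(int fd_a, int fd_b)
{
    pid_t self = getpid();
    long r;

    assert(fd_a >= 0 && fd_b >= 0);

    /* With KCMP_FILE, idx1 and idx2 are file descriptor numbers. */
    r = syscall(SYS_kcmp, self, self, KCMP_FILE,
                (unsigned long)fd_a, (unsigned long)fd_b);
    if (r < 0)
        return -1;

    /* 0 means "equal"; positive values only express an ordering. */
    return r == 0 ? 1 : 0;
}
```

A descriptor and its dup() copy should compare as the same open file, while the two ends of a pipe should not. A checkpoint tool can use this kind of comparison to decide which resources must be restored as shared.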

People who want to see how the user-space side works can find the relevant code at .

DMTCP

The DMTCP project has been busy since about 2.6.9. DMTCP differs somewhat from CRIU, though; in particular, it is able to checkpoint groups of processes connected by sockets - even across different machines - and it requires no changes to the kernel at all. These features come with a couple of limitations, though.

Checkpoint/restore with DMTCP requires that the target process(es) be started with a special script; it is not possible to checkpoint arbitrary processes on the system. That script uses the LD_PRELOAD mechanism to place wrappers around a number of libc and (especially) system call implementations. As a result, DMTCP has no need to ask the kernel whether two processes are sharing a specific resource; it has been watching the relevant system calls and knows how the processes were created. The disadvantage to this approach - beyond having to run checkpointable processes in a special environment - is that, as can be seen in , not all programs can be checkpointed.

The recent improves support, though, to the point that everything a wide range of users care about should be checkpointable.

This topic is quite interesting; I will write a separate post about it.
