lwn.net kernel news 2011/3 - baozhao - ChinaUnix blog

lwn.net kernel news 2011/3

Category: LINUX

2011-09-29 16:32:13

Jump label allows the optimization of "highly unlikely" code branches to the point that their normal overhead is close to zero. This speedup is done with runtime code patching, which is also the cost: enabling or disabling the unlikely case is an expensive operation. Thus, jump label is best used for code which is almost never enabled; tracepoints and debugging statements are obvious cases.

There were a number of complaints about the initial jump label implementation, including the fact that it was somewhat awkward to use. In response, a new patch set has been posted which changes the interface considerably. One starts by declaring a "jump key":

#include <linux/jump_label.h>

struct jump_label_key my_key;

Enabling and disabling the key is a simple matter of calling:

jump_label_inc(struct jump_label_key *key);

jump_label_dec(struct jump_label_key *key);

And using the key to control the execution of rarely-needed code becomes:

if (static_branch(&my_key)) {
    /* Unlikely stuff happens here */
}

In the absence of full jump label support, a jump key is represented by an atomic_t value. jump_label_inc() becomes atomic_inc(), jump_label_dec() becomes atomic_dec(), and static_branch() is implemented with atomic_read(). If jump label is configured into the kernel, enabling and disabling a jump key become heavier operations, while static_branch() becomes nearly free. For the intended use cases for jump labels, that is a worthwhile tradeoff.
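
The fallback described above can be modelled in user space. This is only a sketch: C11 atomics stand in for the kernel's atomic_t, and the demo_* names are made up for illustration, they are not kernel interfaces.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct jump_label_key. */
struct demo_jump_key {
    atomic_int enabled;   /* number of outstanding "enable" requests */
};

/* Plays the role of jump_label_inc(): a plain atomic increment. */
static inline void demo_key_inc(struct demo_jump_key *key)
{
    assert(key != NULL);
    atomic_fetch_add(&key->enabled, 1);
}

/* Plays the role of jump_label_dec(): a plain atomic decrement. */
static inline void demo_key_dec(struct demo_jump_key *key)
{
    assert(key != NULL);
    atomic_fetch_sub(&key->enabled, 1);
}

/* Plays the role of static_branch(): an atomic read on every test. */
static inline bool demo_static_branch(struct demo_jump_key *key)
{
    assert(key != NULL);
    return atomic_load(&key->enabled) != 0;
}
```

In this model every test of the key costs a memory read and a conditional branch. That per-test cost is what the code-patching implementation removes, at the price of making the inc/dec operations expensive.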

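The ancient APM code gets in the way of cpuidle improvements; the decision was to keep only its most basic functionality and remove the rest.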
3

· Beginning user namespace support has been merged. User namespaces are a sort of container where processes can safely be given root access within the container without being able to affect the rest of the system. Full container support is a long-term project, but the user namespace patches get the kernel one step closer.

· It is now possible for a suitably privileged process to write to a process's /proc/pid/mem file.

· The media controller subsystem, intended to allow the system to export information about the topology of complex media subsystems to user space, has been merged.

· printk() and friends have a new "%pB" format specifier which prints a backtrace symbol and its offset.

· Some low-level interrupt-related functions have changed names:

Old	New
get_irq_chip()	irq_get_chip()
get_irq_chip_data()	irq_get_chip_data()
get_irq_msi()	irq_get_msi_desc()
irq_data_get_irq_data()	irq_data_get_irq_handler_data()
set_irq_chained_handler()	irq_set_chained_handler()
set_irq_chip()	irq_set_chip()
set_irq_chip_and_handler_name()	irq_set_chip_and_handler_name()
set_irq_data()	irq_set_handler_data()
set_irq_handler()	irq_set_handler()
set_irq_nested_thread()	irq_set_nested_thread()
set_irq_noprobe()	irq_set_noprobe()
set_irq_type()	irq_set_irq_type()
set_irq_wake()	irq_set_irq_wake()

4 Dynamic devices and static configuration

The problem raised by the OMAP-based "USB-attached" network port

The traditional approach is to use platform_data, but the USB subsystem does not support it.

The traditional approach is through the creation of "board files". These files are meant to provide the kernel with enough information to understand the topology of the hardware it is running on; information related to specific devices is typically passed through a set of static platform_device structures, and through that structure's platform_data pointer in particular. As the driver initializes the device, it can refer to the platform_data pointer (which points to some sort of device-specific structure) for any information which it cannot get from the hardware itself.

A better approach is the device tree, but that is a long-term remedy which does not help with the immediate problem.

How to deal with massive forking by local users

The patch starts with the addition of a new process tracking structure. It is organized as a simple tree reflecting the actual family structure of the processes on the system. It differs from existing data structures, though, in that this "history tree" persists even when some processes exit.

The history tree is updated periodically.

How is a fork bomb detected?

The code checks whether there have been any memory allocation stalls or kswapd runs since the last check. It also looks at whether the total number of processes on the system has increased.

How is it handled?

Enter the fork bomb killer, which is invoked by the OOM killer. The fork bomb killer will perform a depth-first traversal of the process history tree, filling in each node with information on the total number of processes below that node and the total memory used by those processes. At the end, the process with the highest score is examined; if there are at least ten processes in the history below the high scorer, it is deemed to be a fork bomb; that process and all of its descendants will be killed.
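
The aggregation step of that traversal might look roughly like the following user-space sketch. The structure layout and all names are hypothetical, recursion is used only for brevity, and the scoring formula is left out because the notes do not describe it; only the "at least ten processes below" threshold comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node of the process history tree. */
struct hist_node {
    struct hist_node *first_child;
    struct hist_node *next_sibling;
    unsigned long mem_pages;     /* memory used by this process itself */
    unsigned long total_procs;   /* filled in: processes below this node */
    unsigned long total_pages;   /* filled in: memory used below this node */
};

/* Depth-first walk: finish each child subtree, then fold it into the parent. */
static void hist_fill_totals(struct hist_node *node)
{
    struct hist_node *child;

    assert(node != NULL);
    node->total_procs = 0;
    node->total_pages = 0;
    for (child = node->first_child; child; child = child->next_sibling) {
        hist_fill_totals(child);
        node->total_procs += 1 + child->total_procs;
        node->total_pages += child->mem_pages + child->total_pages;
    }
}

/* Threshold taken from the text: ten or more processes below the node. */
#define FORK_BOMB_MIN_DESCENDANTS 10

static int hist_looks_like_fork_bomb(const struct hist_node *node)
{
    assert(node != NULL);
    return node->total_procs >= FORK_BOMB_MIN_DESCENDANTS;
}
```

The threshold test would be applied only to the highest-scoring node once the totals have been filled in.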


Old functions like simple_strtoul() will silently ignore junk at the end of an integer value, so "100xx" successfully converts to an unsigned integer type. Alternatives like strict_strtoul() have been encouraged instead, but they have problems too, including the lack of overflow checks. So what's a kernel hacker to do?

As of 2.6.39, there is a new set of string-to-integer converters which is expected to be used in preference to all others.

Unsigned conversions can be done with any of kstrtoull(), kstrtoul(), kstrtouint(), kstrtou64(), kstrtou32(), kstrtou16(), or kstrtou8().
Conversions to signed integers can be done with kstrtoll(), kstrtol(), kstrtoint(), kstrtos64(), kstrtos32(), kstrtos16(), or kstrtos8().
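
To make the difference concrete, here is a user-space sketch of the same strictness built on the C library's strtoul(). The function name demo_strict_strtoul() is invented for this note; it is not the kernel implementation. It assumes the kernel convention of returning 0 on success and a negative errno value on failure, and that a single trailing newline is tolerated.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical user-space analogue of kstrtoul(): returns 0 on success,
 * -EINVAL for malformed input, -ERANGE on overflow.
 */
static int demo_strict_strtoul(const char *s, unsigned int base,
                               unsigned long *res)
{
    char *end;
    unsigned long val;

    assert(res != NULL);
    if (s == NULL || (base != 0 && (base < 2 || base > 36)))
        return -EINVAL;
    /* strtoul() would skip leading blanks and accept '-'; reject both. */
    if (isspace((unsigned char)s[0]) || s[0] == '-')
        return -EINVAL;

    errno = 0;
    val = strtoul(s, &end, (int)base);
    if (end == s)
        return -EINVAL;          /* no digits at all */
    if (errno == ERANGE)
        return -ERANGE;          /* value does not fit */
    if (*end == '\n')
        end++;                   /* allow one trailing newline */
    if (*end != '\0')
        return -EINVAL;          /* trailing junk such as "100xx" */

    *res = val;
    return 0;
}
```

With this helper, "100" and "100\n" convert to 100, while "100xx" and an over-long digit string are rejected instead of being silently accepted.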

Some of the more significant user-visible changes include:

· The ipset mechanism has been merged. Ipset allows the creation of groups of IP addresses, port numbers, and MAC addresses in a way which can be quickly matched in iptables rules.

· The size of the initial congestion window in the TCP stack has been increased, a change which should lead to shorter latencies for the loading of web pages and other server-oriented tasks.

· There is a new system call:

    int syncfs(int fd);

  It behaves like sync() with the exception that only the filesystem containing fd will be flushed to persistent storage.

· The USB core has gained support for USB 3.0 hubs.

· The core has been added to the staging tree. Along with it came "zcache," a compressed in-memory caching mechanism.

· There is a new "multi-queue priority scheduler" queueing discipline in the networking layer which enables the offloading of quality-of-service processing work to suitably capable hardware.

· The CHOKe flow scheduler and the Stochastic Fair Blue scheduler have been added to the networking code.

· Support for the UniCore 32-bit RISC architecture has been merged.
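
As an illustration of the syncfs() call listed above, a small user-space helper could look like this. The helper name is made up, and it assumes a C library that exposes syncfs() under _GNU_SOURCE (glibc does so in reasonably recent versions).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush only the filesystem holding `path`.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int flush_fs_containing(const char *path)
{
    int fd, ret, saved;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ret = syncfs(fd);   /* like sync(), but limited to fd's filesystem */
    saved = errno;
    close(fd);
    errno = saved;      /* keep the syncfs() error, not close()'s */
    return ret;
}
```

Any file or directory on the target filesystem will do as the path; the descriptor only serves to name the filesystem.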

Changes visible to kernel developers include:

· Network drivers can now enable hardware support for receive flow steering via the new ndo_rx_flow_steer() method.

· kmem_cache_name(), which returned the name of a slab cache, has been removed from the kernel.

· The SLUB memory allocator now has a lockless fast path for allocations, speeding performance considerably. "Sadly this does nothing for the slowpath which is where the main issues with performance in slub are but the best case performance rises significantly."

· Kernel threads can be created on a specific NUMA node with the new kthread_create_on_node() function.

· The new function delete_from_page_cache() does what its name implies; unlike remove_from_page_cache() (which has now been deleted), it also decrements the page's reference count. It thus more closely mirrors add_to_page_cache().

· The new "hwspinlock" framework allows the implementation of synchronization primitives on systems where different cores are running different operating systems. See Documentation/hwspinlock.txt for more information.

Uniform control over whether printk messages are output

The dynamic debugging interface was added as a way of providing a uniform control interface for debugging output while avoiding cluttering the kernel with various hand-rolled alternatives.

Dynamic debug operates on print statements written with either of:

pr_debug(char *format, ...);

dev_dbg(struct device *dev, char *format, ...);

If the CONFIG_DYNAMIC_DEBUG option is not set, the above functions will be turned into normal printk() statements at the KERN_DEBUG level. If the option is enabled, though, the code sets aside a special descriptor for every call site, noting the module, function, and file names, along with the line number and format string. At system boot, all of these debug statements are turned off, so their output will not appear even if debug-level kernel messages are routed somewhere useful by the syslog daemon.

Turning on dynamic debug causes a new virtual file to appear at /sys/kernel/debug/dynamic_debug/control. Writing to that file will enable or disable specific debugging functions.
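
Writing a query to the control file from a program is just an ordinary file write. The sketch below uses an invented helper name; the query form "module <name> +p" (turn on printing for one module) follows the dynamic debug documentation, and the module name in the usage line is only an example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: send one query string to a dynamic debug
 * control file. Returns 0 on success, -1 on failure.
 */
static int dyndbg_send_query(const char *control_path, const char *query)
{
    FILE *f;
    int ok;

    assert(control_path != NULL && query != NULL);
    f = fopen(control_path, "w");
    if (f == NULL)
        return -1;
    ok = (fputs(query, f) >= 0);
    if (fclose(f) != 0)     /* the write may only fail when flushed */
        ok = 0;
    return ok ? 0 : -1;
}
```

A call such as dyndbg_send_query("/sys/kernel/debug/dynamic_debug/control", "module usbcore +p") would enable the pr_debug() and dev_dbg() call sites in that module; replacing "+p" with "-p" turns them off again. Root privileges and a mounted debugfs are assumed.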

The "pstore" filesystem provides access to platform-specific persistent storage which can be used to carry information across reboots. It is described as "a generic layer for persistent storage usable to pass tens or hundreds of kilobytes of data from the dying breath of a crashing kernel to its successor".

There are other persistent storage methods for kernel log messages, notably drivers/mtd/mtdoops.c and drivers/char/ramoops.c. But those are targeted at the embedded space where NVRAM devices are prevalent or for platforms where RAM can be reserved that will not be cleared on a restart. Pstore is more flexible, as it can store more than just kernel logs, while the two *oops devices are wired into storing the output of kmsg_dump.

1 Schultz: Diving into the Linux Networking Stack, Part I
2 2.6.39 merge window part 1

The name_to_handle_at() and open_by_handle_at() system calls have been added. The final form of the API is:

  int name_to_handle_at(int dfd, const char *name, struct file_handle *handle,
              int *mnt_id, int flag);
  int open_by_handle_at(int dirfd, struct file_handle *handle, int flags);

This functionality is intended for use by user-space file servers, which can more efficiently track files using file handles.
The open() system call has a new flag: O_PATH. A file opened with this flag will have had its path resolved by the kernel and is known to exist, but there is little else that can be done with it. System calls which operate on file descriptors directly (close() or dup(), for example) will work; these file descriptors can also be passed to another process over Unix-domain sockets using SCM_RIGHTS datagrams. The reason for the existence of O_PATH file descriptors is for use as the directory file descriptor in the various "*at()" system calls.
Tasks in the SCHED_IDLE class are now allowed to upgrade themselves into the SCHED_BATCH or SCHED_OTHER classes if their "nice" rlimit is adequate.
There is a new system call which allows the adjustment of POSIX clocks:

int clock_adjtime(clockid_t which_clock, struct timex *time);

Time adjustments possible are the same as for adjtimex(), but specific POSIX clocks may not support all operations.
The CLOCK_BOOTTIME POSIX clock has been added.
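
The O_PATH use case mentioned above (a directory descriptor for the "*at()" calls) can be sketched as follows. The helper name is invented for this note; O_PATH needs _GNU_SOURCE with glibc and a kernel that supports the flag.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: stat `name` relative to directory `dir`,
 * using an O_PATH descriptor as the anchor for fstatat().
 * Returns 0 on success, -1 with errno set on failure.
 */
static int stat_below(const char *dir, const char *name, struct stat *st)
{
    int dirfd, ret, saved;

    assert(dir != NULL && name != NULL && st != NULL);
    /* The path is resolved, but the directory is not opened for reading. */
    dirfd = open(dir, O_PATH | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    ret = fstatat(dirfd, name, st, 0);
    saved = errno;
    close(dirfd);       /* close() is one of the few calls O_PATH fds allow */
    errno = saved;
    return ret;
}
```

Because the descriptor is opened with O_PATH, the caller needs no read permission on the directory itself, only the ability to traverse the path to it.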

Changes visible to kernel developers include:

The kernel can now force (almost) all interrupt handlers to be run in threads; this capability is controlled with the threadirqs command line option. This is a useful debugging feature, as a crashing interrupt handler will, when running in a thread, merely cause a kernel oops instead of bringing down the whole system. Interrupt handlers which should never be forced into threads can be marked with IRQF_NO_THREAD, but its use is expected to be rare.
The debugobjects infrastructure now allows the specification of a "debug hint" function; it returns an address which can be used to better identify a specific object.
The perf events subsystem has a new monitoring mode wherein it only watches processes belonging to a specific control group. The new -G option to perf provides access to this functionality.
A new feature has been added to the fair scheduler which should improve performance for guests virtualized with KVM.
There is a new mechanism for the dynamic addition of POSIX clocks.

3 Uprobes: 11th time is the charm?
There is hope that it will be merged into the mainline in the future.
The purpose of the uprobes subsystem is to enable the placement of probes into user-space executable process memory. These probes might be used to support a debugger like gdb or to support user-space tracing.

Implementation internals:
The ptrace() interface is tied to processes; uprobes, instead, works with files. A probe is placed at a certain offset within a specific file; it will then trigger for every process which executes through the probe's location. If the code placing the probe is only interested in specific processes, it will need to filter the events itself. The interface may seem a little strange - users will probably almost always be interested in specific processes - but there are some advantages to doing things this way.
Underneath the hood, uprobes works by faulting in the page which will contain the probe. The instruction at the probe location is copied aside and replaced by a breakpoint. Every process which has that file mapped then gets a pointer in its mm structure pointing to the data describing the probe(s) for that file. When a process executes the breakpoint, the probe's handler function will be called; on that handler's return, the kernel will single-step the displaced instruction, then return to the location following the probe.

4 APIs for sensors
The current problem:
New devices are added with inconsistent interfaces, making life hard for application developers.

What exists already: Video4Linux2 handles cameras, and the hwmon subsystem deals with the specific class of sensors aimed at monitoring the health of the computer itself.

The candidate, IIO, is still in the staging tree and has a long way to go, while driver developers cannot wait for a unified interface.
The industrial I/O (IIO) subsystem is meant "for devices that in some sense are analog to digital converters." IIO tries to handle a wide variety of sensors in some sort of standard way with support for events, higher bandwidth I/O, and more.

1 Removed directories and st_nlink
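When a directory A is renamed onto an existing directory B, directory B is effectively deleted.

    struct stat buf1, buf2;
    int fd1, fd2;

    mkdir("foo", 0777);
    mkdir("bar", 0777);
    fd1 = open("foo", O_DIRECTORY);
    fd2 = open("bar", O_DIRECTORY);
    rename("foo", "bar");    /* kill old bar; fd2 still refers to it */
    rmdir("bar");            /* kill old foo; fd1 still refers to it */
    fstat(fd1, &buf1);
    fstat(fd2, &buf2);

Normally both buf1.st_nlink and buf2.st_nlink should be 0, but many filesystems do not get this right.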

2 Protecting /proc/slabinfo
A patch which changed the permissions of /proc/slabinfo to 0400 triggered a discussion. The argument for it: "nearly all recent public exploits for heap issues rely on feedback from /proc/slabinfo to manipulate heap layout into an exploitable state". The conclusion of the discussion was that this does not solve the underlying problem.

Mackall believes the real problem is that it is too easy for programmers to copy the wrong amount of data from user space (which is how most of these object overruns occur); the copy_from_user() interface is what should be examined.

3 Improving ptrace()
Tejun Heo posted proposals for the improvement of ptrace().
Background:
The issue is the interaction between tracing and job control. In an untraced process, job control is used by the kernel and the shell to stop and restart processes, possibly moving them between the foreground and the background.
Problems that arise once tracing is added, and how to handle them:

Problem 1: reparenting the traced process deprives the real parent of the ability to get notifications when that process is stopped or started. Solution: a traced process should always, when stopped, be in the TASK_TRACED state. The current strange transitions between the TASK_TRACED and TASK_STOPPED states would go away. He would fix things so that notifications when a process stops or starts would always go to the real parent, even when a process has been reparented for tracing.
Problem 2: a task which is running under strace can be stopped with ^Z as usual, but the shell will be unable to restart it. Solution: the tracing process has total control over the traced process's state, so it is up to the tracer to start a stopped process if the shell wants that done. Currently, tracers have no way to know that the real parent has tried to start a stopped process, so a notification mechanism needs to be added. That would be done by extending the STOPPED notification that can currently be obtained with one of the variants of the wait() system call.
Problem 3: the behavior of the PTRACE_ATTACH operation, which attaches to a process and sends a SIGSTOP signal to put it into the stopped state. The signal confuses things, and the stopped state is undesirable; no consensus was reached on how to handle this.

4 Delaying the OOM killer
Problems arising when the OOM killer operates per control group.
It is possible for user space to take over OOM-killer duties in the control group context. Each group has a control file called oom_control which can be used in a couple of interesting ways:

Writing "1" to that file will disable the OOM killer within that group. Should an out-of-memory situation come about, the processes in the affected group will simply block when attempting to allocate memory until the situation improves somehow.
Through the use of a special eventfd() file descriptor, a process can use the oom_control file to sign up for notifications of out-of-memory events (see Documentation/cgroups/memory.txt for the details on how that is done). That process will be informed whenever the control group runs out of memory; it can then respond to address the problem.
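
The eventfd() registration described above might be sketched as follows. This assumes the cgroup (v1) memory controller conventions from Documentation/cgroups/memory.txt, where the files are named memory.oom_control and cgroup.event_control and registration is done by writing "<event_fd> <oom_control_fd>" to the latter. The structure and function names are invented for this note.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one registered OOM listener. */
struct oom_listener {
    int event_fd;     /* becomes readable when the group hits OOM */
    int control_fd;   /* open descriptor for memory.oom_control */
};

static void close_if_open(int fd)
{
    if (fd >= 0)
        close(fd);
}

/* Returns 0 on success, -1 on failure (both fds are then set to -1). */
static int oom_listener_register(const char *cgroup_dir,
                                 struct oom_listener *l)
{
    char path[4096], cmd[64];
    int reg_fd, len;

    assert(cgroup_dir != NULL && l != NULL);

    snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
    l->control_fd = open(path, O_RDONLY);
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
    reg_fd = open(path, O_WRONLY);
    l->event_fd = eventfd(0, 0);
    if (l->control_fd < 0 || reg_fd < 0 || l->event_fd < 0)
        goto fail;

    /* Registration string: "<event_fd> <fd of memory.oom_control>" */
    len = snprintf(cmd, sizeof(cmd), "%d %d", l->event_fd, l->control_fd);
    if (write(reg_fd, cmd, (size_t)len) != (ssize_t)len)
        goto fail;
    close(reg_fd);
    return 0;

fail:
    close_if_open(reg_fd);
    close_if_open(l->control_fd);
    close_if_open(l->event_fd);
    l->event_fd = l->control_fd = -1;
    return -1;
}
```

After a successful registration, a blocking read() of eight bytes from event_fd returns once the group runs out of memory, at which point the listener can decide what to do about it.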

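A user-space process can take over the OOM killer's job, but what happens if that process itself runs short of memory?

Google has a patch for this, but it is unlikely to be accepted.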
The outcome was adding another control file to the control group called oom_delay_millisecs. Like oom_control, it holds off the kernel's OOM killer in favor of a user-space alternative. The difference is that the administrator can provide a time limit for the kernel OOM killer's patience; if the out-of-memory situation persists after that much time, the kernel's OOM killer will step in and resolve the situation with as much prejudice as necessary.

1 The debloat-testing kernel tree
This tree exists for the testing of bloat mitigation and removal patches. Current patches include the CHOKe packet scheduler, the SFB flow scheduler, some driver patches, and more.

2 Intel announces a BIOS Implementation Test Suite (BITS)
BITS can be used to check how the BIOS configured platform hardware in a system or to override the BIOS configuration using a known-good configuration.

3 Red Hat's "obfuscated" kernel source
Red Hat is making things harder by shipping its RHEL 6 kernel source as one big tarball, without breaking out the patches. This goes against the spirit of the GPL.

4 Waking systems from suspend
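Besides periodic and one-shot alarms, the RTC also supports generating an alarm interrupt even while the system is suspended.
Earlier kernels allowed the RTC to be operated directly through the sysfs interface, but the kernel supported only a single application; if there were several, they had to coordinate in user space.
The improvements in 2.6.38: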
A generic "timerqueue" abstraction has been created to manage a simple list of timers that could then be shared with other areas of the kernel, like the high-resolution timers subsystem, that also have to manage timer events. The next step is to rework the RTC code so that, when an alarm is set via the character device ioctl() or sysfs interface, an rtc_timer event is created and enqueued into the per-RTC timerqueue instead of directly programming the hardware. The kernel then sets the hardware timer to fire for the earliest event in the queue. In effect, this mechanism virtualizes the RTC hardware, preserving the behavior of the existing hardware-oriented interfaces, while allowing the kernel to multiplex other events using the RTC.
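
The multiplexing idea can be shown with a toy model: keep the timers sorted by expiry and always "program the hardware" for the head of the queue. Everything here is hypothetical; a plain sorted linked list stands in for the kernel's timerqueue, and hw_alarm is just a field standing in for the RTC alarm register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one queued RTC timer event. */
struct demo_rtc_timer {
    uint64_t expires;
    struct demo_rtc_timer *next;
};

/* Hypothetical per-RTC state. */
struct demo_rtc {
    struct demo_rtc_timer *head;  /* sorted, earliest expiry first */
    uint64_t hw_alarm;            /* value "programmed" into hardware; 0 = none */
};

/* Insert in expiry order, then program the alarm for the earliest event. */
static void demo_rtc_enqueue(struct demo_rtc *rtc, struct demo_rtc_timer *t)
{
    struct demo_rtc_timer **pos;

    assert(rtc != NULL && t != NULL);
    for (pos = &rtc->head; *pos && (*pos)->expires <= t->expires;
         pos = &(*pos)->next)
        ;
    t->next = *pos;
    *pos = t;
    rtc->hw_alarm = rtc->head->expires;
}

/* Alarm fired: pop the earliest event and reprogram for the next one. */
static struct demo_rtc_timer *demo_rtc_fire(struct demo_rtc *rtc)
{
    struct demo_rtc_timer *t;

    assert(rtc != NULL);
    t = rtc->head;
    if (t == NULL)
        return NULL;
    rtc->head = t->next;
    t->next = NULL;
    rtc->hw_alarm = rtc->head ? rtc->head->expires : 0;
    return t;
}
```

Several users can enqueue alarms in any order; the single hardware alarm always tracks whichever event is due first, which is the "virtualization" effect the text describes.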
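The remaining question: how should the interface be exposed to user space?
Using the RTC directly as the time source is not easy, because it is not consistent with system time and may need adjusting. Android takes a hybrid approach, using high-resolution timers normally and the RTC during suspend, but that work is unlikely to get into the mainline. John Stultz has drawn on Android's approach in his own implementation.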

5 Capabilities for loading network modules
The CAP_SYS_MODULE capability allows loading modules from anywhere, rather than restricting the module search path to /lib/modules/.... So, by switching to use CAP_NET_ADMIN, network utilities, like ifconfig, could be restricted to only load system modules, rather than arbitrary code.
I also learned about a new command, capsh (capability shell wrapper): this tool provides a handy wrapper for certain types of capability testing and environment creation. It also provides some debugging features useful for summarizing capability state.
The newly introduced change restricts loading to modules whose names start with "netdev-", while remaining backward compatible (older systems support only CAP_SYS_MODULE, with modules named eth0 and the like).

6 Who wrote 2.6.38
The 2.6.38 cycle has seen 9,148 non-merge changesets from 1,136 developers (again, as of this writing). 603,000 lines of code were added in this cycle, and 312,000 were removed, for a net growth of 291,000 lines of code.
