lwn.net kernel news 2012/3 - baozhao - ChinaUnix blog

Category: Linux

2012-04-27 20:23:46

- The prctl() system call has a new option called PR_SET_CHILD_SUBREAPER. Marking a process this way will cause any orphaned descendant processes to be reparented to the marked process rather than to the init process. There is a corresponding PR_GET_CHILD_SUBREAPER option as well.

- The ext4 "noacl" and "nouser_xattr" mount options have been marked deprecated with an eye toward removal in the near future. Without these options, it will not be possible to disable ACL and extended attribute support. No other filesystem allows that support to be disabled. The "journal=update" and "resize" mount options have been removed entirely. On the other hand, plans to remove the "bsd_df", "minix_df", "grpid" and "nogrpid" options have been dropped in response to complaints from users.

- A new subsystem called "remoteproc" has been merged; it allows for the control of remote processors (those on the same SoC but running something other than Linux) through shared memory. The new "rpmsg" subsystem is a virtio-based mechanism for communicating with those processors. There will probably be a separate article on these facilities soon; in the meantime, see Documentation/remoteproc.txt for more information.
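To make the PR_SET_CHILD_SUBREAPER item above more concrete, here is a minimal sketch of how a service manager might mark itself as a subreaper and read the setting back. The helper names set_subreaper() and get_subreaper() are made up for this illustration, the fallback constants are the values from the kernel's prctl header, and a kernel that already carries this feature is assumed.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Fallbacks for older libc headers; values taken from the kernel's
 * prctl header. */
#ifndef PR_SET_CHILD_SUBREAPER
#define PR_SET_CHILD_SUBREAPER 36
#endif
#ifndef PR_GET_CHILD_SUBREAPER
#define PR_GET_CHILD_SUBREAPER 37
#endif

/* Hypothetical helper: mark (enable != 0) or unmark the calling process
 * as a child subreaper. Returns 0 on success, -1 with errno set. */
int set_subreaper(int enable)
{
    return prctl(PR_SET_CHILD_SUBREAPER,
                 (unsigned long)(enable ? 1 : 0), 0UL, 0UL, 0UL);
}

/* Hypothetical helper: returns 1 if the caller is a subreaper,
 * 0 if it is not, -1 if the kernel rejects the request. */
int get_subreaper(void)
{
    int flag = 0;

    if (prctl(PR_GET_CHILD_SUBREAPER, (unsigned long)&flag,
              0UL, 0UL, 0UL) != 0)
        return -1;
    assert(flag == 0 || flag == 1);
    return flag;
}
```

A process that has called set_subreaper(1) would then be expected to wait() for orphaned descendants, much as init does.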

As the project itself describes it, the integrity subsystem is meant to thwart various kinds of attacks against the contents of files, both on- and off-line. Much of IMA was added to the kernel in 2.6.30, but another piece, the extended verification module (EVM), was not merged until 3.2. Digital signature support was added to EVM in 3.3, and IMA appraisal is currently under review.

The integrity measurement architecture (IMA) appraisal extension from Mimi Zohar and Dmitry Kasatkin fills in one missing piece: storing and validating the integrity measurement of files. A hash of a file's contents and metadata will be stored in the security.ima extended attribute (xattr) of the file, and the patch set will create and maintain those xattrs. In addition, it can enforce that the file contents are "correct" when the file is opened for reading or executing based on the integrity values that were stored.

Another competitor to Peter Zijlstra's NUMA scheduling patch set has appeared. Andrea Arcangeli has posted a NUMA scheduling patch set of his own. Andrea's patch lacks the concept of home nodes; he thinks it is an idea that will not work well for programs that don't fit into a single node unless developers add code to use Peter's new system calls. Instead, Andrea would like NUMA scheduling to "just work" in the same way that transparent huge pages do. So his patch set seems to assume that resources will be spread out across the system; it then focuses on cleaning things up afterward. The key to the cleanup task is a bunch of statistics and a couple of new kernel threads.

Brendan Gregg demonstrates "flame graphs" as a tool for tracking down kernel performance problems.

Linsched is a framework that can run the kernel scheduler with various simulated workloads and draw conclusions about the quality of the decisions made. It looks at overall CPU utilization, the number of migrations, and more. It is able to simulate a wide range of hardware topologies with different characteristics.

At present only about 20 lines of kernel code need to be changed. The rest has been cleverly hidden in a special "linsched" architecture that provides just enough support to run the scheduler in user space. The actual simulation and measurement code lives in the tools directory.

- The perf utility understands a new --uid flag, which restricts data gathering to processes owned by the given user ID. It is also now possible to specify multiple processes or threads with the --pid and --tid options.

- The perf events subsystem can now sample "taken branch" events on hardware with the "last branch record" functionality.

- The "Yama" security module has been merged; for now it just implements some restrictions on how the ptrace() system call can be used, but others may follow. Yama is meant to be a place to collect various discretionary access control mechanisms intended to make a system more secure.

- Jump labels have been rebranded again; they are now known as "static keys". Details can be found in the new Documentation/static-keys.txt file.

- The debugfs filesystem understands the uid=, gid=, and mode= mount options, allowing the ownership and permissions for the filesystem to be set in /etc/fstab.

- The has been merged.

- The list of power management stages continues to grow; the kernel has new callbacks called suspend_late(), resume_early(), freeze_late(), thaw_early(), poweroff_late(), and restore_early() for operations that must be performed at just the right time.

- The "IRQ domain" abstraction has been merged; IRQ domains make it easier to manage interrupts on systems with more than one interrupt controller. See Documentation/IRQ-domain.txt for more information.

Efforts to improve on printk(), such as dev_dbg() and dev_info(), have been fairly successful. These functions, by embedding the logging level in the name itself, are more concise than the printk() calls they replace. They also print the name of the relevant device in standard form, ensuring that it's always possible to associate a message with the device that generated it.

Another effort is a family of macros:

int pr_info(const char *format, ...);

int pr_emerg(const char *format, ...);

These functions, too, encode the logging level in the function name, making things more concise. They also attempt to at least minimally standardize the format of logging by passing the format string through a macro called pr_fmt().
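The pr_fmt() idea can be imitated in user space. The following is only a sketch under stated assumptions: the "mydrv: " prefix, the LVL_* markers, the *_buf macro names, and log_probe_ok() are invented for illustration, and formatting into a buffer stands in for the real printk(). The `##__VA_ARGS__` form is a GNU C extension, as used by the kernel itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-file prefix. In the kernel, a source file would
 * define pr_fmt() before including the printk header. */
#define pr_fmt(fmt) "mydrv: " fmt

/* Stand-ins for the kernel's log-level markers (assumed values). */
#define LVL_EMERG "<0>"
#define LVL_INFO  "<6>"

/* User-space imitations: the level is encoded in the macro name and
 * the format string is passed through pr_fmt(). */
#define pr_emerg_buf(buf, len, fmt, ...) \
    snprintf(buf, len, LVL_EMERG pr_fmt(fmt), ##__VA_ARGS__)
#define pr_info_buf(buf, len, fmt, ...) \
    snprintf(buf, len, LVL_INFO pr_fmt(fmt), ##__VA_ARGS__)

/* Hypothetical caller: formats one informational message and returns
 * the number of characters snprintf() reports. */
int log_probe_ok(char *buf, size_t len, int irq)
{
    assert(buf != NULL && len > 0);
    return pr_info_buf(buf, len, "probe ok, irq %d\n", irq);
}
```

Because the prefix is applied by the macro, every message from the same source file starts the same way without the caller repeating it.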

So far, the attempt to bring this family of macros into ext4 has failed.

4 (worth a read)

Existing work:

There is also an interface (available via the mbind() system call) by which a process can request a specific allocation policy for its memory. Possibilities include requiring that all allocations happen within a specific set of nodes (MPOL_BIND), setting a looser "preferred" node (MPOL_PREFERRED), or asking that allocations be distributed across the system (MPOL_INTERLEAVE). It is also possible to use mbind() to request the active migration of pages from one node to another.

Problem: with the scheduler free to move the task about at will, the task's memory can end up being spread all over the machine's nodes. That is a disaster for long-running processes with a large memory footprint.

Solution: Peter Zijlstra's patch set.

There are three major sub-parts to Peter's patch set. First: "page migration" is the process of moving a page from one node to another without the owning process(es) noticing the change. Page migration becomes lazy: pages are first unmapped, and the actual migration happens only when a page fault occurs.

Second: the second part of the patch set starts by adding the concept of a "home node" to a process. Each process (or "NUMA entity" - meaning groups containing a set of processes) is assigned a home node at fork() time. The scheduler will then try hard to avoid moving a process off its home node. When the scheduler notices that long-running tasks are being forced away from their home nodes - or that they are having to allocate memory non-locally - it will consider migrating them to a new node. Migration is not a half-measure in this case; the scheduler will move both the process and its memory (using the lazy migration mechanism) to the target node.

Finally: the last piece is a pair of new system calls allowing processes to be put into "NUMA groups" that will share the same home node. If one of them is migrated, the entire group will be migrated.

Paul McKenney with an eye toward how it might be useful for current software.

Tejun Heo has boiled down the comments and set out where he would like to go with this subsystem; multiple hierarchies may be removed.

Consider Red Hat Enterprise Linux 6: its kernel is ostensibly based on the 2.6.32 release. The actual kernel, as shipped by Red Hat, differs from 2.6.32 by around 7,700 patches, though. Many of those are fixes, but others are major new features, often backported from more recent releases. Thus, the RHEL "2.6.32" kernel includes features like per-session group scheduling, receive packet/flow steering, transparent huge pages, pstore, and, of course, support for a wide range of hardware that was not available when 2.6.32 shipped. Backporting on this scale is enormously expensive, but Red Hat has the resources to do it.

Both and Oracle's Unbreakable Enterprise Kernel Release 2 feature much more recent kernels - 3.0.10 and 3.0.16, respectively. This approach can take advantage of new features and ready-made security fixes, but "The cost of stabilizing a new kernel release, it is suggested, could exceed that of backporting desired features into an older release," and a newer kernel may bring new bugs.

Traditionally, the kernel has allowed the modification of pages in memory while those pages are in the process of being written back to persistent storage. If a process writes to a section of a file that is currently under writeback, that specific writeback operation may or may not contain all of the most recently written data. This behavior is not normally a problem; all the data will get to disk eventually, and developers (should) know that if they want to get data to disk at a specific time, they should use the fsync() system call to get it there.
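As a reminder of what "use fsync()" means in practice, here is a minimal sketch of a helper that writes a buffer and does not report success until fsync() has returned. The name write_durably() and the file mode are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path and ask the kernel to
 * push them to storage before returning.
 * Returns 0 on success, -1 on any failure. */
int write_durably(const char *path, const void *data, size_t len)
{
    const char *p = data;
    size_t left = len;
    int fd;

    assert(path != NULL && data != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry the write */
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* Without this call the data may sit in the page cache for a while. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller that needs a record on disk at a known point would call write_durably() and check its return value rather than rely on background writeback.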

Why stable pages were introduced:

Some storage hardware can transmit and store checksums along with data; those checksums can provide assurance that the data written to (or read from) disk matches what the processor thought it was writing. If the data in a page changes after the calculation of the checksum, though, that data will appear to be corrupted when the checksum is verified later on. Volatile data can also create problems on RAID devices and with filesystems implementing advanced features like data compression. For all of these reasons, the stable pages feature was added to ext4 for the 3.0 release.

The new problem caused by stable pages: with this feature, pages under writeback are marked as not being writable; any process attempting to write to such a page will block until the writeback completes. It is a relatively simple change that makes system behavior more deterministic and predictable. However, processes performing writes can find themselves blocked for lengthy periods (multiple seconds) of time. The reason is simple: a log file, for example, is appended to while it is under writeback, so the writer has to wait. The longer the I/O queues, the longer the delay.

There is no satisfactory solution yet.

From a device driver author's point of view, nothing should change. The contiguous memory allocator (or CMA) is integrated with the DMA subsystem, so the usual calls to the DMA API (such as dma_alloc_coherent()) should work as usual. In fact, device drivers should never need to call the CMA API directly, since instead of bus addresses and kernel mappings it operates on pages and page frame numbers (PFNs), and provides no mechanism for maintaining cache coherency.

At boot time CMA reserves a block of memory, called a CMA area or CMA context, and later returns it to the buddy allocator. The reservation is normally made once the memblock allocator has been set up, and typically involves two calls:

void dma_contiguous_reserve(phys_addr_t limit);

void dma_contiguous_early_fixup(phys_addr_t base, unsigned long size); // platform-specific initialization

To allocate CMA memory one uses:

struct page *dma_alloc_from_contiguous(struct device *dev, int count, unsigned int align);

Beware that dma_alloc_from_contiguous() may not be called from atomic context. It performs some “heavy” operations such as page migration, direct reclaim, etc., which may take a while.

CMA operates on contexts. Devices use one global area by default, but private contexts can be used as well; they are declared with:

int dma_declare_contiguous(struct device *dev, unsigned long size, phys_addr_t base, phys_addr_t limit);

Internals: one of the migrate types is MIGRATE_MOVABLE. The idea behind it is that data from a movable page can be migrated (or moved, hence the name), which works well for disk caches, process pages, etc.

To keep pages with the same migrate type together, the buddy allocator groups pages into "pageblocks," each having a migrate type assigned to it. The allocator then tries to allocate pages from pageblocks with a type corresponding to the request. If that's not possible, however, it will take pages from different pageblocks and may even change a pageblock's migrate type. This means that a non-movable page can be allocated from a MIGRATE_MOVABLE pageblock which can also result in that pageblock changing its migrate type. This is undesirable for CMA, so it introduces a MIGRATE_CMA type which has one important property: only movable pages can be allocated from a MIGRATE_CMA pageblock.

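Now suppose that some of these pages have been taken by other subsystems. When a driver needs the memory, the requested range is first marked MIGRATE_ISOLATE so that the buddy allocator will no longer consider it; the pages already in use are then moved elsewhere, and the range can be handed to the driver.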
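The pageblock rules and the isolate-then-migrate step can be pictured with a small user-space toy model. This is a sketch only, not kernel code: the types, the MT_* names, the block sizes, and both functions are invented for illustration, and fallback allocations here never change a pageblock's type, which is a deliberate simplification.

```c
#include <assert.h>
#include <stddef.h>

enum mt { MT_UNMOVABLE, MT_MOVABLE, MT_CMA, MT_ISOLATE };

#define NBLOCKS         4
#define PAGES_PER_BLOCK 4

struct pageblock {
    enum mt type;
    int used;                   /* pages currently allocated */
};

struct zone {
    struct pageblock blk[NBLOCKS];
};

/* May a request of kind `want` be served from a block of type t? */
static int may_serve(enum mt t, enum mt want)
{
    if (t == MT_ISOLATE)
        return 0;                       /* hidden from the allocator */
    if (t == MT_CMA)
        return want == MT_MOVABLE;      /* CMA blocks: movable only */
    return 1;                           /* ordinary blocks allow fallback */
}

/* Allocate one page: first from a block of the requested type, then
 * from any block that may serve it. Returns the block index or -1. */
int alloc_page_model(struct zone *z, enum mt want)
{
    assert(want == MT_MOVABLE || want == MT_UNMOVABLE);
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < NBLOCKS; i++) {
            struct pageblock *b = &z->blk[i];

            if (pass == 0 && b->type != want)
                continue;
            if (!may_serve(b->type, want) || b->used == PAGES_PER_BLOCK)
                continue;
            b->used++;
            return i;
        }
    }
    return -1;
}

/* Claim a CMA block for a driver: isolate it, then migrate every page
 * in use to some other block. Returns 0 on success, -1 otherwise. */
int claim_cma_block(struct zone *z, int idx)
{
    struct pageblock *b = &z->blk[idx];

    if (b->type != MT_CMA)
        return -1;
    b->type = MT_ISOLATE;
    while (b->used > 0) {
        if (alloc_page_model(z, MT_MOVABLE) < 0) {
            b->type = MT_CMA;           /* give up and undo isolation */
            return -1;
        }
        b->used--;                      /* one page moved out */
    }
    return 0;
}
```

In this model an unmovable request is refused rather than placed in a CMA block, which is the property that keeps the block reclaimable later.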

The first set was by Glauber Costa, the author of the related controller. Glauber's patch works at the slab allocator level; only the SLUB allocator is supported at this time. With this approach, developers must explicitly mark a slab cache for usage tracking.

The second set comes from Suleiman Souhlal. Here, too, the slab allocator is the focus point for memory allocation tracking, but this patch works with the "slab" allocator instead of SLUB. One other significant difference with Suleiman's patch is that it tracks allocations from all caches, rather than just those explicitly marked for such tracking. There is a new __GFP_NOACCOUNT flag to explicitly prevent tracking.

The authors of the two approaches currently intend to work together.

2 Statistics for the 3.3 development cycle

It has been an active cycle, with some 10,350 changesets merged from just over 1,200 developers. Some 563,000 lines of code were added to the kernel, but 395,000 lines were removed, for a net growth of about 168,000 lines.

A very technical article and well worth reading; I have not fully digested the proposed solution.

What NMIs are used for:

These non-maskable interrupts are used by tools such as profilers and watchdogs. For profiling, information about where the CPU is spending its time is recorded, and, by ignoring disabled interrupts, the profiler can record time spent with interrupts disabled. If profiling used normal interrupts, it could not report that time. Similarly, a watchdog needs to detect if the kernel is stuck in a location where interrupts were disabled. Again, if a watchdog used normal interrupts, it would not be useful in such situations because it would never trigger when the interrupts were disabled.

Characteristics of NMIs:

Although NMIs can trigger when interrupts are disabled and even when the CPU is processing a normal interrupt, there is a specific time when an NMI will not trigger: when the CPU is processing another NMI. On most architectures, the CPU will not process a second NMI until the first NMI has finished. When an NMI triggers and calls the NMI handler, new NMIs must wait until the handler of the first NMI has completed.

The x86 NMI iret flaw:

If an exception occurs inside an NMI handler, the iret that returns from it re-enables NMIs, which allows NMIs to nest. If the NMI handler triggers either a page fault or breakpoint, the iret used to return from those exceptions will re-enable NMIs. The NMI handler will not be put back to the state that it was at when the exception triggered, but instead will be put back to a state that will allow new NMIs to preempt the running NMI handler. If another NMI comes in, it will jump into code that is not designed for re-entrancy. Even worse, on x86_64, when an NMI triggers, the stack pointer is set to a fixed address (per CPU). If another NMI comes in before the first NMI handler is complete, the new NMI will write all over the preempted NMI's stack. The result is a very nasty crash on return to the original NMI handler. The NMI handler for i386 uses the current kernel stack, like normal interrupts do, and does not have this specific problem.

The immediate restrictions today:

Because of this x86 NMI iret flaw, NMI handlers must neither trigger a page fault nor hit a breakpoint.

A. Memory obtained from vmalloc() cannot be used, since touching it may cause a page fault. If a module were to register an NMI handler callback, that callback could cause the NMI to become re-entrant.

B. As breakpoints also return with an iret, they must not be placed in NMI handlers either. This prevents kprobes from being placed in NMI handlers. Kprobes are used by ftrace, perf, and several other tracing tools to insert dynamic tracepoints into the kernel.

Why stop_machine() should be removed:

In short, a call to stop_machine() stops execution on all other CPUs so that the calling CPU has exclusive access to the entire system. For machines with thousands of CPUs, a single call to stop_machine() can introduce a very large latency. Currently one of the areas that uses stop_machine() is the runtime modification of code.

The Linux kernel has a history of using self-modifying code, that is, code that changes itself at run time. For example, distributions do not like to ship more than one kernel, so self-modifying code is used to change the kernel at boot to optimize it for its environment. Modifying code at boot time is not that difficult, because the code has not started executing yet. Today, there are several utilities in the Linux kernel that modify the code after boot. These modifications can happen at any time, generally due to actions by the system's administrator. The ftrace function tracer can change the nops that are stubbed at the beginning of almost every function into a call to trace those functions.

The difficulty:

Modifying code at run time takes much more care than modifying code during boot. On x86 and some other architectures, if code is modified on one CPU while it is being executed on another CPU, it can generate a General Protection Fault (GPF) on the CPU executing the modified code.

The way to get around this is to call stop_machine(). Being able to modify code without stop_machine() is a very desirable result; one way to do that is to use breakpoints.
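The breakpoint idea can be pictured with a small user-space model. It is only a sketch of one possible three-step ordering (breakpoint first, tail bytes second, first byte last); these notes do not spell out the steps, so the sequence, the function names, and the recording hook are all assumptions made for illustration. Nothing here modifies running code: a byte array stands in for kernel text, and record_sync() stands in for whatever cross-CPU synchronization a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INT3      0xCC          /* x86 one-byte breakpoint opcode */
#define MAX_SYNCS 8

/* Stand-in for "wait until every CPU has seen the change": it only
 * records the first byte another CPU could fetch at that moment. */
static unsigned char seen_first_byte[MAX_SYNCS];
static int sync_calls;

static void record_sync(const unsigned char *text)
{
    assert(sync_calls < MAX_SYNCS);
    seen_first_byte[sync_calls++] = text[0];
}

/* Replace the len-byte instruction at text with new_insn so that the
 * first byte is always the old opcode, a breakpoint, or the new opcode,
 * and never the start of a half-written instruction. */
void patch_with_breakpoint(unsigned char *text,
                           const unsigned char *new_insn, size_t len)
{
    assert(len >= 1);

    text[0] = INT3;                             /* step 1: trap arrivals */
    record_sync(text);

    memcpy(text + 1, new_insn + 1, len - 1);    /* step 2: rewrite tail */
    record_sync(text);

    text[0] = new_insn[0];                      /* step 3: new first byte */
    record_sync(text);
}
```

A scheme of this kind depends on the breakpoint handler being safe to hit wherever the patched code may run, which is where the NMI restriction on breakpoints described earlier becomes relevant.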

How to handle nested NMIs:

Linus's proposal does not work as it stands, because the iret of all exceptions, including NMIs, already has a fault handler: iret itself can fault, something I had not realized before.
