lwn.net kernel news 2011/6 - baozhao - ChinaUnix blog

原上草baozhao.blog.chinaunix.net


lwn.net kernel news 2011/6

Category: LINUX

2011-10-26 11:49:42

On ioctl(): the error-reporting side of the API is much simpler, though; if something goes wrong, the application is almost certain to get EINVAL back. That error can be trying to tell user space that the device is in the wrong state, that some parameter was out of range, or, simply, that the requested command has not been implemented.

However, if an ioctl() command has not been implemented, the kernel should return ENOTTY.
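A minimal user-space sketch of why the distinction matters: with ENOTTY reserved for "not implemented", a caller can tell that case apart from a genuinely bad argument. The helper name describe_ioctl_errno() is hypothetical and not part of any kernel or libc API.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map the errno left behind by a failed ioctl()
 * to a short diagnosis. Only ENOTTY and EINVAL get special treatment. */
const char *describe_ioctl_errno(int err)
{
    switch (err) {
    case ENOTTY:
        /* The convention described above: command not implemented. */
        return "command not implemented by this device";
    case EINVAL:
        /* Still ambiguous: several different failures share this code. */
        return "invalid argument, value out of range, or wrong device state";
    default:
        return strerror(err);
    }
}
```

A caller would invoke describe_ioctl_errno(errno) right after ioctl() returns -1.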

Background:

As a result of the high clock rates used, PCI-Express devices can take a lot of power even when they are idle. "Active state power management" (ASPM) was developed as a means for putting those peripherals into a lower power state when it seems that there may be little need for them. ASPM can save power, but the usual tradeoff applies: a device which is in a reduced power state will not be immediately available for use. So, on systems where ASPM is in use, access to devices can sometimes take noticeably longer if those devices have been powered down. In some situations (usually those involving batteries) this tradeoff may be acceptable; in others it is not. So, like most power management mechanisms, ASPM can be turned on or off.

Problem: some BIOSes tell the operating system that ASPM is not supported yet still enable ASPM on some hardware, which leads to system crashes. Matthew Garrett committed a change that clears ASPM state if the FADT indicates that ASPM isn't supported, but power consumption rose sharply on some hardware as a result.

Changes driven by the SoC world

Power domains

The 2.6 approach:

The device model captures the connection topology of the system; this information can be used to power devices up and down in a reasonable order.

The new situation:

On newer systems, though, there are likely to be dependencies between subsystems that are not visible in the bus topology. A set of otherwise unrelated devices may share the same clock or power lines, meaning that they can only be powered up or down as a group. Different SoC designs may feature combinations of the same controllers with different power connections.

The change: Rafael Wysocki's patch set. Power domains are hierarchical, though the hierarchy may differ from the bus hierarchy. So each power domain has a parent domain (parent), a list of sibling domains (sd_node), and a list of child domains (sd_list); there is also, naturally, a list of devices contained within the domain (dev_list).
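The hierarchy described above can be sketched in plain C. This is only an illustration: the field names parent, sd_node, sd_list and dev_list come from the notes, but here they are simple pointers rather than kernel list links, and struct pm_device plus all the function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for a device that lives in a power domain. */
struct pm_device {
    const char *name;
    struct pm_device *next;          /* next device in the same domain */
};

/* Simplified power domain: plain pointers instead of kernel lists. */
struct power_domain {
    const char *name;
    struct power_domain *parent;     /* enclosing domain, NULL at the top */
    struct power_domain *sd_node;    /* next sibling under the same parent */
    struct power_domain *sd_list;    /* first child domain */
    struct pm_device *dev_list;      /* devices contained in this domain */
};

/* Link child under parent; the tree need not follow the bus topology. */
void domain_add_child(struct power_domain *parent, struct power_domain *child)
{
    child->parent = parent;
    child->sd_node = parent->sd_list;
    parent->sd_list = child;
}

void domain_add_device(struct power_domain *d, struct pm_device *dev)
{
    dev->next = d->dev_list;
    d->dev_list = dev;
}

/* Count the devices in a domain and in every domain nested below it. */
size_t domain_count_devices(const struct power_domain *d)
{
    size_t n = 0;

    for (const struct pm_device *dev = d->dev_list; dev; dev = dev->next)
        n++;
    for (const struct power_domain *c = d->sd_list; c; c = c->sd_node)
        n += domain_count_devices(c);
    return n;
}
```

Walking the tree this way shows how many devices a single power-domain decision would touch, which the bus hierarchy alone would not reveal.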

Asymmetric multiprocessing

The new situation:

OMAP4, for example, has dual Cortex-A9, dual Cortex-M3 and a C64x+ DSP. Typically, the dual cortex-A9 is running Linux in a SMP configuration, and each of the other three cores (two M3 cores and a DSP) is running its own instance of RTOS in an AMP configuration.

Asymmetric multiprocessing (AMP) is what you get when a system consists of unequal processors running different operating systems. It could be thought of as a form of (very) local-area networking, but all of those cores sit on the same die and share access to memory, I/O controllers, and more. This type of processor isn't simply "running Linux"; instead, it has Linux running on some processors trying to shepherd a mixed collection of operating systems on a variety of CPUs.

How it is handled:

A recent patch set is an attempt to create a structure within which Linux can direct a processor of this type. It starts with a framework called "remoteproc" that allows the registration of "remote" processors. Through this framework, the kernel can power those processors up and down and manage the loading of firmware for them to run.

Once the remote processor is running, the kernel needs to be able to communicate with it. To that end, the patch set creates the concept of "channels" which can be used to pass messages between processors. These messages go through a ring buffer stored in memory visible to both processors; is used to implement the rings.
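To make the "ring buffer in shared memory" idea concrete, here is a small single-producer, single-consumer ring of fixed-size messages. It is a sketch under stated assumptions, not the patch set's implementation: the names, the slot count and the message size are invented, and a real ring shared between processors would also need memory barriers and a notification mechanism.

```c
#include <stdbool.h>
#include <string.h>

#define RING_SLOTS 4     /* assumed capacity; must be a power of two */
#define MSG_BYTES  32    /* assumed fixed message size */

struct msg_ring {
    unsigned int head;   /* total messages written by the producer */
    unsigned int tail;   /* total messages read by the consumer */
    char slots[RING_SLOTS][MSG_BYTES];
};

/* Producer side: copy one NUL-terminated message in, or fail if full. */
bool ring_send(struct msg_ring *r, const char *msg)
{
    char *slot;

    if (r->head - r->tail == RING_SLOTS)
        return false;
    slot = r->slots[r->head % RING_SLOTS];
    strncpy(slot, msg, MSG_BYTES - 1);
    slot[MSG_BYTES - 1] = '\0';
    r->head++;
    return true;
}

/* Consumer side: copy the oldest message out, or fail if empty.
 * out must have room for MSG_BYTES bytes. */
bool ring_recv(struct msg_ring *r, char *out)
{
    if (r->head == r->tail)
        return false;
    memcpy(out, r->slots[r->tail % RING_SLOTS], MSG_BYTES);
    r->tail++;
    return true;
}
```

Only the producer writes head and only the consumer writes tail, which is what lets each side run on a different processor without a shared lock.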

Security issue:

Various kernel log messages allow user-controlled strings to be placed into the messages via the "%s" format specifier, which could be used by an attacker to potentially confuse administrators by inserting control characters into the strings.

The problem stems from the idea that administrators will often use tools like tail and more to view log files on a TTY. If a user can insert control characters (and, in particular, escape sequences) into the log file, they could potentially cause important information to be overlooked—or cause other kinds of confusion.

So Vasiliy Kulikov has proposed a patch that would escape certain characters that appear in those strings, using a whitelisting strategy. Ingo, however, objected.

The whitelisting vs. blacklisting debate

In general, for user-supplied data (in web applications for example), the consensus has been to whitelist known-good input, rather than attempting to determine all of the "bad" input to exclude.

It often comes down to a choice between more security (whitelisting typically) or more usability (blacklisting).
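A whitelisting escape routine can be sketched as follows. The exact character set the proposed patch allowed is not given in these notes, so the whitelist here (printable ASCII except the backslash) and the \xNN output format are assumptions, and escape_for_log() is a hypothetical name.

```c
#include <stdio.h>
#include <stddef.h>

/* Copy src into dst, passing through only whitelisted bytes and writing
 * every other byte as \xNN. Returns the number of bytes written, not
 * counting the terminating NUL. Output is truncated if dst is too small. */
size_t escape_for_log(char *dst, size_t dst_size, const char *src)
{
    size_t out = 0;

    if (dst_size == 0)
        return 0;
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        /* Assumed whitelist: printable ASCII, minus the escape character. */
        int allowed = (c >= 0x20 && c <= 0x7e && c != '\\');
        size_t need = allowed ? 1 : 4;

        if (out + need >= dst_size)
            break;                   /* never emit a partial escape */
        if (allowed) {
            dst[out++] = (char)c;
        } else {
            snprintf(dst + out, 5, "\\x%02x", (unsigned int)c);
            out += 4;
        }
    }
    dst[out] = '\0';
    return out;
}
```

Because anything not explicitly allowed is escaped, a terminal escape sequence such as ESC [ 2 J reaches the log as harmless visible text; a blacklist would have to anticipate each dangerous byte in advance.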

1
and with notes from most of the talks

2

Issue:

Recently, the kernel registers block devices in parallel. As a result, different device names will be assigned at each boot time. This will confuse the file-system mounter, thus we usually use persistent symbolic links provided by udev. However, dmesg and procfs outputs show device names instead of the symbolic link names. This causes a serious problem when managing multiple devices (e.g. on large-scale storage), because usually, device errors are output with device names in dmesg. We are also concerned about some commands which output device names, as well as kernel messages.

Device names, particularly for disks, can be confusing to Linux administrators because they get assigned at boot time based on the order in which the disks are discovered. So the same physical disk can be assigned a different device name (in /dev) on each boot, which means that kernel log messages and the output of various utilities may not correspond with the administrator's view of the system. A recent proposal looks to change that situation, but it is meeting some resistance from kernel hackers who think it should be handled by user space.

3 The platform device API
A very brief introduction.
Background:
In the embedded and system-on-chip world, non-discoverable devices are increasing in number. So the kernel still needs to provide ways to be told about the hardware that is actually present. "Platform devices" have long been used in this role in the kernel.

API:

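A driver calls platform_driver_register() to register a struct platform_driver, which must provide at least the probe() and remove() methods.
The kernel creates a static platform_device structure; the name it provides must match the platform_driver's name.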
Once both a platform device and an associated driver have been registered, the driver's probe() function will be called and the device will be instantiated. Registration of device and driver are usually done in different places and can happen in either order.
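The name-matching and "either order" behaviour can be modelled in user space. This is a toy model, not the kernel's platform bus: struct pdev, struct pdrv, the model_*_register() functions and demo_probe() are all hypothetical names made up for the sketch.

```c
#include <stddef.h>
#include <string.h>

struct pdev { const char *name; int bound; };
struct pdrv { const char *name; int (*probe)(struct pdev *dev); };

#define MAX_ENTRIES 8
static struct pdev *devices[MAX_ENTRIES];
static struct pdrv *drivers[MAX_ENTRIES];
static size_t ndevices, ndrivers;

/* Bind when the names match and probe() reports success (0). */
static void try_bind(struct pdev *dev, struct pdrv *drv)
{
    if (!dev->bound && strcmp(dev->name, drv->name) == 0 &&
        drv->probe(dev) == 0)
        dev->bound = 1;
}

/* Registering a device checks it against every driver seen so far. */
int model_device_register(struct pdev *dev)
{
    if (ndevices == MAX_ENTRIES)
        return -1;
    devices[ndevices++] = dev;
    for (size_t i = 0; i < ndrivers; i++)
        try_bind(dev, drivers[i]);
    return 0;
}

/* Registering a driver checks it against every device seen so far. */
int model_driver_register(struct pdrv *drv)
{
    if (ndrivers == MAX_ENTRIES)
        return -1;
    drivers[ndrivers++] = drv;
    for (size_t i = 0; i < ndevices; i++)
        try_bind(devices[i], drv);
    return 0;
}

/* Example probe routine that only counts how often it was called. */
static int probe_calls;
int demo_probe(struct pdev *dev)
{
    (void)dev;
    probe_calls++;
    return 0;
}
```

Since both registration paths run the same matching step, probe() fires once a matching pair exists, whichever side showed up first.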

4 Platform devices and device trees
A device tree is a textual description of a specific system's hardware configuration. The device tree is passed to the kernel at boot time; the kernel then reads through it to learn about what kind of system it is actually running on. With luck, device trees will abstract the differences between systems into boot-time data and allow generic kernels to run on a much wider variety of hardware.

In summary: making platform drivers work with device trees is a relatively straightforward task. It is mostly a matter of getting the right names in place so that the binding between a device tree node and the driver can be made, with a bit of additional work required in cases where platform data is in use. The nice result is that the static platform_device declarations can go away, along with the board files that contain them. That should, eventually, allow the removal of a bunch of boilerplate code from the kernel while simultaneously making the kernel more flexible.

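New users of large contiguous memory blocks today: video capture engines and huge pages.

DMA buffers present some different requirements than huge pages: DMA buffers can be larger, up to 10MB, while huge pages must be aligned to 2MB and DMA buffers have weaker alignment requirements.

The CMA patch set looks more likely to be merged than earlier attempts.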

To that end, CMA relies on the "migration type" mechanism built deeply into the memory management code. Within each zone, blocks of pages are marked as being for use by pages which are (or are not) movable or reclaimable. Movable pages are, primarily, page cache or anonymous memory pages; they are accessed via page tables and the page cache radix tree. The contents of such pages can be moved somewhere else as long as the tables and tree are updated accordingly. Reclaimable pages, instead, might possibly be given back to the kernel on demand; they hold data structures like the inode cache. Unmovable pages are usually those for which the kernel has direct pointers; memory obtained from kmalloc() cannot normally be moved without breaking things, for example. The memory management subsystem tries to keep movable pages together.

CMA extends this mechanism by adding a new "CMA" migration type; it works much like the "movable" type, but with a couple of differences. The "CMA" type is sticky; pages which are marked as being for CMA should never have their migration type changed by the kernel. The memory allocator will never allocate unmovable pages from a CMA area, and, for any other use, it only allocates CMA pages when alternatives are not available. So, with luck, the areas of memory which are marked for use by CMA should contain only movable pages, and it should have a relatively high number of free pages.

Users:

Union filesystems allow multiple filesystems to be combined and presented to the user as a single tree. In typical use, a writable filesystem is overlaid on top of a read-only base, creating the illusion that all files on the filesystem can be changed. This mode of operation is useful for live CD distributions, embedded systems where a quick "factory reset" capability is desired, virtualized systems built on a common base filesystem, and more.

Outlook: overlayfs has a good chance of being merged, but now is not the time.

An introduction to usage

Background:

Video4Linux2 drivers are charged with the task of acquiring video data from a sensor (via some sort of DMA controller, usually) and transferring those video frames to user space. The amount of data being moved makes performance a consideration; to that end, V4L2 has defined an API to handle streaming data. Implementing this API adds a certain amount of complexity to V4L2 drivers, but much of that complexity is the same from one driver to the next. To make life easier for driver writers (and their users), the "videobuf" subsystem was created to handle many of the details of streaming I/O buffer management.

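Shortcomings of the recent patches: the snapshot feature does not currently work with all variants of the ext4 on-disk format, and there is concern about whether it will affect ext4's stability.

Ext4 maintainer Ted Ts'o acknowledged that technical concerns are not the sole driver of feature-merging decisions; users' needs also have to be served, so the patch is quite likely to be merged.

The "vsyscall" and "vDSO" segments are two mechanisms used to accelerate certain system calls in Linux. While their basic function (provide fast access to functionality which does not need to run in kernel mode) is the same, there are some distinct differences between them. The vDSO is mapped at an address that changes dynamically, whereas the vsyscall page sits at a fixed address, which gives attackers an opening.

Andrew Lutomirski's work is meant to address exactly this problem, at the cost of some performance; in the long run, vsyscall will fade away.

New memory technology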

One technology which is finding its way into some systems is called "partial array self refresh" or PASR. On a PASR-enabled system, memory is divided into banks, each of which can be powered down independently. If (say) half of memory is not needed, that memory (and its self-refresh mechanism) can be turned off; the result is a reduction in power use, but also the loss of any data stored in the affected banks. The amount of power actually saved is a bit unclear; estimates seem to run in the range of 5-15% of the total power used by the memory subsystem.

4 (important)

Data inheritance

a concrete or "final" type inherits some data fields from a "virtual" parent type. We will call this "data inheritance" to emphasize the fact that it is the data rather than the behavior that is being inherited.

Put another way, a number of different implementations of a particular interface share, and separately extend, a common data structure. They can be said to inherit from that data structure. There are three forms:

Extension through unions

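Early versions of struct inode contained the following member:

    union {
        struct minix_inode_info minix_i;
        struct ext_inode_info ext_i;
        struct msdos_inode_info msdos_i;
    } u;

Drawbacks: adding a new filesystem means modifying struct inode, and the union is always as large as its largest member, which wastes space.

Still, when the set of union members is limited and does not need to grow (a steady stream of new filesystems is exactly the wrong case), this approach remains a reasonable choice.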

Embedded structures

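Later the layout changed: the filesystem-specific fields of ext3_inode_info are followed by an embedded vfs_inode. This wastes no space and is easy to extend; the drawback is that each kind of object needs its own allocation function, which is why ext3_alloc_inode exists.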
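The embedding pattern can be sketched in user space. The names below (demo_inode_info, DEMO_I, demo_alloc_inode, and the cut-down struct vfs_inode) are hypothetical stand-ins that mirror the shape of the ext3 code rather than reproduce it; container_of is defined locally in a simplified form.

```c
#include <stddef.h>
#include <stdlib.h>

/* Cut-down stand-in for the generic inode. */
struct vfs_inode {
    unsigned long i_ino;
};

/* Hypothetical filesystem-private inode: private fields first,
 * then the generic part embedded by value. */
struct demo_inode_info {
    unsigned int fs_flags;
    struct vfs_inode vfs_inode;
};

/* Simplified container_of: recover the outer structure from a pointer
 * to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The filesystem allocates the whole object and hands out only the
 * generic part, which is why each filesystem needs its own allocator. */
struct vfs_inode *demo_alloc_inode(void)
{
    struct demo_inode_info *info = calloc(1, sizeof(*info));

    return info ? &info->vfs_inode : NULL;
}

/* Step from the generic inode back to the private wrapper. */
struct demo_inode_info *DEMO_I(struct vfs_inode *inode)
{
    return container_of(inode, struct demo_inode_info, vfs_inode);
}

void demo_destroy_inode(struct vfs_inode *inode)
{
    free(DEMO_I(inode));
}
```

Generic code only ever sees struct vfs_inode pointers; the filesystem gets back to its own data with pointer arithmetic and no extra pointer field.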

Also: The use of an embedded anchor like struct list_head can be seen as a style of inheritance as the structure containing it "is-a" member of a list by virtue of inheriting from struct list_head. However it is not a strict subtype as a single object can have several struct list_heads embedded - struct inode has six (if we include the similar hlist_node). So it is probably best to think of this sort of embedding more like a "mixin" style of inheritance.

Void pointers

struct inode has a generic_ip member that can point to an extension object of any kind.

The drawback of void * is similar to that of goto: you have to look at the context to know whether it is being used for inheritance.

When a void pointer is being used, it may not be obvious whether it points to an extension structure for data inheritance or is itself being used as the extension, that is, whether the void pointer is playing the role of a union of many other pointer types (or being used as something else altogether).

struct page is another interesting example.

Summary:

In exploring the use of method dispatch (last week) and data inheritance (this week) in the Linux kernel we find that while some patterns seem to dominate they are by no means universal. While almost all data inheritance could be implemented using structure embedding, unions provide real value in a few specific cases. Similarly while simple vtables are common, mixin vtables are very important and the ability to delegate methods to a related object can be valuable.

We also find that there are patterns in use with little to recommend them. Using void pointers for inheritance may have an initial simplicity, but causes longer term wastage, can cause confusion, and could nearly always be replaced by embedded inheritance. Using NULL pointers to indicate default behavior is similarly a poor choice - when the default is important there are better ways to provide for it.

Matthew Garrett lists five different mechanisms to reboot 64-bit x86 hardware including: "kbd - reboot via the keyboard controller. The original IBM PC had the CPU reset line tied to the keyboard controller. Writing the appropriate magic value pulses the line and the machine resets."

Modern PCs have no discrete keyboard controller (they're actually part of the embedded controller), so it is emulated in software; the problem is that the emulation is not 100% (or even at all) compatible with the old behavior.

See the referenced text, which is currently still a draft.

- The patch set that includes the setns() system call has been merged. This feature makes it easier to manage containers running in different namespaces.

- The XFS filesystem now has online discard support.

- The cleancache functionality has been merged. Cleancache allows for intermediate storage of pages which have been pushed out of the page cache but which might still be useful in the future. Cleancache is initially supported by ext3, ext4, and ocfs2.

- A new netlink-based infrastructure allows the management of RDMA clients.

- It is now possible to move all threads in a group into a control group at once using the cgroup.procs control file.

- The Blackfin architecture has gained perf events support.

- The btrfs filesystem has gained support for an administrator-initiated "scrub" operation that can read through a filesystem's blocks and verify checksums. When possible, bad copies of data will be replaced by good copies from another storage device. Also supported by btrfs is an auto_defrag mount option causing the filesystem to notice random writes to files and schedule them for defragmentation.

- The no-hlt boot parameter has been deprecated; no machines have needed it in this millennium. Should there be any machines with non-working HLT instructions running current kernels, they can be booted with idle=poll.

- Support for the pNFS protocol backed by object storage devices has been added.

- There is a new core support module for GPIO controllers based on memory-mapped I/O.

- There is a new atomic_or() operation to perform a logical OR operation on an atomic_t value.
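For readers who want to see what an atomic OR amounts to, here is a user-space analogue built on C11 atomics. It is not the kernel implementation: demo_atomic_t and the demo_* functions are hypothetical, and the (mask, pointer) argument order is an assumption modelled on other kernel atomic helpers.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-in for the kernel's atomic_t. */
typedef struct {
    atomic_int counter;
} demo_atomic_t;

/* Atomically set the bits in mask; no other thread can observe or
 * clobber an intermediate value between the read and the write. */
void demo_atomic_or(int mask, demo_atomic_t *v)
{
    atomic_fetch_or(&v->counter, mask);
}

int demo_atomic_read(demo_atomic_t *v)
{
    return atomic_load(&v->counter);
}
```

The point of such an operation is that a plain "v |= mask" is a read-modify-write sequence that can lose updates when two CPUs run it at the same time.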

4
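Thomas Gleixner's long essay on the ARM ecosystem; if you have not read it yet, you really should.
Thomas criticizes the idea of forking ARM, points out how the open-source community's development model differs from the proprietary one, and calls on embedded vendors to embrace the open-source model.

One important fact here is that much of the hardware vendors bill as "brand new" is nothing of the sort: a large share of the hardware design already exists, such as most of an SoC's IP blocks, so a large amount of platform-specific code can be reused. The embedded industry often reuses hardware. Why not reuse software, too?

He first takes apart several of the reasons given for forking the kernel: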

Time to market

The advantage of reuse when working in mainline is obvious, and in practice most projects have even more time for upstreaming.

The one-off nature of embedded

but it is also a fact that the variations of a given SoC family have a lot in common and differ only in small details.

The SoC diversity

but if you look at the SoC data sheets, the number of unique peripheral building blocks is not excitingly large.
The diversity is often limited to a different arrangement of registers or the fact that one vendor chooses a different subset of functionality than the other.

Avoiding the useless work (avoids the bottleneck of maintainers and useless extra work in response to reviews)

Spending a bit of time reviewing other people's code is a very beneficial undertaking as it opens one's mind to different approaches and helps to better understand the overall picture. On the other side, getting code reviewed by others is beneficial as well and, in general, leads to better and more maintainable code.

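He then discusses the consequences of forking: no sustainable maintenance or development, and second-class citizenship in the Linux community.

How to improve the current situation: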

For over twenty years the industry dealt with closed source operating systems where were impossible and collaboration with competitors was unthinkable and unworkable.
For the open-source community: the various SoCs must be consolidated. Such consolidation requires cooperation not only across the ARM vendors, it requires a collaborative effort across many parts of the mainline kernel along with the input of maintainers and developers who are not necessarily part of the ARM universe.
We need to encourage developers to first look to see whether existing code might be refactored to fit the new device instead of blindly copying the closest matching driver.

Another obstacle is secrecy. Competing implementations are not a bad thing per se, but the inability to exchange information and discuss design variants is not helping anyone in the "time to market" race. (For example, several SoC teams are each developing USB 3.0 support but cannot talk to each other or develop it jointly.)

5 (important)

A long article by Neil Brown, well worth reading. Amusingly, the comments that followed set off yet another C vs. C++ debate.

Method Dispatch Summary

If we combine all the pattern elements that we have found in Linux we find that:

1. The most common form: method pointers that operate on a particular type of object are normally collected in a vtable associated directly with that object.

2. In a mixin vtable that collects related functionality which may be selectable independently of the base type of the object. (The same object has several sets of operations.)

3. In the vtable for a "parent" object when doing so avoids the need for a vtable pointer in a populous object. (For example, the operations on a page live in address_space_operations.)

4. Directly in the object when there are few method pointers, or they need to be individually tailored to the particular object.

These vtables rarely contain anything other than function pointers, though fields needed to register the object class can be appropriate (for example module, name, list, and so on). Allowing these function pointers to be NULL is a common but not necessarily ideal technique for handling defaults.

The benefit of a vtable is that it saves space (when many objects share the same set of operations; if every object has its own set of operations there is clearly no gain, see case 4), at the cost of one extra level of indirection.

Why some function pointers in some vtables are allowed to be NULL:

- Incremental development. It is thus possible to add a caller of the new method before any instance supports that method, and have it check for NULL and perform a default behavior.

- Another common reason is that certain methods are not particularly meaningful in certain cases so the calling code simply tests for NULL and returns an appropriate error when found.

- A final reason that vtables sometimes contain NULL is that an element of functionality might be being transitioned from one interface to another, with the new interface replacing the old one.
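The first two NULL cases above can be shown with a small vtable sketch. All names here (demo_file, demo_file_ops, demo_flush, demo_seek) are hypothetical, and returning -ESPIPE for a missing seek method is just one plausible choice of error.

```c
#include <stddef.h>
#include <errno.h>

struct demo_file;

/* One shared table of method pointers per object type. */
struct demo_file_ops {
    int (*read)(struct demo_file *f);
    int (*flush)(struct demo_file *f);           /* NULL: nothing to do */
    int (*seek)(struct demo_file *f, long off);  /* NULL: not supported */
};

struct demo_file {
    const struct demo_file_ops *ops;
    int value;
};

/* Missing method treated as a default behaviour: succeed silently. */
int demo_flush(struct demo_file *f)
{
    if (!f->ops->flush)
        return 0;
    return f->ops->flush(f);
}

/* Missing method treated as "not meaningful here": return an error. */
int demo_seek(struct demo_file *f, long off)
{
    if (!f->ops->seek)
        return -ESPIPE;
    return f->ops->seek(f, off);
}

/* An object type that implements read and leaves the rest NULL. */
static int pipe_read(struct demo_file *f)
{
    return f->value;
}

const struct demo_file_ops demo_pipe_ops = {
    .read = pipe_read,
};
```

Every caller has to go through wrappers like demo_flush() for this to be safe, which is part of why the article calls NULL defaults common but not necessarily ideal.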
