lwn.net kernel news 2011/8 - baozhao - ChinaUnix blog


lwn.net kernel news 2011/8

Category: LINUX

2011-11-08 21:21:03

Videos of three talks at the recently concluded Linux wireless summit have been posted. These talks cover the implementation of , 802.11s mesh networking, and mesh network testing with .

NLKT is an alternative to QEMU. Despite its simplicity, NLKT offers "just works" networking, SMP support, basic graphics support, copy-on-write block device access, host filesystem access with 9P or overlayfs, and more. It has developed quickly and is, arguably, the easiest way to get a Linux kernel running on a virtualized system.

It faces the same problem as the perf tools, except that it has not been merged yet.

3 The udev tail wags the dog

The udev developers have refused to use an interface provided by the kernel.

A user-space project refusing to use a kernel-provided interface in the hope of forcing the creation of something better is a rather less common event. That is exactly what is happening with the udev project's approach to device tree information, though; the result could be a rethinking of how that information gets to applications.


What are platform drivers?

Platform drivers are "bits of hardware support code" that are required to make all of the different pieces of modern hardware function with Linux. Today's hardware is not the PC of old and it requires code to make things work, especially for mobile devices.

What they cover: keyboards, radio control, ambient light sensors ("everyone wants the brightness to change when someone walks behind them"), extended battery information (using identical battery controller chips, with the interface implemented differently on each one), hard drive protection (which always uses the same accelerometer device), backlight control, CPU temperature, fan control, LEDs (e.g. a "you have mail" indicator, that is "not really useful" but is exposed "for people who don't have anything better to do with their lives"), and more all need these drivers.

Consistent interfaces

"hotkeys" are sent through the input system, "keys are keys". Backlight control is done via the backlight class. Radio control is handled with rfkill, thermal and fan state via hwmon, and the LED control using the led class.

There are two areas that still have inconsistent interfaces, Garrett said. The hard drive protection feature that is meant to park the disk heads when an untoward acceleration is detected (e.g. the laptop is dropped) does not have a consistent kernel interface. Also, the .

For the relationship between ACPI and WMI, see below.
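As a rough illustration of what a consistent interface buys user space, the sketch below reads a backlight level through the backlight class in sysfs. The brightness and max_brightness attribute files are the standard ones for that class; the function name and the choice to pass the device directory as a parameter are illustrative assumptions, not something taken from the talk.

```python
from pathlib import Path


def backlight_percent(device_dir):
    """Return the current backlight level as a percentage (0-100).

    device_dir is a backlight class directory, typically something like
    /sys/class/backlight/<device>; the device name varies by machine.
    """
    base = Path(device_dir)
    current = int((base / "brightness").read_text().strip())
    maximum = int((base / "max_brightness").read_text().strip())
    if maximum <= 0:
        raise ValueError("max_brightness must be positive")
    return 100.0 * current / maximum
```

Because every backlight driver exposes the same two files, the same function should work regardless of which platform driver sits underneath.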

1 Sharing buffers between devices

This is still a prototype. The goal: many devices can access memory directly, so device A and device B can communicate directly (sharing buffers in kernel mode) without going through a user-space buffer. For example, an image frame captured from a camera device can often be passed directly to the graphics processor for display without all of the user-space processing that was once necessary.

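How does it work?

(1) A new Video4Linux2 ioctl() command (VIDIOC_EXPBUF) enables the exporting of buffers as file descriptors.

(2) The user-space program then passes the file descriptor to another device, B.

(3) Device B calls shrbuf_import(int fd) to turn the file descriptor back into a shared buffer, so no user-space buffer is involved.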
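The idea of using a file descriptor as a handle to a shared buffer can be tried in user space. The sketch below is only an analogy, assuming a Unix system: it does not call VIDIOC_EXPBUF or shrbuf_import(); a temporary file stands in for the exported buffer, and export_buffer, import_buffer, share_frame, and FRAME_SIZE are hypothetical names.

```python
import mmap
import os
import tempfile

FRAME_SIZE = 4096  # hypothetical buffer size in bytes


def export_buffer(size):
    """Stand-in for 'device A': allocate a buffer, return it as an fd."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(size)
        return os.dup(backing.fileno())  # the dup outlives the file object


def import_buffer(fd, size):
    """Stand-in for 'device B': map the buffer behind a received fd."""
    return mmap.mmap(fd, size)  # a shared mapping by default on Unix


def share_frame(payload):
    """Write through one mapping, read the same bytes through the other."""
    fd = export_buffer(FRAME_SIZE)
    try:
        producer = import_buffer(fd, FRAME_SIZE)
        consumer = import_buffer(fd, FRAME_SIZE)
        producer[:len(payload)] = payload
        seen = bytes(consumer[:len(payload)])
        producer.close()
        consumer.close()
        return seen
    finally:
        os.close(fd)
```

Both mappings refer to the same pages, so the payload is never copied into a separate intermediate buffer; only the descriptor changes hands.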

To support multiple platforms, some hardware vendors' drivers include an OS abstraction layer, but this runs against the development philosophy of the Linux community. "All problems in computer science can be solved by another level of indirection." However, when the problem is developing a device driver for acceptance into the current mainline Linux kernel, OS abstraction (using a level of indirection to hide a kernel's internal API) is taking things a level too far.

The fundamental problem with OS abstraction techniques is that they actively defeat the purpose of having an open driver in the first place.

See the summary at http://blog.chinaunix.net/space.php?uid=1858380&do=blog&id=2942364

2 (Important)


Background:

The traditional ptrace() API calls for a tracing program to attach to a target process with the PTRACE_ATTACH command; that command puts the target into a traced state and stops it in its tracks. PTRACE_ATTACH has never been perfect; it changes the target's signal handling and can never be entirely transparent to the target. So Tejun supplemented it with a new PTRACE_SEIZE command; PTRACE_SEIZE attaches to the target but does not stop it or change its signal handling in any way. Stopping a seized process is done with PTRACE_INTERRUPT which, again, does not send any signals or make any signal handling changes. The result is a mechanism which enables the manipulation of processes in a more transparent, less disruptive way.

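A new use of ptrace:

In user space, solve a difficult problem faced by checkpoint/restart implementations: capturing and restoring the state of network connections.

How it works: use ptrace to take control of the process, replace part of its executable segment with your own code (restoring the original afterwards), and then perform the following operations.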

A TCP connection can be snapshotted using the following sequence.

s1. Seize target process and inject a parasite thread.

s2. Acquire basic target socket information - IPs and ports.

s3. Block both incoming and outgoing packets belonging to the connection.

s4. Acquire rx queue information - the sequence number of the next byte to be read and the content of recv buffer. The former is available through SIOCGINSEQ and the latter with recvmsg(2) w/ MSG_PEEK.

s5. Acquire tx queue information - the sequence numbers of all pending packets and the content of send buffer. The former is available through SIOCGOUTSEQS and the latter SIOCPEEKOUTQ.
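Of the snapshot steps, the MSG_PEEK part of s4 is the one that can be tried with an existing interface. The sketch below shows that peeking at a receive buffer leaves its contents in place. It uses a local socket pair rather than a TCP connection, and peek_then_read is a hypothetical helper name; the ioctls named in the steps above are not used here.

```python
import socket


def peek_then_read(payload):
    """Peek at pending data, then read it normally; both see the same bytes."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(payload)
        # MSG_PEEK copies data out of the receive queue without removing it.
        peeked = receiver.recv(len(payload), socket.MSG_PEEK)
        # A normal recv() afterwards still finds the data in the queue.
        consumed = receiver.recv(len(payload))
        return peeked, consumed
    finally:
        sender.close()
        receiver.close()
```

This is why the receive side of the snapshot can be taken without disturbing the target: the peek has no lasting effect on the queue.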

None of the above steps has irreversible side effect and the connection can be safely resumed. To restore the connection, the following steps can be used.

r1. Packets for the connection are still blocked from s3. Create a way to intercept those packets and inject packets - nf_queue works for the former and raw socket for the latter. It should drop all packets other than the ones injected via raw socket.

r2. Create a TCP socket, set outgoing sequence with SIOCSOUTSEQ so that it matches the sequence number at the head of the stored send queue, and initiate connection.

r3. Upon intercepting SYN, inject SYN/ACK with the sequence number matching the head of the stored rx queue.

r4. Upon intercepting ACK reply for SYN/ACK, repopulate the rx queue from the stored copy by injecting data packets and waiting for ACKs.

r5. Repopulate tx queue with send(2) with interleaving SIOCFORCEOUTBD calls to preserve the original packet boundaries.

r6. Connection is ready now. Let the packets pass through.

In the CP/M days the BIOS lived on disk; IBM sealed it into ROM. One of the reasons EFI came about: Hard drives still typically have 512 byte sectors, and the MBR partition table used by BIOSes stores sectors in 32-bit variables. Partitions above 2TB? Not really happening.
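The 2TB figure follows directly from the two numbers quoted above: a 32-bit sector field can address 2^32 sectors, and each sector holds 512 bytes. A small worked calculation (the function and constant names are illustrative):

```python
SECTOR_SIZE = 512  # bytes per sector, as stated above
LBA_BITS = 32      # width of the sector fields in the MBR partition table


def mbr_limit_bytes(sector_size=SECTOR_SIZE, lba_bits=LBA_BITS):
    """Largest size addressable with lba_bits-wide sector numbers."""
    return (1 << lba_bits) * sector_size


def to_tib(nbytes):
    """Convert a byte count to binary terabytes (TiB)."""
    return nbytes / (1 << 40)
```

With the default values, to_tib(mbr_limit_bytes()) evaluates to 2.0. Larger sectors raise the ceiling (4096-byte sectors would give 16 TiB), but the 32-bit field remains the underlying constraint.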

Two other firmware interfaces:

The ARC standard that appeared on various MIPS and Alpha platforms, and Open Firmware, common on PowerPC and SPARCs.

EFI is intended to fulfill the same role as the old PC BIOS. It's a specification that's 2,210 pages long and still depends on the additional 727 pages of the ACPI spec and numerous ancillary EFI specs.

EFI is divided into two layers:

(1) At the lowest level is the Pre-EFI Initialization (PEI) code, whose job it is to handle setting up the low-level hardware such as the memory controller. As the entry point to the firmware, the PEI layer also handles the first stages of resume from S3 sleep. Developers do not need to care about this layer.

(2) PEI then transfers control to the Driver Execution Environment (DXE). The DXE layer is what's mostly thought of as EFI. It's a hardware-agnostic core capable of loading drivers from the Firmware Volume (effectively a filesystem in flash), providing a standardized set of interfaces to everything that runs on top of it. From here it's a short step to a bootloader and UI, and then you're off out of EFI and you don't need to care any more.

How an EFI bootloader interacts with EFI:

Devices with bound drivers are represented by handles, and each handle may implement any number of protocols. Protocols are uniquely identified with a GUID. There's a LocateHandle() call that gives you a reference to all handles that implement a given protocol.

Each EFI protocol is represented by a table (ie, a structure) of data and function pointers. There's a couple of special tables which represent boot services (ie, calls that can be made while you're still in DXE) and runtime services (ie, calls that can be made once you've transitioned to the OS), and in turn these are contained within a global system table. The system table is passed to the main function of any EFI application, and walking it to find the boot services table then gives a pointer to the LocateHandle() function.
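The relationship between the system table, the boot services table, handles, and protocols can be modelled in a few lines. This is a toy model for orientation only, not firmware code: the class names, the efi_main entry point, and both GUID values are placeholders invented for the illustration.

```python
import uuid
from dataclasses import dataclass, field

# Placeholder GUIDs for illustration only; they are not real protocol GUIDs.
BLOCK_IO_GUID = uuid.UUID("00000000-0000-0000-0000-000000000001")
FILE_SYSTEM_GUID = uuid.UUID("00000000-0000-0000-0000-000000000002")


@dataclass
class Handle:
    name: str
    protocols: dict = field(default_factory=dict)  # GUID -> protocol table


@dataclass
class BootServices:
    handles: list = field(default_factory=list)

    def locate_handle(self, guid):
        """Return every handle that implements the given protocol."""
        return [h for h in self.handles if guid in h.protocols]


@dataclass
class SystemTable:
    boot_services: BootServices


def efi_main(system_table):
    """Model of an application entry point: walk the system table to the
    boot services, then look up handles by protocol GUID."""
    bs = system_table.boot_services
    return [h.name for h in bs.locate_handle(FILE_SYSTEM_GUID)]
```

A handle representing a whole disk might implement only the block I/O protocol, while a partition handle implements both; looking up the filesystem GUID then returns only the partition.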

A patch from Google's Tom Herbert attacks latency caused by excessive buffering, but its future in its current form is uncertain.

There may be queues within the originating application, in the network protocol code, in the traffic control policy layers, in the device driver, and in the device itself. The patch targets the queue in the device driver.

Any worthwhile network interface will support a ring of descriptors describing packets (perhaps 256 of them), but:

(1) the number of packets is the wrong parameter to use for the size of the queue; a byte count seems more reasonable.

(2) the queue length must be a dynamic parameter that responds to the current load on the system. Expecting system administrators to tweak transmit queue lengths manually seems like a losing strategy. Under heavy load the queue may need to be somewhat longer, because it is harder to keep it filled.

Tom's patch adds a new "dynamic queue limits" (DQL) library that is meant to be a general-purpose queue length controller; it introduces a new API.
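To make the two points concrete, here is a toy limiter that counts bytes rather than packets and adjusts its limit as completions arrive. It is a deliberately simplified sketch, not the DQL algorithm from the patch: the class name, the doubling and halving rules, and the default bounds are all assumptions chosen for readability.

```python
class ByteQueueLimit:
    """Toy byte-based transmit queue limiter (not the kernel's algorithm)."""

    def __init__(self, limit, min_limit=1500, max_limit=1 << 20):
        self.limit = limit          # bytes allowed in the queue at once
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.in_flight = 0          # bytes queued but not yet completed

    def avail(self):
        """Bytes that may still be queued before the limit is reached."""
        return self.limit - self.in_flight

    def queued(self, nbytes):
        """Record that nbytes were handed to the device."""
        self.in_flight += nbytes

    def completed(self, nbytes, starved):
        """Record a completion and adapt the limit.

        starved=True means the device ran dry while more data was waiting,
        so the limit grows; leftover queued bytes suggest the limit can shrink.
        """
        self.in_flight -= nbytes
        if starved:
            self.limit = min(self.max_limit, self.limit * 2)
        elif self.in_flight > 0:
            self.limit = max(self.min_limit, self.limit - self.in_flight // 2)
```

A driver-like caller would check avail() before queueing and report each completion, so the limit tracks the load without an administrator tuning it by hand.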

 The LIO has been merged.

 now supports IPv6.

 eCryptfs now has support for .

 md now has support for bad block management.

 tools/power/cpupower has been added with tools to monitor power management for multiple architectures, and is eventually slated to replace the Intel-specific tools in tools/power/x86.

Changes visible to kernel developers include:

A watchdog timer driver core has been added.
The SLUB slab allocator no longer requires locks on the fast path for architectures that support cmpxchg.
EFI non-volatile storage can now be used as a backend to persistently store log messages or other information.

2 (Important: changes some SKB operations)

LWN , whose intent is to be sure that pages under I/O cannot be modified (by the kernel or user space) until the I/O completes.

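Networking I/O has a similar problem.

Problem scenario: the client is retransmitting data when the server's ACK arrives. The client-side operation completes, the buffer (page) is released and then rewritten, which modifies the page that is still being retransmitted, so the retransmitted data is wrong (the server may reject it, of course, but this is clearly undesirable).

Purpose of the patch: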

Basically the series allows entities which inject pages into the networking stack to receive a notification when the stack has really finished with those pages (i.e. including retransmissions, clones, pull-ups etc) and not just when the original skb is finished with, which is beneficial to many subsystems which wish to inject pages into the network stack without giving up full ownership of those pages' lifecycle.

Solution:

The patch introduces the skb_frag_destructor structure and the skb_frag_ref() function. When the reference count (ref) drops to zero, the corresponding callback is invoked to wake up the waiting process.
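The mechanism is a reference count with a completion callback attached. A minimal model, assuming the simplest possible shape: the FragDestructor class and the frag_ref/frag_unref helpers are hypothetical stand-ins for skb_frag_destructor and skb_frag_ref(), not their real definitions.

```python
class FragDestructor:
    """Reference count plus a callback that runs when the count hits zero."""

    def __init__(self, on_release):
        self.ref = 1                  # the page owner's initial reference
        self.on_release = on_release  # e.g. wake up the waiting process


def frag_ref(destructor):
    """Take another reference (for a clone or a pending retransmission)."""
    destructor.ref += 1


def frag_unref(destructor):
    """Drop a reference; notify the owner once nobody uses the page."""
    destructor.ref -= 1
    if destructor.ref == 0:
        destructor.on_release()
```

In the problem scenario above, the retransmission would hold its own reference, so the owner is not told the page is free until that reference is dropped too.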

3 (Important)

An article written in response to the observation that "Our once approachable and hackable kernel has, over time, become more complex and difficult to understand."

This article is about friendship and friendship will rarely tell us how to fix a problem - we usually need to visit a specialist for that. But a friendship with data structures and locking mechanisms can help us identify which code is worthy of a closer inspection just as we have seen in this exploration.

On RCU

As the atomicity is constrained to a single pointer, two separate pointer accesses - even in the one rcu_read_lock()ed section - will not be coherent. This means it is very important not to dereference the same pointer twice (you might get two different values) and if you need to read two separate pointers, be very careful about assuming any relationship between the two values - there might be one but you need to understand the rest of the code to be sure. The article then points out a possible bug.
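The "do not dereference the same pointer twice" advice can be shown with a small deterministic model. This is not RCU itself: SharedPointer, its updater hook (which simulates a concurrent writer swapping the pointer right after each read), and the Range type are all invented for the illustration.

```python
from collections import namedtuple

Range = namedtuple("Range", ["low", "high"])


class SharedPointer:
    """Models one published pointer; updater simulates a concurrent writer."""

    def __init__(self, obj):
        self._obj = obj
        self.updater = None  # optional callable run right after each read

    def publish(self, obj):
        self._obj = obj

    def dereference(self):
        obj = self._obj
        if self.updater is not None:
            self.updater()  # the writer may swap the pointer at any time
        return obj


def read_twice(shared):
    # Risky: two dereferences may observe two different objects.
    return shared.dereference().low, shared.dereference().high


def read_once(shared):
    # Safer: dereference once, then use only the local snapshot.
    snapshot = shared.dereference()
    return snapshot.low, snapshot.high
```

With an updater that publishes a new Range between reads, read_twice can return a low value from the old object and a high value from the new one, while read_once always returns a pair that belonged together.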

“In RCU's case the skill that most people pay for is not the "quick change, always stable" trick that he is so proud of, but the "never stale" promise he has to make to achieve this. Among the several hundred times that RCU is employed by the Linux kernel, a substantial majority simply use RCU as an extra, cheap, reference count to stop things from going stale until they are really not needed. In a number of cases this is very explicit in a "put" function.” In such cases the grace-period feature is not really being used.

Seqlock: on the failure path, read_seqretry does not necessarily have to be paired with read_seqbegin, which differs from most mechanisms.

I did not understand the final section, “It's all about teamwork”.
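For reference, the usual paired begin/retry read loop looks roughly like the model below. It is a simplified, single-threaded sketch: SeqCount and Clock are hypothetical names, the interfere hook stands in for a writer running between the reader's two loads, and the real read_seqbegin also waits out an in-progress write, which this model leaves to the retry check.

```python
class SeqCount:
    """Sequence counter: odd while a write is in progress, even when stable."""

    def __init__(self):
        self.seq = 0

    def write_begin(self):
        self.seq += 1

    def write_end(self):
        self.seq += 1

    def read_begin(self):
        return self.seq

    def read_retry(self, start):
        # Retry if a write was in progress or completed since read_begin.
        return bool(start & 1) or self.seq != start


class Clock:
    """Two fields that must be read as a consistent pair."""

    def __init__(self):
        self.lock = SeqCount()
        self.sec = 0
        self.nsec = 0

    def set(self, sec, nsec):
        self.lock.write_begin()
        self.sec = sec
        self.nsec = nsec
        self.lock.write_end()

    def read(self, interfere=None):
        attempts = 0
        while True:
            attempts += 1
            start = self.lock.read_begin()
            sec = self.sec
            if interfere is not None and attempts == 1:
                interfere()  # simulated writer between the two loads
            nsec = self.nsec
            if not self.lock.read_retry(start):
                return (sec, nsec), attempts
```

The reader never blocks the writer; it simply discards a torn read and tries again, which is what makes the mechanism cheap on the read side.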
