lwn.net kernel news 2010/11 - baozhao - ChinaUnix blog

原上草baozhao.blog.chinaunix.net

lwn.net kernel news 2010/11

Category: LINUX

2011-02-15 12:10:06

1
Unlike CMA, the big chunk allocator does not rely on setting aside memory at boot time. Instead, it will attempt to organize a suitable chunk of memory at allocation time by moving other pages around. Over time, the memory compaction and page migration mechanisms in the kernel have gotten better and memory sizes have grown. So it is more feasible to think that this kind of large allocation might be more possible than it once was.
2 A collection of tracing topics

The tracing ABI

Ingo insists: "We'll need to embark on this incremental path instead of a rewrite-the-world thing", that is, rather than developing an entirely new ABI.

Stable tracepoints

Ingo to the concept of marking some tracepoints as stable

trace_printk()

It can be called like printk() (though without a logging level), but its output does not go to the system log; instead, everything printed via this path goes into the tracing stream as seen by ftrace. When tracing is off, trace_printk() calls have no effect. When tracing is enabled, instead, trace_printk() data can be made available to a developer with far less overhead than normal printk() output.

Unprivileged tracepoints
Access to tracepoints is currently limited to privileged users. Frederic Weisbecker has posted a patch which makes unprivileged access possible.

3 An alternative to suspend blockers
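A very good summary of Android's current power-management mechanism; rich in material and worth reading.
Early wakelocks required cooperation from user-space programmers.
The cpuidle mechanism is not sufficient.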
cpuidle-based system power management may not be sufficient to save as much energy as opportunistic suspend on the same system.
4 Ghosts of Unix past, part 4: High-maintenance designs
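An excellent series of articles, worth reading.
The Bible describes the road to destruction as wide, while the road to life is narrow and hard to find.
These designs have no major problems in themselves, but fitting them seamlessly into the rest of the system is costly.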
"high maintenance" designs work perfectly well and do exactly what is required. However they do not fit seamlessly into their surroundings and, while they may not actually leave disaster in their wake, they do impose a high cost on other parts of the system as a whole.

Setuid

The following four points:

The most obvious problem comes from the inherited environment. All libraries and all setuid programs need to be particularly suspicious of anything in the environment, and often need to explicitly ignore the environment when running setuid.
An example of a more general conflict comes from the combination of setuid with executable shell scripts. This did not apply at the time that setuid was first invented.
the signal delivery mechanism needs special handling for SIGCONT, simply because of the existence of setuid.
When writing to a file, Linux (like various flavors of Unix) checks if the file is setuid and, if so, clears the setuid flag.
Filesystem capabilities do not fundamentally solve the problem, and the implementation and the related tools may not keep up.
The plan for Fedora 15 is to use filesystem capabilities instead of full setuid. This isn't really a different mechanism, just a slightly reworked form of the original. Setuid stores just one bit per file which (together with the UID) determines the capabilities that the program will have. In the case of setuid to root, this is an all or nothing approach. Filesystem capabilities store more bits per file and allow different capabilities to be individually selected, so a program that does not need all of the capabilities of root will not be given them.

Filesystem links

This refers to hard links, which are unsatisfactory in many ways. Plan 9 does not support them.
Tools such as tar and du all need to handle hard links specially.
Anyone who can read a file can create a link to that file which the owner of the file may not be able to remove.
Editors need to take special care of linked files. It is generally safer to create a new file and rename it over the original rather than to update the file in place. When a file has multiple hard links it is not possible to do this without breaking that linkage, which may not always be desired.
Hard links would also make it awkward to reason about any name-based access control approach.
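
The editor problem described above can be reproduced in a few lines. This is a minimal sketch, assuming a filesystem that supports hard links; the file names (notes.txt, alias.txt) and the function name are made up for illustration. It creates a second name for a file, performs the "write a new file and rename it over the original" save, and reports what happened to the link.

```python
import os
import tempfile


def demo_hard_links():
    """Return (link count, same inode before save, same inode after save, alias content)."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "notes.txt")   # hypothetical file name
        alias = os.path.join(d, "alias.txt")      # hypothetical second name
        with open(original, "w") as f:
            f.write("v1")

        os.link(original, alias)                  # second name for the same inode
        nlink_before = os.stat(original).st_nlink
        same_before = os.path.samefile(original, alias)

        # "Safe save": write a new file, then rename it over the original name.
        tmp = os.path.join(d, "notes.txt.tmp")
        with open(tmp, "w") as f:
            f.write("v2")
        os.replace(tmp, original)

        # The original name now points at a new inode; the alias keeps the old one.
        same_after = os.path.samefile(original, alias)
        with open(alias) as f:
            alias_content = f.read()
        return nlink_before, same_before, same_after, alias_content


if __name__ == "__main__":
    print(demo_hard_links())
```

After the save, the two names no longer refer to the same file and the alias still holds the old contents, which is exactly the broken linkage the text warns about.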

Harken to the ghosts

Technology alone does not guarantee success.

Unfortunately, mere technical excellence is no guarantee of success. As Paul McKenney noted at the 2010 Kernel Summit, economic opportunity is at least an equal reason for success, and is much harder to come by.
The alternative is to live with our mistakes and attempt to minimize their ongoing impact, deprecating that which cannot be discarded.

1
'trace' is our shot at improving the situation: it aims at providing a simple to use and straightforward tracing tool based on the perf infrastructure and on the well-known perf profiling workflow
2 Simple user-space tracing
Ingo Molnar has posted a patch. It is currently implemented as an extension to the prctl() system call which allows an application to inject tracing data into the kernel.
3 Punching holes in files
The XFS and OCFS2 filesystems currently have the ability to "punch a hole" in a file - a portion of the file can be marked as unwanted and the associated storage released. Josef Bacik, noting that this capability may be added to other filesystems in the near future, came to the conclusion that the kernel should offer a standard interface for hole punching. The result is a patch adding that ability.
4 TTY-based group scheduling
Quite a lot of comments; perhaps worth a look.
Groups are thus a nice feature, but they have not seen heavy use since they were merged for the 2.6.24 release. The reasons for that are clear: groups require administrative work and root privileges to set up; most users do not know how to tweak the knobs and would really rather not learn. What has been missing all these years is a way to make group scheduling "just work" for ordinary users. That is the goal of the TTY-based group scheduling patch.

In short, this patch automatically creates a group attached to each TTY in the system. All processes with a given TTY as their controlling terminal will be placed in the appropriate group; the group scheduling code can then share time between groups of processes as determined by their controlling terminals.
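
To see why grouping by controlling terminal matters, here is an idealized sketch comparing the two policies. It is only a model under stated assumptions, not the scheduler's actual algorithm: every task is assumed to be runnable all the time, weights and nice values are ignored, and the task names and the functions per_task_shares and per_tty_shares are invented for this example.

```python
from collections import defaultdict


def per_task_shares(tasks):
    """Without groups: every runnable task gets an equal share of the CPU."""
    share = 1.0 / len(tasks)
    return {name: share for name, _tty in tasks}


def per_tty_shares(tasks):
    """With per-TTY groups: split time equally between TTYs, then within each TTY."""
    groups = defaultdict(list)
    for name, tty in tasks:
        groups[tty].append(name)
    group_share = 1.0 / len(groups)
    return {
        name: group_share / len(members)
        for members in groups.values()
        for name in members
    }


# Hypothetical workload: a four-way parallel build on one terminal,
# a single interactive program on another.
TASKS = [("make-%d" % i, "pts/0") for i in range(4)] + [("player", "pts/1")]

if __name__ == "__main__":
    print(per_task_shares(TASKS)["player"])
    print(per_tty_shares(TASKS)["player"])
```

In this model the interactive task's share rises from one fifth of the CPU to one half once the build is confined to its own group, without the user configuring anything.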

Opponents argue that this should be done in user space; I personally agree.
Lennart Poettering said that "Binding something like this to TTYs is just backwards"; he would rather see something which is based on sessions. And, he said, all of this could better be done in user space.
5 The media controller subsystem
contemporary video devices are not just frame grabbers anymore. That complexity is revealing limitations in the kernel's device model, prompting the proposal of a new "media controller" abstraction.

Video acquisition devices have never been entirely simple. Even a minimal camera device will usually be a composite of at least three distinct devices: a sensor, a DMA bridge to move frames between the sensor and main memory, and an I2C bus dedicated to controlling the sensor. Most devices coming onto the market now are more sophisticated than that.

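The individual components can operate independently, work together, or cooperate with devices from other subsystems. The current V4L2 system and the Linux device model do not account for this.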

The patch creates a new media_device type which has the responsibility of managing the various components which make up a media-related device. These components are called "entities"; and they can take many forms. Sensors, DMA engines, video processing units, focus controllers, audio devices, and more are all considered to be "entities" in this scheme.

Most entities will have at least one "pad," being a logical connection point where data can flow into or out of the device. "Data" in this sense can be multimedia data, but it might also be a control stream. Pads are exclusively input ("sink") or output ("source") ports, and an entity can have an arbitrary number of each. The final piece is called a "link"; it is a directional connection from a source pad to a sink. Links are created by the media device driver, but they can, in some cases, be enabled or disabled from user space.

As an aside: entities also have a "group" number assigned to them; groups are intended to indicate hardware which is meant to function together. All of the units described above would probably be placed into the same group by the driver.
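
The entity/pad/link vocabulary is easier to keep straight with a small model. The following is a toy data model, not the kernel's structures or API: the class names (Entity, Pad, Link, MediaDevice), the methods, and the example entity names are all assumptions made for this sketch. It only captures the rules stated above: pads are either sinks or sources, links run from a source pad to a sink pad, and entities carry a group number.

```python
from dataclasses import dataclass, field
from enum import Enum


class PadKind(Enum):
    SINK = "sink"      # data flows into the entity
    SOURCE = "source"  # data flows out of the entity


@dataclass(frozen=True)
class Pad:
    entity: str        # name of the owning entity
    index: int
    kind: PadKind


@dataclass
class Entity:
    name: str
    group: int         # hardware meant to function together shares a group
    pads: list = field(default_factory=list)

    def add_pad(self, kind):
        pad = Pad(self.name, len(self.pads), kind)
        self.pads.append(pad)
        return pad


@dataclass
class Link:
    source: Pad
    sink: Pad
    enabled: bool = False


@dataclass
class MediaDevice:
    entities: dict = field(default_factory=dict)
    links: list = field(default_factory=list)

    def add_entity(self, name, group):
        entity = Entity(name, group)
        self.entities[name] = entity
        return entity

    def connect(self, source, sink, enabled=False):
        # Links are directional: always from a source pad to a sink pad.
        if source.kind is not PadKind.SOURCE or sink.kind is not PadKind.SINK:
            raise ValueError("a link must run from a source pad to a sink pad")
        link = Link(source, sink, enabled)
        self.links.append(link)
        return link


def build_camera():
    """Hypothetical minimal camera: a sensor feeding a DMA bridge."""
    dev = MediaDevice()
    sensor = dev.add_entity("sensor", group=1)
    bridge = dev.add_entity("dma-bridge", group=1)
    dev.connect(sensor.add_pad(PadKind.SOURCE), bridge.add_pad(PadKind.SINK), enabled=True)
    return dev
```

Calling build_camera() yields one device with two entities in the same group and a single enabled link between them; trying to connect the pads in the opposite direction raises ValueError.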

The problem may be bigger than just media devices; other parts of the kernel have similar problems.

6 Making attacks a little harder
The aim is to make it harder for attackers to obtain information which could be used to compromise the kernel.
One step is removing world-read access from /proc/kallsyms and read-protecting System.map as well. That helps little on distribution kernels, though: an attacker does not need to read /proc/kallsyms or System.map if the target system is running a stock kernel; they need only dig up a package file containing the needed information.

When the kernel exposes pointer values to user space, it gives information to potential attackers. These values can be found in a number of places, including the system log and numerous places in /proc. Dan has posted a patch adding a new sysctl knob controlling access to the syslog() system call.

The proposal is simple: as much of the kernel should be read-only as possible, most especially function pointers and other execution control points, which are the easiest target to exploit when an arbitrary kernel memory write becomes available to an attacker.

Disallowing automatic module loading
It seems clear that a kernel which never allows users to trigger the loading of modules is less likely to be affected by any vulnerability which is found in a loadable module.

Developing tools that find vulnerabilities automatically: one technique which can help in this regard is "fuzzing," the process of passing random values into system calls and looking for unexpected behavior.

1 Embedded Linux Flag Version
A jointly coordinated kernel version for the embedded community.
2.6.35 will be the first embedded flag version, and it will be supported by (at least) Sony, Google, MeeGo, and Linaro. "First, it should be explained what having a flag version means. It means that suppliers and vendors throughout the embedded industry will be encouraged to use a particular version of the kernel for software development, integration and testing. Also, industry and community developers agree to work together to maintain a long-term stable branch of the flag version of the kernel (until the next flag version is declared), in an effort to share costs and improve stability and quality."
2 Linaro 10.11 released
3 Netoops
Netoops is a simple driver which will, in response to a kernel oops, collect the most recent kernel logs and deliver them to a server across the net.
4 Checkpoint/restart: it's complicated
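Too complex to get into mainline in the short term. There is also a user-space implementation project; each approach has its pros and cons.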
5 ELCE: Grant Likely on device trees
Some SoCs cannot enumerate their devices, and hard-coding them is clearly less effective than organizing them with a device tree. The question is: why do PPC and x86 support it as well?
A device tree represents the devices that are part of a particular system, such that it can be passed to the kernel at boot time, and the kernel can initialize and use those devices. For architectures that don't use device trees, C code must be written to add all of the different devices that are present in the hardware.

"going data-driven to describe our platforms is the right thing to do". There is proof that it works in the x86 world as "that's how it's been done for a long time".
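
The "data-driven" idea can be illustrated without any kernel code. This is a rough sketch under loud assumptions: the board description, the node names, the "acme,..." compatible strings, and the driver table are all invented, and a real device tree is a compiled binary blob rather than a Python dictionary. What it shows is the shape of the approach: generic code walks a description of the hardware and binds drivers by matching a "compatible" string, so supporting a new board means changing data rather than writing board-specific code.

```python
# Hypothetical platform description; every name and address here is made up.
BOARD = {
    "name": "soc",
    "children": [
        {"name": "serial@1000", "compatible": "acme,uart", "reg": 0x1000},
        {
            "name": "i2c@2000",
            "compatible": "acme,i2c",
            "reg": 0x2000,
            "children": [
                {"name": "sensor@48", "compatible": "acme,temp-sensor", "reg": 0x48},
            ],
        },
    ],
}


def make_driver(label):
    """Build a stand-in probe function that just reports what it bound to."""
    def probe(path, node):
        return f"{label} bound to {path} (reg={node['reg']:#x})"
    return probe


# Drivers are looked up by "compatible" string; there is none for the sensor.
DRIVERS = {
    "acme,uart": make_driver("uart"),
    "acme,i2c": make_driver("i2c"),
}


def walk(node, parent_path=""):
    """Yield (path, node) for the node and all of its descendants, depth first."""
    path = parent_path + "/" + node["name"]
    yield path, node
    for child in node.get("children", []):
        yield from walk(child, path)


def probe_all(tree, drivers):
    """Bind every node whose "compatible" string has a registered driver."""
    bound = []
    for path, node in walk(tree):
        driver = drivers.get(node.get("compatible"))
        if driver is not None:
            bound.append(driver(path, node))
    return bound


if __name__ == "__main__":
    for line in probe_all(BOARD, DRIVERS):
        print(line)
```

Nodes without a matching driver, such as the sensor here, are simply skipped; adding support for them means registering one more entry in the driver table.
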
6 A more detailed look at kernel regressions
Wysocki handled the regression tracking himself, but it is now a three-person operation, with Maciej Rutecki turning email regression reports into kernel bugzilla entries, and Florian Mickler maintaining the regression entries: marking those that have been fixed, working with the reporters to determine which have been fixed, and so on.

Regressions for a particular kernel release are tracked through the following two development cycles. For example, when 2.6.36 was released, the tracking of 2.6.34 regressions ended. That doesn't mean that any remaining regressions have magically been fixed, of course, and they can still be tracked using the meta-bug associated with a release.

Kernel	# reports	# pending
2.6.26	180	1
2.6.27	144	4
2.6.28	160	10
2.6.29	136	12
2.6.30	177	21
2.6.31	146	20
2.6.32	133	28
2.6.33	116	18
2.6.34	119	15
2.6.35	63	28
Total	1374	157
Reported and pending regressions
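
The table is easier to read as rates than as raw counts. The short script below is only a convenience sketch: the numbers are copied from the table above, while the function names are made up here. It recomputes the totals and the fraction of reported regressions still pending for each kernel.

```python
# (kernel, reports, pending), copied from the table above.
REGRESSIONS = [
    ("2.6.26", 180, 1),
    ("2.6.27", 144, 4),
    ("2.6.28", 160, 10),
    ("2.6.29", 136, 12),
    ("2.6.30", 177, 21),
    ("2.6.31", 146, 20),
    ("2.6.32", 133, 28),
    ("2.6.33", 116, 18),
    ("2.6.34", 119, 15),
    ("2.6.35", 63, 28),
]


def totals(rows):
    """Return (total reports, total pending) over all kernels."""
    return sum(r for _, r, _ in rows), sum(p for _, _, p in rows)


def pending_rates(rows):
    """Map each kernel to the fraction of its reported regressions still pending."""
    return {kernel: pending / reports for kernel, reports, pending in rows}


if __name__ == "__main__":
    reports, pending = totals(REGRESSIONS)
    print(f"total: {reports} reports, {pending} pending ({pending / reports:.1%})")
    for kernel, rate in pending_rates(REGRESSIONS).items():
        print(f"{kernel}: {rate:.1%} pending")
```

The totals match the table's last row (1374 and 157), so about 11% of all reported regressions were still pending. The rate climbs from under 1% for 2.6.26 to about 44% for 2.6.35, which is expected since the newest release has had the least time for fixes.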

The lifetime of regressions: the average for the earlier kernels is 24.4 days, while the later kernels have an average of 32.3 days. Reason: for the later kernels, regressions fixed within a week during the -rc releases are not counted.

Regressions by subsystem: those subsystems that are "closer" to the hardware tend to have more regressions.

A jab at the IPv6 designers

i have theorized in the past that the problem we face is that an insufficient number of axe murderers are attending those kinds of research meetings.

-- on IPv6
1 The 2010 Kernel Summit
A large number of topics.
a Checkpoint/restart allows the state of a set of processes to be saved to persistent storage, then restarted at some future time, possibly on a different system.

b Linux at NASDAQ
Unexpected overhead
The consensus in the room was that the biggest piece of wakeup overhead is saving and restoring the floating-point unit status. The exchange doesn't do floating-point, of course, but the FPU covers a lot more than basic number crunching anymore. In particular, SSE instructions are used to implement memcpy() in glibc, and SSE use will force a save/restore.

Asynchronous network I/O remains high on his list; 10G Ethernet cards are out there, and 40G is not that far away. That kind of interface can generate data rates that are seriously difficult for the system to keep up with. So they are looking at a number of techniques adopted by the InfiniBand industry: separating control and data paths, bypassing the kernel for data streams, etc. There is a lot of pressure to be able to keep up with these data rates; the kernel will have to do something to reduce network stack overheads and make it possible.

2 The second half of the 2.6.37 merge window

It is now possible to build a generally useful kernel without the BKL
The ext4 filesystem now supports "lazy inode table initialization"
The file_system_type structure has a new mount() function which is meant to replace get_sb().
