lwn.net kernel news 2011/7 - baozhao - ChinaUnix blog

lwn.net kernel news 2011/7

Category: LINUX

2011-11-04 11:23:03

Matthew Garrett looks into the subtleties of booting Linux with EFI.

User-visible changes merged for 3.1 include:

Xen has gained a couple of new guest memory management techniques called "self-ballooning" and "frontswap-selfshrinking." Both use transcendent memory to try to improve memory performance and smooth out usage spikes.
The Xen PCI backend driver - allowing the kernel to export PCI devices to guests - has been merged.
The Xen balloon driver now supports memory hotplug.
Finally, Linux has gained Xen dom0 support.
The networking layer has a new "fanout" feature; using setsockopt(), packets captured from an AF_PACKET socket can be divided among multiple processes. A number of policies describing how packets are "fanned out" are supported.
The ptrace() system call has been augmented with some new commands, starting with PTRACE_SEIZE, which is like PTRACE_ATTACH but does not trap the traced process or change its signal state. PTRACE_INTERRUPT will stop a traced process without creating confusion with signals. PTRACE_LISTEN allows the traced process to receive certain events even though it is in a stopped state. All of these options are considered to be under development; a special PTRACE_SEIZE_DEVEL flag must be provided by user space to acknowledge an understanding that things might change.
The lseek() system call now implements SEEK_HOLE and SEEK_DATA; these operations can be used to locate extended blocks of zeroes within files.
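
As a rough user-space illustration of locating holes with lseek(), the sketch below builds a small sparse file and queries it. The helper names (make_sparse_file, next_hole, next_data) and the /tmp path are assumptions made for this example only. On a filesystem without hole tracking, the kernel may simply report the end of the file as the only hole.

```c
#define _GNU_SOURCE          /* exposes SEEK_DATA / SEEK_HOLE in glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, in case the libc headers did not expose the constants. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: create an unlinked temporary file holding
 * `data_len` bytes of data followed by a hole up to `total_len`.
 * Returns the file descriptor, or -1 on failure.
 */
int make_sparse_file(off_t data_len, off_t total_len)
{
    char path[] = "/tmp/sparse-XXXXXX";
    char buf[4096];
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                         /* file vanishes when fd is closed */

    memset(buf, 'x', sizeof(buf));
    while (data_len > 0) {
        size_t n = data_len < (off_t)sizeof(buf) ? (size_t)data_len : sizeof(buf);
        if (write(fd, buf, n) != (ssize_t)n) {
            close(fd);
            return -1;
        }
        data_len -= (off_t)n;
    }
    if (ftruncate(fd, total_len) != 0) {  /* extend without writing: a hole */
        close(fd);
        return -1;
    }
    return fd;
}

/* Offset of the first hole at or after `from`, or -1 on error. */
off_t next_hole(int fd, off_t from)
{
    assert(fd >= 0);
    return lseek(fd, from, SEEK_HOLE);
}

/* Offset of the first data at or after `from`, or -1 if there is none. */
off_t next_data(int fd, off_t from)
{
    assert(fd >= 0);
    return lseek(fd, from, SEEK_DATA);
}
```

A backup or copy tool could alternate next_data() and next_hole() to copy only the populated ranges. The exact hole offset depends on the filesystem's block size, so callers should not assume it equals the number of bytes written.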

Changes visible to kernel developers include:

A general-purpose CRC8 generation library has been added.
The networking layer has gained generic support for near-field communication (NFC) devices. See Documentation/networking/nfc.txt for details.
The power management callbacks found in struct dev_pm_ops have been augmented with a whole set of "noirq" versions. The power domains subsystem uses these callbacks for system-wide power transitions.
The check_acl() inode operation has been replaced by get_acl(), whose job is to simply fetch the access control list from disk. Actual checking of ACLs is now done in the core VFS code.
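
To give a feel for what a table-driven CRC8 generator does, here is a minimal user-space sketch of an MSB-first CRC-8. It is not the kernel library's code; the function names crc8_table_init and crc8_compute are made up for this illustration, and the polynomial is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a 256-entry lookup table for an MSB-first CRC-8 with polynomial `poly`. */
void crc8_table_init(uint8_t table[256], uint8_t poly)
{
    for (int byte = 0; byte < 256; byte++) {
        uint8_t crc = (uint8_t)byte;

        /* Process the byte one bit at a time, most significant bit first. */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ poly)
                               : (uint8_t)(crc << 1);
        table[byte] = crc;
    }
}

/* Fold `len` bytes of `data` into the running checksum `crc`. */
uint8_t crc8_compute(const uint8_t table[256], const uint8_t *data,
                     size_t len, uint8_t crc)
{
    assert(data != NULL || len == 0);
    while (len--)
        crc = table[crc ^ *data++];
    return crc;
}
```

With polynomial 0x07 and an initial value of 0, the ASCII string "123456789" yields the commonly quoted check value 0xF4, which is a convenient sanity test for any implementation of this variant.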

Discussion of how the realtime (rt) tree handles per-CPU variables

Background:

Safe access to per-CPU data requires a couple of constraints, though: the thread working with the data cannot be preempted and it cannot be migrated while it manipulates per-CPU variables. To avoid these hazards, access to per-CPU variables is normally bracketed with calls to get_cpu_var() and put_cpu_var(); the get_cpu_var() call, along with providing the address for the processor's version of the variable, disables preemption.

The rt tree's current approach: In the past, this problem has been worked around by protecting per-CPU variables with spinlocks. These locks keep the code preemptable, but they wreck the scalability that per-CPU variables were created to provide and complicate the code.

The future approach: whenever a process acquires a spinlock or obtains a CPU reference with get_cpu(), the scheduler will refrain from migrating that process to any other CPU. That process remains preemptable - code holding spinlocks can be preempted in the realtime world - but it will not be moved to another processor. This approach assumes that per-CPU data is already protected by a per-CPU lock; given that, it needs fewer changes than the earlier approach and scales better, but its prospects are still unknown.

4 (important; not yet fully digested)

The article starts with an overview of preemptible RCU read-side code, then lists a number of bugs and commits.

in_irq() can return inaccurate results because it consults the preempt_count() bitmask, which is updated in software. At the start of the interrupt, there is therefore a period of time before preempt_count() is updated to record the start of the interrupt, during which time the interrupt handler has started executing, but in_irq() returns false. Similarly, at the end of the interrupt, there is a period of time after preempt_count() is updated to record the end of the interrupt, during which time the interrupt handler has not completed executing, but again in_irq() returns false. This last is most emphatically the case when the end-of-interrupt processing kicks off softirq handling.

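The passage above can be expressed as the timeline [real_irq_begin, in_irq_begin, in_irq_end, real_irq_end]: real_irq and in_irq are offset from each other, so in_irq() gives the wrong answer both at the start and at the end of an interrupt. Many RCU bugs are related to this.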
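
The four-point timeline can be made concrete with a toy user-space model. This is only an illustration of the two windows in which a software-maintained flag disagrees with reality; the struct and function names are invented here and the time units are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical timeline of one interrupt, in arbitrary time units. */
struct irq_timeline {
    int real_irq_begin;  /* the handler actually starts executing      */
    int in_irq_begin;    /* software records the interrupt as started  */
    int in_irq_end;      /* software records the interrupt as finished */
    int real_irq_end;    /* the handler actually finishes executing    */
};

/* Ground truth: is interrupt code executing at time t? */
bool really_in_irq(const struct irq_timeline *tl, int t)
{
    return t >= tl->real_irq_begin && t < tl->real_irq_end;
}

/* What a software-maintained check in the style of in_irq() reports at time t. */
bool reported_in_irq(const struct irq_timeline *tl, int t)
{
    return t >= tl->in_irq_begin && t < tl->in_irq_end;
}

/* Count the time units during which the report disagrees with reality. */
int misreported_units(const struct irq_timeline *tl)
{
    int wrong = 0;

    /* The model assumes the four points occur in this order. */
    assert(tl->real_irq_begin <= tl->in_irq_begin);
    assert(tl->in_irq_begin <= tl->in_irq_end);
    assert(tl->in_irq_end <= tl->real_irq_end);

    for (int t = tl->real_irq_begin; t < tl->real_irq_end; t++)
        if (really_in_irq(tl, t) != reported_in_irq(tl, t))
            wrong++;
    return wrong;
}
```

For a timeline of {10, 12, 20, 25}, the report is wrong for the 2 units before the flag is set and the 5 units after it is cleared. Code that runs in either window and trusts the flag will take the wrong path.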

Hardware vendors still tend to consider only Windows

Matthew Garrett looks into the subtleties of booting Linux with EFI. Once again, hardware vendors are myopically focusing on Windows. "As we've seen many times in the past, the only thing many hardware vendors do is check that Windows boots correctly."

NAT was originally introduced to cope with the shortage of IPv4 addresses, but a separate need means it will continue to exist in IPv6: "People want to hide the details of the topology of their internal networks, therefore we will have NAT with ipv6 no matter what we think or feel."

A bug in [...] could cause files to go missing; it took the combined efforts of Linus, Al, and Hugh to resolve it.

“Our once approachable and hackable kernel has, over time, become more complex and difficult to understand.”

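To guard against a basic mistake in user-space programs (assuming that setuid() succeeded without checking its return value), the kernel's defenses are being strengthened proactively.

That led to a patch that changed do_execve_common() to return an error (EAGAIN) if the user was over their process limit and removed the check from set_user(). set_user() is called from setuid().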
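
On the user-space side, the mistake being defended against is easy to avoid. The following is a minimal sketch of a checked UID switch; the helper name switch_uid_checked is an assumption for this example, and a real program would usually also handle group IDs and supplementary groups.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Switch to `uid` and verify that the switch really happened.
 * Returns 0 on success, or -1 with errno set on failure.
 */
int switch_uid_checked(uid_t uid)
{
    /* (uid_t)-1 is never a valid target ID. */
    assert(uid != (uid_t)-1);

    if (setuid(uid) != 0)
        return -1;          /* e.g. EAGAIN or EPERM: the caller must not continue */

    /* Extra check: confirm that both real and effective IDs are now `uid`. */
    if (getuid() != uid || geteuid() != uid) {
        errno = EPERM;
        return -1;
    }
    return 0;
}
```

A privileged daemon would call this before handling untrusted input and exit immediately if it returns -1, rather than carrying on with its original privileges.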

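Earlier efforts were mostly in-kernel and too invasive to be merged. The approach by Pavel Emelyanov does most of its work in user space; its prospects are unclear.

The kernel released by Linus will be named 3.0 rather than 3.0.0. Stable kernels will continue to use the x.y.z style.

A regular feature article; the following trend is worth noting: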

The percentage of changes from hobbyists continues to drop; whether that's a bad thing (the kernel is becoming increasingly unapproachable to volunteer developers) or a good thing (it's impossible for anybody who can hack the kernel to remain unemployed) is still not clear.

It also presents two statistics based on long-term data:

The history from the beginning of the 2.5 development series covers about 9.5 years of development. During this time, some 291,664 changesets were contributed by 8,078 developers; those changes added 10.5 million lines of code.

Since 2.6.0, there have been 264,706 changesets contributed by 7,725 developers adding 8.7 million lines of code.

One other exercise with this data seemed interesting: a determination of who have been the most consistent contributors over those nine years and some. After running a script to track which developers contributed to each major release, twelve developers were found who had contributed to all 41 of them.

An old, unresolved problem; Jonathan Corbet suggests using udev to format the output data.

1
The poll(), select(), and epoll_wait() system calls are all implemented with the poll() method in the file_operations structure:

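unsigned int (*poll) (struct file *filp, struct poll_table_struct *pt);

The return value of poll() is a mask of the operations that can be performed without blocking; where appropriate, the method also adds a wait queue to pt. There is one optimization: once the poll() operation on some file is found not to block, the pt argument passed to the poll() operations of the remaining files is NULL.

Problem: for a device file, the driver needs to know which operations are going to be performed on it so that it can start the hardware as early as possible.

Solution: Hans Verkuil has posted a patch slightly changing the way poll() works, guaranteeing that the driver can always query the pt structure. With the patch, the poll table is never passed as null; instead, the "we will not be blocking" case is marked internally. So the set of events requested by the application is always available.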
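
From the application's point of view, the "set of events requested" is simply the events field passed to poll(). The sketch below shows that user-space side on a pipe; it does not show driver code, and the helper names wait_for_events and demo_pipe_poll are made up for this example.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel whether `fd` is ready for the requested `events`
 * (for example POLLIN), waiting at most `timeout_ms` milliseconds.
 * Returns the ready events, 0 on timeout, or -1 on error.
 */
int wait_for_events(int fd, short events, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
    int n;

    assert(fd >= 0);
    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    return n == 0 ? 0 : pfd.revents;
}

/*
 * Demo on a pipe: the read end is not readable while the pipe is empty
 * and becomes readable after one byte is written.
 * Returns 0 if both observations hold, -1 otherwise.
 */
int demo_pipe_poll(void)
{
    int p[2];
    int before, after;
    ssize_t written;

    if (pipe(p) != 0)
        return -1;

    before = wait_for_events(p[0], POLLIN, 0);
    written = write(p[1], "x", 1);
    after = wait_for_events(p[0], POLLIN, 100);

    close(p[0]);
    close(p[1]);
    return (before == 0 && written == 1 && after > 0 && (after & POLLIN)) ? 0 : -1;
}
```

A caller interested only in writes would pass POLLOUT instead of POLLIN. That distinction is what a capture driver would like to see, so that it can avoid starting hardware for a caller that never intends to read.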

2
There is still no consensus on how to expand the functionality of seccomp.

3
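An old problem resurfaces. The current CMA mechanism is used as an allocator behind dma_alloc_coherent(), but on ARM that function runs into a multiple-mapping problem: the same memory ends up mapped with inconsistent cache attributes, which leaves system behavior undefined.
Two solutions are currently on the table: use high memory (which is not common on ARM, and whose ARM implementation has particular difficulties) or unmap the low memory (at the cost of splitting huge pages into small pages).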

4
Background: a single driver may actually involve several pieces of hardware whose initialization depends on one another.
Grant's patch takes a simple approach to solving this problem: drivers which are unable to initialize their devices as the result of missing resources can request that the operation be retried at some point in the future. That request is a simple matter of returning -EAGAIN from the probe() function.
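
The retry idea can be sketched as a toy user-space model. This is not the kernel's driver core; struct toy_device, toy_probe and toy_probe_all are invented for this illustration, and dependencies are reduced to a single array index that is assumed to be valid.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device: `needs` is the index of the device it depends on, or -1 for none. */
struct toy_device {
    const char *name;
    int needs;
    bool bound;
};

/* Toy probe(): ask to be retried later if the dependency is not bound yet. */
int toy_probe(struct toy_device *devs, size_t i)
{
    int dep = devs[i].needs;

    if (dep >= 0 && !devs[dep].bound)
        return -EAGAIN;
    devs[i].bound = true;
    return 0;
}

/*
 * Toy driver core: keep re-probing deferred devices until a full pass
 * makes no progress. Returns the number of devices left unbound.
 */
size_t toy_probe_all(struct toy_device *devs, size_t n)
{
    size_t unbound = 0;
    bool progress = true;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!devs[i].bound)
            unbound++;

    while (unbound > 0 && progress) {
        progress = false;
        for (size_t i = 0; i < n; i++) {
            if (devs[i].bound)
                continue;
            if (toy_probe(devs, i) == 0) {
                progress = true;
                unbound--;
            }
        }
    }
    return unbound;
}
```

With this scheme the order in which devices are registered stops mattering: a device probed too early is simply retried once its dependency has bound. A dependency cycle never makes progress, so the loop stops and reports the devices that remain unbound.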
