2011-02-20 22:56:12

1 The real BKL end game
The BKL is expected to disappear completely in 2.6.39; Arnd Bergmann has released a patch series that removes the lock itself.

2 LCA: Server power management
Current state of data centers:
50% of the power goes to computing and 50% to everything else (network infrastructure and power supply loss, but the biggest component is air conditioning).
The best contemporary data centers have been able to reduce their overhead to about 20% - a big improvement. Cogeneration techniques - using heat from data centers to warm buildings, for example - can reduce that overhead even further.

A 48-core system, Matthew says, will draw about 350W when it is idle.
When the CPUs are working, though, the situation is a bit different: the power consumption is about 20W per core, or about 960W for a busy 48-core system.
  • Linux is better than any other operating system with regard to CPU power; we have more time in deep idle states and fewer wakeups than others.
  • Interest is shifting toward memory power management. If all of the CPUs in a package within the system can be idled, the associated memory controller will go idle as well. It's also possible to put memory into "self-refresh" mode if it is idle, reducing power use while preserving the contents. In other situations, running memory at a lower clock rate can reduce power usage.
  • Simply turning a system off: when virtual machines are lightly loaded, they can be consolidated onto a small number of machines and the other machines turned off.
  • The TSC frequency problem has been solved: once upon a time, changing the CPU frequency would change the rate of the TSC, but the CPU vendors fixed that a few years ago.
  • Lowering the frequency does not help much; the best results usually come from running at full speed and spending more time in a sleep state ("C state"). In addition, manufacturers have caused the TSC to run even when the CPU is sleeping.
  • Another interesting feature of contemporary CPUs is the "turbo" mode, which can allow a CPU to run in an overclocked mode for a period of time. Using this mode can get work done faster, allowing longer sleeps and better power behavior, but it depends on good power management if it is to work at all. If a core is to run in turbo mode, all other cores on the same die must be in a sleep state. The end result is that turbo mode can give good results for single-threaded workloads.
  • Some effort is going into powering down unused hardware components. Similar things can be done with other types of hardware - firewire ports, audio devices, SD ports, etc.
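
The "run at full speed, then sleep" argument above can be checked with a little arithmetic. The following is only a sketch of a simple two-state energy model, not something from the talk; the function names and every parameter are illustrative assumptions.

```c
#include <assert.h>

/*
 * Energy (in joules) used over one period in which the CPU is busy for
 * busy_s seconds at busy_w watts and asleep for the rest at sleep_w watts.
 * All parameters are illustrative inputs, not measured values.
 */
double period_energy_joules(double busy_w, double sleep_w,
                            double busy_s, double period_s)
{
    assert(busy_s >= 0.0 && busy_s <= period_s);
    return busy_w * busy_s + sleep_w * (period_s - busy_s);
}

/*
 * Returns 1 if finishing the work at full speed and then sleeping uses
 * less energy than stretching the same work out at a lower frequency.
 * slow_factor > 1 is how much longer the work takes at the lower frequency.
 */
int race_to_idle_wins(double fast_w, double slow_w, double sleep_w,
                      double fast_busy_s, double slow_factor, double period_s)
{
    double fast = period_energy_joules(fast_w, sleep_w, fast_busy_s, period_s);
    double slow = period_energy_joules(slow_w, sleep_w,
                                       fast_busy_s * slow_factor, period_s);
    return fast < slow;
}
```

With hypothetical numbers (20W busy, 14W at half speed, 1W asleep, 2s of work in a 10s period) the full-speed run wins; the outcome flips only if the slower frequency cuts power by more than it stretches the run time.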

In summary: Linux is doing pretty well with regard to enterprise-level power management; the GPU is the only place where we perform worse than Windows does.

3 Concurrent code and expensive instructions
Paul McKenney discusses an academic paper that caught his interest:
Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated

4 A tale of two SCSI targets
A comparison of LIO and SCST; the matter has already been decided, so there is not much to see here.

1 Bypassing linux-next

2 2.6.38 merge window part 2
  • The transparent huge pages feature has been merged.
  • A new tool called turbostat has been added; it can be used to obtain various types of performance statistics from Intel processors. Also added is x86_energy_perf_policy, which can be used to tweak the performance/power usage tradeoff on Intel CPUs.
  • The kernel can now synchronize its internal time to an external pulse-per-second (PPS) signal with a high degree of accuracy. The kernel has also gained the ability to generate (and accept) PPS signals on a parallel port, assuming one can still find a computer with such a port.
  • The x86 architecture can now boot XZ-compressed kernels.
  • Basic support for multitouch panels has been added to the human input devices (HID) layer.
  • The block I/O bandwidth controller can now be used with hierarchical control groups.
  • The block layer has a new disk event handling mechanism. What that means is that detection of device events (the insertion of an optical disc, for example) can be done in the drivers, eliminating the need to poll devices from user space.
  • The fallocate() system call can now be used to punch holes in the middle of files. Currently this feature is supported by XFS and OCFS2.
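
Hole punching with fallocate() can be sketched from user space as below. This is a minimal example using the FALLOC_FL_PUNCH_HOLE flag; the wrapper name punch_hole() is made up for illustration, and whether the call succeeds depends on the filesystem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Deallocate `len` bytes starting at `offset` in the open file `fd`.
 * FALLOC_FL_KEEP_SIZE is required alongside FALLOC_FL_PUNCH_HOLE, so the
 * file size is unchanged and the punched range reads back as zeros.
 * Returns 0 on success, -1 with errno set (for example EOPNOTSUPP when the
 * filesystem does not implement hole punching).
 */
int punch_hole(int fd, off_t offset, off_t len)
{
    assert(len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

A caller would typically treat EOPNOTSUPP as "leave the blocks allocated" rather than as a fatal error.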

Changes visible to kernel developers include:
  • ktest.pl, a script which can automate the process of building, testing, and bisecting kernels, has been added to the tools directory.
  • The "%pK" format specifier can be used to print the value of potentially sensitive kernel pointers, especially in places like /proc files. The behavior of this specifier depends on the value of /proc/sys/kernel/kptr_restrict; a value of zero means that kernel pointers will be printed as usual, one causes pointers to be printed as zero for users without CAP_SYSLOG, and two hides the pointers for all users.
  • Some new dentry operations have been added to support automounters within the VFS.
  • The fallocate() filesystem callback has been moved from struct inode_operations to struct file_operations.
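
The three kptr_restrict settings described for "%pK" amount to a small decision table. The function below is a user-space model of that policy, written only to make the rules explicit; it is not the kernel's code, and the name visible_pointer() is invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * User-space model of the "%pK" policy; not the kernel's implementation.
 * Returns the value that would be shown for pointer `ptr` given the
 * kptr_restrict setting (0, 1 or 2) and whether the reader holds
 * CAP_SYSLOG.
 */
uintptr_t visible_pointer(uintptr_t ptr, int kptr_restrict, bool has_cap_syslog)
{
    assert(kptr_restrict >= 0 && kptr_restrict <= 2);
    switch (kptr_restrict) {
    case 0:
        return ptr;                       /* printed as usual */
    case 1:
        return has_cap_syslog ? ptr : 0;  /* zero without CAP_SYSLOG */
    default:
        return 0;                         /* hidden from everybody */
    }
}
```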

3 Transparent huge pages in 2.6.38
Andrea gave a talk on THP with some interesting benchmark results.
  • Current Linux kernels assume that all pages found within a given virtual memory area (VMA) will be the same size. Thus, much of the initial part of the patch series is dedicated to enabling mixed page sizes within a VMA.
  • The patch modifies the page fault handler: it first tries to allocate a huge page; if that succeeds, any small pages already allocated in the same range are freed. If it fails, the handler takes the original path.
  • When huge pages are split: (a) on swap-out, where the huge page is handled as multiple small pages; (b) on mprotect(), mlock(), and similar calls.
  • A "khugepaged" kernel thread will occasionally attempt to allocate a huge page; if it succeeds, it will scan through memory looking for a place where that huge page can be substituted for a bunch of smaller pages.
  • Limitations: the current patch only works with anonymous pages; the work to integrate huge pages with the page cache has not yet been done. It also only handles one huge page size (2MB).
  • System administrators have a number of knobs that they can tweak; see Documentation/vm/transhuge.txt for all the details.
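
One way to see why mixed page sizes within a VMA matter: a huge page must start on a 2MB boundary, so a region with unaligned edges can only use huge pages in its middle. The helper below is a hypothetical sketch of that arithmetic (the name and the fixed 2MB size are assumptions taken from the limitation noted above).

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE_SIZE (2ULL * 1024 * 1024)   /* the single size handled: 2MB */

/*
 * Number of naturally aligned 2MB huge pages that fit inside the
 * address range [start, end). Unaligned head and tail portions
 * would still have to be backed by small pages.
 */
uint64_t huge_pages_in_range(uint64_t start, uint64_t end)
{
    assert(start <= end);
    /* Round start up and end down to the nearest 2MB boundary. */
    uint64_t first = (start + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
    uint64_t last = end & ~(HUGE_PAGE_SIZE - 1);
    return last > first ? (last - first) / HUGE_PAGE_SIZE : 0;
}
```

A 4MB region starting 4KB past a boundary, for example, has room for only one huge page; the rest needs small pages.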

4 Reworking disk events handling
For removable drives, polling is unavoidable.

There are a few problems with how polling is done on Linux; these were nicely outlined by Tejun. Polling on Linux requires opening the device; this is a somewhat heavyweight operation which does not naturally line up with other operations which might wake the processor.

Tejun's patch works by moving the polling into the kernel. That makes the polling more efficient by removing the need to open the device and by allowing the kernel to delay polling wakeups until something else is going on as well. There is a new function added to struct block_device_operations which should be implemented by drivers:
  unsigned int (*check_events) (struct gendisk *disk, unsigned int clearing);


For reference, the user-space check for a media change on a CD-ROM drive is:
  changed = ioctl(drive, CDROM_MEDIA_CHANGED, CDSL_CURRENT);
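
A complete user-space polling step might look like the sketch below. It shows why this style of polling is heavyweight: the device has to be opened on every check. The function name media_changed() is invented for illustration; error handling is kept minimal.

```c
#include <fcntl.h>
#include <limits.h>
#include <linux/cdrom.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * One user-space polling step for an optical drive such as /dev/sr0.
 * Returns 1 if the medium changed since the last check, 0 if not,
 * and -1 if the device could not be opened or queried.
 */
int media_changed(const char *device)
{
    /* O_NONBLOCK lets the open succeed even when no disc is present. */
    int drive = open(device, O_RDONLY | O_NONBLOCK);
    if (drive < 0)
        return -1;

    int changed = ioctl(drive, CDROM_MEDIA_CHANGED, CDSL_CURRENT);
    close(drive);
    return changed < 0 ? -1 : changed;
}
```

A desktop daemon would call something like this every few seconds, which is exactly the wakeup pattern the in-kernel approach is meant to avoid.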

1 The end of dcache_lock
As Nick notes, out-of-tree filesystems will need some changes to work with the new VFS.

2 How not to get a protocol implementation merged
UDPCP is a UDP-based protocol used for communications between cellular base stations. It offers reliable transfer, multicast, and more.

3 2.6.38 merge window part 1
  • The per-session group scheduling patch has been merged. This change should yield better interactive response under a number of workloads.
  • The dcache scalability patch set has been merged. This tricky code can yield significant performance improvements for some types of filesystem-heavy workloads.
  • Kernel modules are finally loaded with read-only code on the x86 architecture; data is now non-executable across the entire kernel.
  • Transmit packet steering (XPS) is now supported by the networking layer. This feature improves transmit performance by placing outgoing data on the proper (CPU-local) queue.
  • Support for the batman-adv mesh networking protocol
  • The patch set has been merged.
  • The ext3 filesystem has gained support for batched discard and the FITRIM ioctl().
  • Emulation for the Video4Linux1 API has been removed from the kernel.

Changes visible to kernel developers include:
  • Flags can now be specified for tracepoints with the TRACE_EVENT_FLAGS() macro. The initial flag of interest is TRACE_EVENT_FL_CAP_ANY, which allows the tracepoint to be used by unprivileged users; this flag has been applied to the system call tracepoints.
  • The perf trace command has been renamed to perf script.
  • are now supported by the kernel.
  • There is a new capability bit (CAP_SYSLOG) which controls access to the system log.
  • The "timerlist" infrastructure has been added for kernel subsystems which must manage lists of timers.

4 The CHOKe packet scheduler
  • Some TCP implementations are more dutiful than others when it comes to congestion control.
  • An increasing amount of traffic on the net uses other protocols (UDP in particular) which do not have congestion control built into them.
  • Excessive queue sizes in routers ("bufferbloat") can also disguise congestion problems until it is too late.

An alternative is the CHOKe packet scheduler; CHOKe stands either for "CHOose and Kill" or "CHOose and Keep," depending on one's attitude toward the problem. Stephen Hemminger has recently posted a CHOKe implementation for Linux.

CHOKe is intended for points where multiple flows come together - routers and bridges, primarily. The idea behind CHOKe is to keep the length of transmit queues under control and to penalize flows with excessive traffic while avoiding the need to maintain any sort of per-flow state.

The key feature of CHOKe - the one which distinguishes it from RED (from which it is derived) - is the check against a random packet in the queue: a queued packet is picked at random and compared with the arriving one.
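
The random-packet check can be sketched as follows. This is a simplified model, not Stephen Hemminger's implementation: it uses the instantaneous queue length where RED uses an average, leaves out RED's probabilistic drop, and takes the random number from the caller. The struct, enum, and function names are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_CAP 64

struct choke_queue {
    int flows[QUEUE_CAP];  /* flow id of each queued packet, oldest first */
    size_t len;
    size_t min_th;         /* at or below this length, always accept */
    size_t max_th;         /* at or above this length, always drop */
};

enum choke_verdict { CHOKE_ENQUEUED, CHOKE_DROPPED, CHOKE_DROPPED_BOTH };

/*
 * Handle one arriving packet of `flow`. `pick` is a caller-supplied random
 * number; it selects the queued packet used for the CHOKe comparison.
 */
enum choke_verdict choke_arrival(struct choke_queue *q, int flow, size_t pick)
{
    assert(q->min_th < q->max_th && q->max_th <= QUEUE_CAP);

    if (q->len > q->min_th) {
        size_t victim = pick % q->len;
        if (q->flows[victim] == flow) {
            /* Same flow: remove the queued packet and drop the arrival. */
            memmove(&q->flows[victim], &q->flows[victim + 1],
                    (q->len - victim - 1) * sizeof(q->flows[0]));
            q->len--;
            return CHOKE_DROPPED_BOTH;
        }
        if (q->len >= q->max_th)
            return CHOKE_DROPPED;
        /* RED's probabilistic drop would go here; this sketch omits it. */
    }
    q->flows[q->len++] = flow;
    return CHOKE_ENQUEUED;
}
```

No per-flow table is needed: a flow that fills most of the queue is simply the one most likely to be hit by the random pick, so it loses two packets at a time.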

CHOKe is mentioned in the thesis 'The Earliest Deadline First Scheduling for Real-Time Traffic in the Internet'.

5 Extending the use of RO and NX
  • It has been possible to mark the kernel .rodata segment read-only since 2.6.16 in early 2006, depending on the setting of CONFIG_DEBUG_RODATA. In 2.6.25, the kernel .rodata segment was additionally marked NX (i.e. no-execute), but only for the x86_64 architecture.
  • Matthieu Castet's revised patch: if CONFIG_DEBUG_RODATA is set, various sections of the kernel (.text and .rodata) are page aligned for both their start and end addresses. The NX bit is set for all pages from the end of the .text (i.e. code) section to the _end address that marks the end of the kernel's data section. Special cases: some older systems that use PCI BIOS require that some pages in the 640K-1M region be executable, and there are also some ISA mappings that require read-write access to that region, so the patch just sets pages in that region to be RW+X on systems where PCI BIOS is used. The second change simply modifies free_init_pages() to turn on NX for any pages that are freed that way, so that those pages have to be explicitly allowed to store executable code when they are reused.
  • A related patch adds read-only and no-execute flags to the pages used by kernel modules. The patch splits the module_core and module_init regions into three parts: code, read-only data, and read-write data. Each of those parts is page aligned and the page access permissions are set just before load_module() returns. For the code pieces, RO+X are set, while the data parts get NX and either RO or RW depending on the type of data. These changes are all governed by the setting of CONFIG_DEBUG_SET_MODULE_RONX.
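
The permission scheme for the three module parts can be written down as a small table. The sketch below uses the user-space PROT_* flags from <sys/mman.h> purely as an analogy for the kernel's page attributes; the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <sys/mman.h>

/* The three parts each module region is split into. */
enum module_part { MODULE_CODE, MODULE_RODATA, MODULE_RWDATA };

/*
 * Page protections for each part, expressed with the user-space PROT_*
 * flags as an analogy. Leaving out PROT_EXEC corresponds to setting NX.
 */
int module_part_prot(enum module_part part)
{
    switch (part) {
    case MODULE_CODE:
        return PROT_READ | PROT_EXEC;   /* RO + X */
    case MODULE_RODATA:
        return PROT_READ;               /* RO + NX */
    case MODULE_RWDATA:
        return PROT_READ | PROT_WRITE;  /* RW + NX */
    }
    assert(!"unknown module part");
    return PROT_NONE;
}
```

The property worth noticing is that no part is both writable and executable.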

In the future, CONFIG_DEBUG_RODATA and CONFIG_DEBUG_SET_MODULE_RONX are likely to be turned on for most distributions, or to default to "on".

1 Paul McKenney's parallel programming book
Paul McKenney has announced the availability of his book on parallel programming.

2 Gettys: Bufferbloat in 802.11 and 3G Networks
Jim Gettys has another post on bufferbloat, this time in 802.11 and 3G networks.

3 Announcing the beta release of PowerTOP 2.0

4 A Linux kernel compatibility layer for FreeBSD?
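Vocabulary: KPI (Kernel Programming Interface)

Roberson reused an existing InfiniBand stack, but that work is based on Linux, so he wrote a fairly large compatibility layer that maps the Linux API onto the FreeBSD API.

This led to some discussion. One thing is clear, though: the community of FreeBSD users and developers is just not large enough, and in many areas it cannot keep up.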

5 The trouble with firmware
Debian is moving the non-free firmware out of its main repository for the upcoming 6.0 ("Squeeze") release. But there are others who find even that insufficient and would like to see any mention of the non-free firmware files removed from the kernel. One project aims to deliver a completely free (under its definition) Linux distribution.

The plan seems to go far beyond just creating a kernel with obfuscated firmware names, though.

6 Who wrote 2.6.37
Some 1,140,000 lines of code were added, and 641,000 lines were removed, for a net growth of 494,000 lines. Most notably, perhaps: the 2.6.37 kernel includes patches from 1,250 developers, the highest ever.