What's new in Linux kernel 2.6.20 (part 1) - yishuihe - ChinaUnix blog

Short overview (for news sites, etc)

2.6.20 brings Linux into the virtualization trend. This release adds two virtualization implementations: KVM, a full-virtualization implementation that uses Intel/AMD hardware virtualization capabilities, and a paravirtualization implementation that can be used by different hypervisors (Rusty's lguest now; Xen and VMware in the future). This release also adds initial Sony Playstation 3 support, a fault injection debugging feature, UDP-Lite support, better per-process IO accounting, relative atime, support for suspending to swap files, relocatable x86 kernel support for kdump users, small micro-optimizations in x86 (sleazy FPU, regparm, support for the Processor Data Area, optimizations for the Core 2 platform), a generic HID layer, DEEPNAP power savings for PPC970, a lockless radix-tree read side, shared page tables for hugetlb, ARM support for the AT91 and iop13xx processors, full NAT for nf_conntrack and many other things.

Important things (AKA: the cool stuff)

Sony Playstation 3 support

You may like the Wii or the 360 more, but only the PS3 is gaining official Linux support, written by Sony engineers. Note that the support is incomplete at this time (apparently a kernel with it enabled will not boot on a stock PS3), and the included devices, such as the graphics card, are not supported yet.

Virtualization support through KVM

KVM adds a driver for Intel's and AMD's hardware virtualization extensions to the x86 architecture (KVM will not work on CPUs without virtualization capabilities).

The driver adds a character device (/dev/kvm) that exposes the virtualization capabilities to userspace. Using this driver, a process can run a virtual machine (a "guest") in a fully virtualized PC containing its own virtual hard disks, network adapters, and display. Each virtual machine is a process on the host; a virtual cpu is a thread in that process. kill(1), nice(1), top(1) work as expected. In effect, the driver adds a third execution mode to the existing two: we now have kernel mode, user mode, and guest mode. Guest mode has its own address space mapping guest physical memory (which is accessible to user mode by mmap()ing /dev/kvm). Guest mode has no access to any I/O devices; any such access is intercepted and directed to user mode for emulation.
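
As a rough illustration of how userspace talks to such a character device, the following Python sketch opens /dev/kvm and asks the driver for its API version. It assumes a current kernel, where the KVM_GET_API_VERSION ioctl has request number 0xAE00; the interface shipped in 2.6.20 may differ in detail. The function name kvm_api_version is made up for this example.

```python
import fcntl
import os

# _IO(0xAE, 0x00) from <linux/kvm.h> on current kernels. Assumption: the
# interface that shipped with 2.6.20 may differ in detail.
KVM_GET_API_VERSION = 0xAE00


def kvm_api_version(path="/dev/kvm"):
    """Return the KVM API version, or None if KVM is unavailable."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        # No hardware virtualization, driver not loaded, or no permission.
        return None
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    except OSError:
        return None
    finally:
        os.close(fd)
```

Calling kvm_api_version() on a machine without virtualization extensions (or without access to the device) simply returns None, which mirrors the note above that KVM does not work on CPUs lacking those capabilities.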

32-bit and 64-bit guests are supported (but not x86-64 guests on x86-32 hosts!). For i386 guests and hosts, both PAE and non-PAE paging modes are supported. SMP hosts and UP guests are supported; SMP guests aren't (support will be added in the future). You can also start multiple virtual machines on a host. Performance is currently non-stellar, but it is expected to improve a lot in future releases.

The Windows install currently bluescreens due to a problem with the virtual APIC; a fix is being worked on and will be added in a future release. A temporary workaround is to use an existing image or to install through qemu. 64-bit Windows does not work either.

Paravirtualization support for i386

Paravirtualization is the act of running a guest operating system, under control of a host system, where the guest has been ported to a virtual architecture which is almost like the hardware it is actually running on. This technique allows full guest systems to be run in a relatively efficient manner. The new framework lets the kernel be linked to different hypervisors (lguest/lhype/rustyvisor implements a hypervisor in 6,000 lines; Xen and VMware will probably be ported to this framework some day). There are limitations, such as no SMP support yet; this feature will evolve a lot over time.

Relocatable kernel support for x86

This feature (enabled with CONFIG_RELOCATABLE) isn't very noticeable for end users, but it's quite interesting from a kernel point of view. Until now, an i386 kernel had to be loaded at a fixed memory address in order to work; loading it anywhere else wouldn't work. This feature makes it possible to compile a kernel that can be loaded at different 4K-aligned addresses, always below 1 GB, with no runtime overhead. Kdump users will benefit from this. Kdump, introduced in an earlier release, is triggered by a kernel crash: it boots a "rescue kernel" that was previously loaded at an 'empty' address, saves the memory where the crashed kernel was placed, dumps it to a file and continues booting the system. Until now the rescue kernel needed to be compiled with different configuration options to make it bootable at a different address. With a relocatable kernel, the same kernel can be booted at different addresses.

Fault injection

This is a debugging feature that 'injects' failures into several layers of the kernel (kmalloc() failures, alloc_pages() failures, disk IO errors). By injecting them on purpose, developers can test how their code reacts to errors that are very difficult to reproduce in the real world, where things do not fail so often. For example, a filesystem might not correctly handle an error triggered by a broken hard disk. Because those error code paths are exercised very rarely, the code may contain bugs that a user could hit some day; injecting the errors on purpose lets testing find those bugs much faster. The feature is enabled by the following configuration options: CONFIG_FAILSLAB, CONFIG_FAIL_PAGE_ALLOC and CONFIG_FAIL_MAKE_REQUEST. If you also want to configure them via debugfs, you must enable CONFIG_FAULT_INJECTION_DEBUG_FS.
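
The idea can be shown with a small userspace model in Python. This is only a sketch, not the kernel implementation: the class FaultInjector and the functions allocate and copy_with_fallback are hypothetical, and the 'interval' and 'times' knobs are loosely modeled on the kernel attributes of the same names.

```python
class FaultInjector:
    """Toy model of a fault-injection point with 'interval' and 'times' knobs."""

    def __init__(self, interval=1, times=-1):
        self.interval = interval  # fail only on every Nth call
        self.times = times        # maximum number of failures, -1 = unlimited
        self.calls = 0

    def should_fail(self):
        if self.times == 0:
            return False          # failure budget used up
        self.calls += 1
        if self.calls % self.interval != 0:
            return False
        if self.times > 0:
            self.times -= 1
        return True


def allocate(size, injector):
    """Hypothetical allocator that returns None when a failure is injected."""
    if injector.should_fail():
        return None
    return bytearray(size)


def copy_with_fallback(data, injector):
    """Code under test: it must cope with allocate() returning None."""
    buf = allocate(len(data), injector)
    if buf is None:
        return None               # the rarely exercised error path
    buf[:] = data
    return bytes(buf)
```

With FaultInjector(interval=3, times=2), every third allocation fails until two failures have been injected, so the error path in copy_with_fallback is exercised deterministically instead of waiting for a real out-of-memory condition.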

IO Accounting

The existing per-task IO accounting isn't very useful. It simply counts the number of bytes passed into read() and write(). So if a process reads 1MB from an already-cached file, it is accused of having performed 1MB of I/O, which is 'wrong'. The new IO accounting implements per-process statistics of "storage I/O" (i.e. I/O that _really_ reaches the storage device; Linux already had storage I/O statistics, but they were not per-task). The data is reported through taskstats and procfs (/proc/$PID/io).
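
A minimal Python sketch of reading these counters from procfs follows. It assumes the "name: value" line format found on current kernels, where rchar/wchar count bytes passed through read() and write() while read_bytes/write_bytes count storage I/O; the helper names parse_proc_io and read_task_io are made up for this example.

```python
def parse_proc_io(text):
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            stats[name.strip()] = int(value)
    return stats


def read_task_io(pid="self"):
    """Return the I/O counters of a process, or None if they are unavailable."""
    try:
        with open(f"/proc/{pid}/io") as f:
            return parse_proc_io(f.read())
    except OSError:
        # Not Linux, accounting not compiled in, or no permission.
        return None
```

Comparing rchar with read_bytes for a process that re-reads a cached file shows the difference described above: the first grows with every read() call, the second only when the storage device is actually touched.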

Relative atime support

'Atime' is the 'access time' field of a file: when a process reads a file, its atime is updated. Disabling atime updates with the 'noatime' mount flag is probably the performance tweak most used by Linux administrators: an active server is continually reading files, generating lots of atime updates, which translate to metadata updates that the filesystem must write to disk. Writing those updates can seriously damage your performance. Believe it or not, a busy server like kernel.org (vsftpd + apache workload) cut its load average in half just by mounting its filesystems with 'noatime'.

Relative atime ('relatime') only updates the atime if the previous atime is older than the mtime or ctime. It avoids a lot of atime metadata updates (but not all of them, obviously; there's 'noatime' for that). It's like noatime, but still useful for applications like mutt that need to know whether a file has been read since it was last modified. Currently only OCFS2 supports it. A corresponding patch against mount(8) is available.
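
The rule is small enough to write down directly. The Python sketch below follows the description above; the function names are made up for this example, and treating equal timestamps as "older" is an assumption.

```python
import os


def relatime_needs_update(atime, mtime, ctime):
    """Return True if a read should update atime under 'relatime'.

    Times are plain numbers (seconds since the epoch). Treating equal
    timestamps as "older" is an assumption of this sketch.
    """
    return atime <= mtime or atime <= ctime


def file_needs_atime_update(path):
    """Apply the rule to the current timestamps of a real file."""
    st = os.stat(path)
    return relatime_needs_update(st.st_atime, st.st_mtime, st.st_ctime)
```

A file read after its last modification (atime newer than both mtime and ctime) is left alone on later reads, which is where the savings come from; a mail reader can still tell new mail from read mail because the first read after a modification does update the atime.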

UDP-Lite support

Support for UDP-Lite over IPv4, plus an extension for UDP-Lite over IPv6, is added in 2.6.20. UDP-Lite is a Standards-Track IETF transport protocol whose distinguishing characteristic is a variable-length checksum. This has advantages for the transport of multimedia (video, VoIP) over wireless networks, as partly damaged packets can still be fed into the codec instead of being discarded due to a failed checksum test.
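
The effect of a variable-length checksum can be sketched in a few lines of Python. This is a simplified illustration, not the wire format: the real protocol also covers a pseudo-header and carries the coverage length in the packet header, and the function names here are hypothetical.

```python
def internet_checksum(data):
    """16-bit ones-complement checksum, as used by the IP protocol family."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_over(packet, coverage):
    """Checksum of only the first 'coverage' bytes of the packet."""
    return internet_checksum(packet[:coverage])


def is_acceptable(packet, coverage, expected):
    """Receiver side: accept the packet if the covered part is intact."""
    return checksum_over(packet, coverage) == expected
```

If only the first 8 bytes (say, a codec header) are covered, a bit flip later in the payload still passes the check and the frame reaches the codec; with full coverage, as in plain UDP, the same packet would be dropped.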

Generic HID layer

Until now the HID (Human Interface Device) layer only worked with USB devices. 2.6.20 turns the USB-oriented HID layer into a generic HID layer that can be used by any subsystem that needs it, such as Bluetooth.

Sleazy FPU optimization

This is an x86-32 port of an x86-64 feature implemented in an earlier release. It doesn't bring huge performance gains, just a small improvement in FPU-intensive programs, but it's an interesting optimization. Until now the kernel had 100% lazy FPU behavior: after *every* context switch, a trap is taken on the first FPU use to restore the FPU context lazily. This is of course great for applications that have very sporadic or no FPU use (since you avoid doing the expensive save/restore all the time).

However, very frequent FPU users take an extra trap on every context switch. This feature adds a simple heuristic to this code: after 5 consecutive context switches with FPU use, the lazy behavior is disabled and the context is restored on every context switch. If the application indeed uses the FPU, the trap is avoided (the chance of the 6th time slice using the FPU after the previous 5 have done so is obviously quite high). After 256 switches, this is reset and the lazy behavior returns (until there are 5 consecutive ones again). The reason for the reset is to give applications that do longer bursts of FPU use the lazy behavior back after some time.
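
The heuristic can be modeled with a short Python simulation. This is a sketch of the behavior as described above, not the kernel code; the constant and function names are made up, and the exact threshold comparison is an assumption.

```python
LAZY_THRESHOLD = 5    # consecutive FPU-using time slices before going eager
COUNTER_WRAP = 256    # the counter resets after this many switches


def simulate(fpu_usage):
    """Count (traps, eager_restores) for a sequence of time slices.

    fpu_usage is an iterable of booleans: True if the task touches the FPU
    during that time slice.
    """
    counter = 0
    traps = 0
    eager_restores = 0
    for uses_fpu in fpu_usage:
        if counter >= LAZY_THRESHOLD:
            eager_restores += 1      # state restored at switch time, no trap
        elif uses_fpu:
            traps += 1               # lazy path: the first FPU use traps
        if uses_fpu:
            counter = (counter + 1) % COUNTER_WRAP
        else:
            counter = 0              # streak broken, back to lazy behavior
    return traps, eager_restores
```

For a task that uses the FPU in every one of 300 time slices, this model takes a trap only during the first 5 slices and again for 5 slices after the counter wraps, instead of 300 traps with the purely lazy scheme. A task that never uses the FPU sees neither traps nor eager restores.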

Use 'regparm' in x86-32

This is another not-relevant-to-users-yet-interesting-for-geeks feature. It has been available as an option for a while, but it's the default now. Since forever, x86 code has passed function parameters on the stack. Modern architectures (PPC, SPARC, etc) use registers: it's much faster, since you don't need to do anything to bring the parameters back; they are just there, in the registers. The x86 world (including Linux) continued using the stack for parameter passing for compatibility with existing software, compilers, etc. Compilers only gained extensions to optionally use registers for parameter passing in a given function (usually involving the 'fastcall' keyword), intended for performance-critical paths.

Thanks to a GCC extension, the Linux kernel uses the '-mregparm=3' compile option, which means that as long as a function takes three or fewer arguments, GCC will automatically use registers to pass its parameters. And if you're wondering about x86-64: on that platform, using registers has always been the default.

round_jiffies() infrastructure

This is an example of the ongoing power-saving trend in the Linux kernel. This feature introduces the round_jiffies()/round_jiffies_relative() functions. These functions round a jiffies value to the next whole second. The target of this rounding is all the "we don't care exactly when" timers. By rounding these timers to whole seconds, all such timers will fire at the same time, rather than at various times spread out; with dynamic ticks, these extra timers cause wakeups from deep CPU sleep states and thus waste power.
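
The arithmetic behind the idea is simple. The Python sketch below only shows the basic rounding and its effect on the number of wakeups; the real kernel functions may differ in details, HZ = 250 is an assumed tick rate, and the function names are made up for this example.

```python
HZ = 250  # timer ticks per second; an assumed value, it is a build-time option


def round_up_to_second(j, hz=HZ):
    """Round a jiffies value up to the next whole second (a multiple of hz)."""
    return ((j + hz - 1) // hz) * hz


def wakeups(expiries, rounded, hz=HZ):
    """Number of distinct moments at which the CPU must wake up."""
    if rounded:
        expiries = [round_up_to_second(j, hz) for j in expiries]
    return len(set(expiries))
```

Three timers due at jiffies 1001, 1100 and 1249 would wake the CPU three times; after rounding, all three expire at jiffy 1250 and share a single wakeup.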

New drivers

Here are some important drivers that have been added to the Linux tree. Note that only important new drivers are listed here; a lot of new device support added to already existing drivers is not listed:

Networking:
- Driver for the Atmel MACB on-chip ethernet module
- Tsi108/9 on-chip ethernet device driver
- Netxen 1G/10G ethernet driver
Hwmon:
- New Winbond W83793 hardware monitoring driver
- New PC87427 hardware monitoring driver
- New AMS hardware monitoring driver
I2C:
- New ARM Versatile/Realview bus driver
- New Atmel AT91 bus driver
- New Philips PNX bus driver
Watchdog:
- NS pc87413-wdt watchdog driver
- MIPS RM9000 on-chip watchdog device driver
Input:
- Add Philips UCB1400 touchscreen driver
- Add driver for keyboard on AAED-2000 development board (ARM)
Graphics:
- Fbdev driver for IBM GXT4500P videocards
RTC:
- rtc-omap driver

Various core changes

Memory management, block layer, etc
- Make the readside of the radix-tree (used in the page-cache) RCU lockless
- Shared page tables for hugetlb
- New swap token algorithm. The old algorithm had a crude timeout parameter that was used to hand over the token from one task to another. The new algorithm transfers the token to the tasks that are in need of it. The urgency for the token is based on the number of times a task is required to swap in pages. Accordingly, the priority of a task is incremented if it has been badly affected by swap-outs. To ensure that the token doesn't bounce around rapidly, the token holders are given a priority boost. The priority of a task is also decremented if its rate of swap-ins keeps falling
- Memory page_alloc zonelist caching speedup: optimize the critical zonelist scanning for free pages in the kernel memory allocator by caching the zones that were recently (in the last second) found to be full, and skipping them. Benchmarks on a 56-CPU/96GB-RAM system can be found in the commit
- fdtable: Implement new pagesize-based fdtable allocator
- Optimize o_direct on block devices
- Support larger block pc requests. Modify blk_rq_map/unmap_user() so that it supports requests larger than bio by chaining them together
- Add numa node information to struct device
- Add 'noaliencache' boot option to disable numa alien caches. When using numa=fake on non-NUMA hardware there is no benefit to having the alien caches, and they consume much memory
Workqueue revamp. struct work_struct was a bit bloated, so work has been done to slim it down, resulting in a split between delayable and non-delayable events, and some API changes; code broken by them needs to be adapted to the new workqueue API
TTY: termios revamp, adds proper speed control
Generic BUG implementation
Driver core: add API for internal notification of bus events; show the initialization state (live, coming, going) of a module (cat /sys/module/usbcore/initstate); show drivers in /sys/module/
Sysrq: add new feature Sysrq + X, which shows blocked (TASK_UNINTERRUPTIBLE) tasks and is useful for debugging IO stalls; add sysrq_always_enabled boot option
Create CONFIG_SYSFS_DEPRECATED
Add child reaper to pid_namespace
Allow user processes to raise their oom_adj value
Use softirq for load balancing
LOG2: Implement a general integer log2 facility in the kernel
bit reverse library
Implement prof=sleep profiling. TASK_UNINTERRUPTIBLE sleeps will be taken as a profile hit, and every millisecond spent sleeping causes a profile-hit for the call site that initiated the sleep
kprobes: enable booster on the preemptible kernel
Switch pci_{enable,disable}_device() to be nestable, so that, e.g., three calls to enable_device() require three calls to disable_device(). The reason for this is to simplify PCI drivers for multi-interface/capability devices. These are devices that cram more than one interface into a single function. A relevant example is the Wireless [USB] Host Controller Interface
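
The nestable enable/disable semantics described in the last item amount to reference counting. A minimal Python sketch, assuming a simple counter and using a hypothetical PciDevice class rather than the kernel's actual data structures:

```python
class PciDevice:
    """Toy model of a device whose enable/disable calls nest."""

    def __init__(self):
        self.enable_count = 0
        self.hardware_enabled = False

    def enable(self):
        self.enable_count += 1
        if self.enable_count == 1:
            self.hardware_enabled = True    # only the first call touches hardware

    def disable(self):
        if self.enable_count == 0:
            return                          # unbalanced call, ignored in this sketch
        self.enable_count -= 1
        if self.enable_count == 0:
            self.hardware_enabled = False   # only the last call really disables
```

With this scheme, two sub-drivers sharing one function can each call enable() and disable() without knowing about each other: the device stays enabled until the last user has released it.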