1 The real BKL end game
The BKL is expected to disappear completely in 2.6.39; Arnd Bergmann has released a patch series that removes the lock itself.
2 LCA: Server power management
The current state of data centers: roughly 50% of the power goes to computing and 50% to everything else (network infrastructure and power supply loss, but the biggest component is air conditioning). The best contemporary data centers have been able to reduce their overhead to about 20% - a big improvement. Cogeneration techniques - using heat from data centers to warm buildings, for example - can reduce that overhead even further. A 48-core system, Matthew says, will draw about 350W when it is idle. When the CPU is working, though, the situation is a bit different; the power consumption is about 20W per core, or about 960W for a busy 48-core system.
- Linux is better than any other operating system with regard to CPU power; we have more time in deep idle states and fewer wakeups than others.
- Current interest centers on the memory controller: attention is shifting toward memory power management. If all of the CPUs in a package within the system can be idled, the associated memory controller will go idle as well. It's also possible to put memory into "self-refresh" mode if it is idle, reducing power use while preserving the contents. In other situations, running memory at a lower clock rate can reduce power usage.
- Simply turning a system off: when virtual machines are lightly loaded, they can be consolidated onto a small number of machines and the remaining machines turned off.
- The TSC frequency-scaling problem has been solved. Once upon a time, changing the CPU frequency would change the rate of the TSC, but that problem has been solved by the CPU vendors for a few years now.
- Lowering the frequency does not help much; the best results usually come from running at full speed and spending more time in a sleep state ("C state"). In addition, manufacturers have caused the TSC to run even when the CPU is sleeping.
- Another interesting feature of contemporary CPUs is the "turbo" mode, which can allow a CPU to run in an overclocked mode for a period of time. Using this mode can get work done faster, allowing longer sleeps and better power behavior, but it depends on good power management if it is to work at all. If a core is to run in turbo mode, all other cores on the same die must be in a sleep state. The end result is that turbo mode can give good results for single-threaded workloads.
- Some effort is going into powering down unused hardware components. Similar things can be done with other types of hardware - firewire ports, audio devices, SD ports, etc.
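The power figures quoted above can be checked with a little arithmetic. The following is a minimal sketch, not anything from the talk itself: the function names are made up for illustration, and it assumes that the "overhead" percentage is a share of total facility power (as in the 50/50 split mentioned above).

```python
def busy_power_watts(cores: int, watts_per_core: float) -> float:
    """Power drawn by the cores when every one of them is busy."""
    return cores * watts_per_core


def facility_power_watts(it_power: float, overhead_share: float) -> float:
    """Total facility power needed to deliver `it_power` watts of computing.

    Assumption (not stated in the notes): `overhead_share` is the fraction
    of *total* power spent on cooling, networking, and power supply loss.
    """
    if not 0.0 <= overhead_share < 1.0:
        raise ValueError("overhead_share must be in the range [0, 1)")
    return it_power / (1.0 - overhead_share)


if __name__ == "__main__":
    busy = busy_power_watts(48, 20)  # 48 cores at about 20W each
    print("busy 48-core system:", busy, "W")
    for share in (0.5, 0.2):
        total = facility_power_watts(busy, share)
        print(f"overhead share {share:.0%}: about {total:.0f} W at the facility")
```

Under that assumption, cutting the overhead share from 50% to 20% reduces the facility-level cost of the same busy server from twice its own draw to 1.25 times its own draw.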
In summary: Linux is doing pretty well with regard to enterprise-level power management; the GPU is the only place where we perform worse than Windows does.
3 Concurrent code and expensive instructions
Paul McKenney walks through an academic paper that interests him: "Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated".
4 A tale of two SCSI targets
A comparison of LIO and SCST. The decision has already been made, so there is not much to see here.
1 Bypassing linux-next
About the discontent caused by two patches that went directly into the mainline without passing through linux-next.
2 2.6.38 merge window part 2
- The transparent huge pages feature has been merged.
- A new tool called turbostat has been added; it can be used to obtain various types of performance statistics from Intel processors. Also added is x86_energy_perf_policy, which can be used to tweak the performance/power usage tradeoff on Intel CPUs.
- The kernel can now synchronize its internal time to an external pulse-per-second (PPS) signal with a high degree of accuracy. The kernel has also gained the ability to generate (and accept) PPS signals on a parallel port, assuming one can still find a computer with such a port.
- The x86 architecture can now boot XZ-compressed kernels.
- Basic support for multitouch panels has been added to the human input devices (HID) layer.
- The block I/O bandwidth controller can now be used with hierarchical control groups.
- The block layer has a new disk event handling mechanism. What that means is that detection of device events (the insertion of an optical disc, for example) can be done in the drivers, eliminating the need to poll devices from user space.
- The fallocate() system call can now be used to punch holes in the middle of files. Currently this feature is supported by XFS and OCFS2.
Changes visible to kernel developers include:
- ktest.pl, a script which can automate the process of building, testing, and bisecting kernels, has been added to the tools directory.
- The "%pK" format specifier can be used to print the value of potentially sensitive kernel pointers, especially in places like /proc files. The behavior of this specifier depends on the value of /proc/sys/kernel/kptr_restrict; a value of zero means that kernel pointers will be printed as usual, one causes pointers to be printed as zero for users without CAP_SYSLOG, and two hides the pointers for all users.
- Some new dentry operations have been added to support automounters within the VFS.
- The fallocate() filesystem callback has been moved from struct inode_operations to struct file_operations.
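The three kptr_restrict settings described in the "%pK" item above amount to a small decision table. Here is a user-space sketch of that rule exactly as stated in these notes; it is a model for illustration, not kernel code, and the function name and the 16-digit output width are assumptions.

```python
def format_kernel_pointer(address: int, kptr_restrict: int,
                          has_cap_syslog: bool) -> str:
    """Model of how a "%pK" value is shown under each kptr_restrict setting.

    0: pointers are printed as usual
    1: pointers are printed as zero unless the reader has CAP_SYSLOG
    2: pointers are hidden from all users
    """
    if kptr_restrict not in (0, 1, 2):
        raise ValueError("kptr_restrict must be 0, 1, or 2")

    if kptr_restrict == 0:
        visible = True
    elif kptr_restrict == 1:
        visible = has_cap_syslog
    else:
        visible = False

    # Hidden pointers are rendered as zero, as the notes describe.
    shown = address if visible else 0
    return f"{shown:016x}"  # assumption: 64-bit pointer, fixed-width hex


if __name__ == "__main__":
    example = 0xFFFFFFFF81000000  # hypothetical kernel address
    for level in (0, 1, 2):
        for cap in (False, True):
            print(level, cap, format_kernel_pointer(example, level, cap))
```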
3 Transparent huge pages in 2.6.38
Andrea gave a talk on THP with some interesting benchmark results. Implementation details:
- Current Linux kernels assume that all pages found within a given virtual memory area (VMA) will be the same size. Thus, much of the initial part of the patch series is dedicated to enabling mixed page sizes within a VMA.
- The patch set modifies the page fault handler: it first tries to allocate a huge page; if that succeeds, any small pages already allocated within the same range are released. If it fails, the fault takes the original path.
- When huge pages get split: (a) on swap-out, where the huge page is handled as multiple small pages; (b) on calls such as mprotect() and mlock().
- The "khugepaged" kernel thread will occasionally attempt to allocate a huge page; if it succeeds, it will scan through memory looking for a place where that huge page can be substituted for a bunch of smaller pages.
- Limitations: the current patch only works with anonymous pages; the work to integrate huge pages with the page cache has not yet been done. It also only handles one huge page size (2MB).
- System administrators have a number of knobs that they can tweak; see Documentation/vm/transhuge.txt for all the details.
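As a small illustration of the administrator knobs mentioned above, the sketch below reads the current THP mode from sysfs and works out how many small pages one huge page replaces. It assumes the `/sys/kernel/mm/transparent_hugepage/enabled` file described in transhuge.txt, whose contents list the options with the active one in square brackets, and a 4KB base page size (the usual x86 value, not stated in these notes). The helper names are made up for this example.

```python
SMALL_PAGE_BYTES = 4 * 1024        # assumption: x86 base page size
HUGE_PAGE_BYTES = 2 * 1024 * 1024  # the single huge page size handled (2MB)

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def small_pages_per_huge_page() -> int:
    """Number of small pages that a single huge page can stand in for."""
    return HUGE_PAGE_BYTES // SMALL_PAGE_BYTES


def selected_option(sysfs_text: str) -> str:
    """Return the bracketed word from text like 'always [madvise] never'."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no bracketed option found")


def read_thp_mode(path: str = THP_ENABLED_PATH):
    """Return the active THP mode, or None if it cannot be determined."""
    try:
        with open(path) as sysfs_file:
            return selected_option(sysfs_file.read())
    except (OSError, ValueError):
        return None  # kernel without THP, or a non-Linux system


if __name__ == "__main__":
    print("small pages per huge page:", small_pages_per_huge_page())
    print("current THP mode:", read_thp_mode())
```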
4 Reworking disk events handling
For removable drives, polling is unavoidable. The current problems: there are a few problems with how polling is done on Linux; these were nicely outlined by Tejun. Polling on Linux requires opening the device; this is a somewhat heavyweight operation which does not naturally line up with other operations which might wake the processor.
Tejun's patch works by moving the polling into the kernel. That makes the polling more efficient by removing the need to open the device and by allowing the kernel to delay polling wakeups until something else is going on as well. There is a new function added to struct block_device_operations which should be implemented by drivers: unsigned int (*check_events) (struct gendisk *disk, unsigned int clearing); Drivers do not yet make wide use of the check_events interface; for the existing user-space polling call, see changed = ioctl(drive, CDROM_MEDIA_CHANGED, CDSL_CURRENT);
1 The end of dcache_lock
A very significant update. As Nick notes, out-of-tree filesystems will need some changes to work with the new VFS.
2 How not to get a protocol implementation merged
UDPCP is a UDP-based protocol used for communications between cellular base stations. UDPCP offers reliable transfer, multicast, and more. The patch was rejected because it currently supports only IPv4 and not IPv6, even though all base stations today are in fact IPv4-based.
3 2.6.38 merge window part 1
- The per-session group scheduling patch has been merged. This change should yield better interactive response under a number of workloads.
- The dcache scalability patch set has been merged. This tricky code can yield significant performance improvements for some types of filesystem-heavy workloads.
- Kernel modules are finally loaded with read-only code on the x86 architecture; data is now non-executable across the entire kernel.
- Transmit packet steering (XPS) is now supported by the networking layer. This feature improves transmit performance by placing outgoing data on the proper (CPU-local) queue.
- Support for the batman-adv mesh networking protocol
- The patch set has been merged.
- The ext3 filesystem has gained support for batched discard and the FITRIM ioctl().
- Emulation for the Video4Linux1 API has been removed from the kernel.
Changes visible to kernel developers include:
- Flags can now be specified for tracepoints with the TRACE_EVENT_FLAGS() macro. The initial flag of interest is TRACE_EVENT_FL_CAP_ANY, which allows the tracepoint to be used by unprivileged users; this flag has been applied to the system call tracepoints.
- The perf trace command has been renamed to perf script.
- are now supported by the kernel.
- There is a new capability bit (CAP_SYSLOG) which controls access to the system log.
- The
"timerlist" infrastructure has been added for kernel subsystems which
must manage lists of timers. See for an
overview of the API.
4 The CHOKe packet scheduler
A good summary article. The problems with congestion control:
- Some TCP implementations are more dutiful than others when it comes to congestion control.
- An increasing amount of traffic on the net uses other protocols (UDP in particular) which do not have congestion control built into them.
- Excessive queue sizes in routers ("bufferbloat") can also disguise congestion problems until it is too late.
An alternative is the CHOKe packet scheduler; CHOKe stands either for "CHOose and Kill" or "CHOose and Keep," depending on one's attitude toward the problem. Stephen Hemminger has recently posted a CHOKe implementation for Linux.
CHOKe is intended for points where multiple flows come together - routers and bridges, primarily. The idea behind CHOKe is to keep the length of transmit queues under control and to penalize flows with excessive traffic while avoiding the need to maintain any sort of per-flow state. The key feature of CHOKe - the one which distinguishes it from RED (from which it is derived) - is the check against a random packet in the queue: a packet is picked at random from the queue and used for the decision.
CHOKe is mentioned in the thesis 'The Earliest Deadline First Scheduling for Real-Time Traffic in the Internet'.
5 Extending the use of RO and NX
Strengthened protection:
- The kernel .rodata segment has been able to be marked read-only since 2.6.16 in early 2006, depending on the setting of CONFIG_DEBUG_RODATA. In 2.6.25, the kernel .rodata segment was additionally marked NX (i.e. no-execute), but only for the x86_64 architecture.
- Matthieu Castet's revised patch: if CONFIG_DEBUG_RODATA is set, various sections of the kernel (.text and .rodata) are page aligned for both their start and end addresses. The NX bit is set for all pages from the end of the .text (i.e. code) section to the _end address that marks the end of the kernel's data section. Special cases: some older systems that use PCI BIOS require that some pages in the 640K-1M region be executable. There are also some ISA mappings that require read-write access to that region. The patch just sets pages in that region to be RW+X on systems where PCI BIOS is used. The second change simply modifies free_init_pages() to turn on NX for any pages that are freed that way, so that those pages have to be explicitly allowed to store executable code when they are reused.
- A related patch adds read-only and no-execute flags to the pages used by kernel modules. The patch splits the module_core and module_init regions into three parts: code, read-only data, and read-write data. Each of those parts is page aligned and the page access permissions are set just before load_module() returns. For the code pieces, RO+X are set, while the data parts get NX and either RO or RW depending on the type of data. These changes are all governed by the setting of CONFIG_DEBUG_SET_MODULE_RONX.
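The permission scheme for module pages described above can be written down as a small table, together with a check that no part is both writable and executable. This is only an illustrative model of the rule as summarized in these notes; the names `Perm`, `MODULE_PART_PERMS`, `describe`, and `writable_and_executable` are invented for this sketch and do not come from the kernel.

```python
from typing import Dict, List, NamedTuple


class Perm(NamedTuple):
    writable: bool
    executable: bool


# The three parts each module region is split into, with the flags
# the notes say are applied just before load_module() returns.
MODULE_PART_PERMS: Dict[str, Perm] = {
    "code": Perm(writable=False, executable=True),     # RO+X
    "rodata": Perm(writable=False, executable=False),  # RO+NX
    "data": Perm(writable=True, executable=False),     # RW+NX
}


def describe(perm: Perm) -> str:
    """Render a permission pair in the RO/RW + X/NX shorthand."""
    access = "RW" if perm.writable else "RO"
    execute = "X" if perm.executable else "NX"
    return f"{access}+{execute}"


def writable_and_executable(perms: Dict[str, Perm]) -> List[str]:
    """Names of parts that are both writable and executable."""
    return [name for name, perm in perms.items()
            if perm.writable and perm.executable]


if __name__ == "__main__":
    for name, perm in MODULE_PART_PERMS.items():
        print(f"{name}: {describe(perm)}")
    print("W+X parts:", writable_and_executable(MODULE_PART_PERMS))
```

The same check flags the PCI BIOS special case: a region marked RW+X, such as `Perm(writable=True, executable=True)`, would be reported by `writable_and_executable()`.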
In the future, expect CONFIG_DEBUG_RODATA and CONFIG_DEBUG_SET_MODULE_RONX to be turned on for most distributions, or to default to "on".
1 Paul McKenney's parallel programming book
Worth following: Paul McKenney has announced the availability of his book on parallel programming.
2 Gettys: Bufferbloat in 802.11 and 3G Networks
Jim Gettys has another post on bufferbloat, this time on 802.11 and 3G networks.
3 Announcing the beta release of PowerTOP 2.0
4 A Linux kernel compatibility layer for FreeBSD?
Vocabulary: KPI [Kernel Programming Interface]. Roberson made use of an existing InfiniBand stack, but that work is based on Linux, so he wrote a fairly large compatibility layer that maps the Linux API onto the FreeBSD API. This triggered some discussion. One thing is clear, though: the community of FreeBSD users and developers is just not large enough, and in many areas it cannot keep up.
5 The trouble with firmware
A question of stance. Debian is moving the non-free firmware out of its main repository for the upcoming 6.0 ("Squeeze") release. But there are others who find even that insufficient and would like to see any mention of the non-free firmware files removed from the kernel. One such project aims to deliver a completely free (under its definition) Linux distribution. How to go about that has raised some questions. The plan seems to go far beyond just creating a kernel with obfuscated firmware names, though.
6 Who wrote 2.6.37
Some 1,140,000 lines of code were added, and 641,000 lines were removed, for a net growth of 494,000 lines. Most notably, perhaps: the 2.6.37 kernel includes patches from 1,250 developers, the highest ever.