2012-05-15 21:30:15


TechCrunch has .



Google has announced that it has put up a (read-only) mirror at ..




While storage devices are billed as being "random access" in nature, the truth of the matter is that operations to some parts of the device can be faster than operations to others. Rotating storage has a larger speed differential than flash, while hybrid devices may show a large difference indeed. Given that differences exist, it is natural to want to place more frequently-accessed data on the faster part of the device.



The proposal, as posted by Ted Ts'o, is to create a couple of new flags to be provided by applications at the time a file is created. A file expected to be accessed frequently would be created with O_HOT, while a file that will see traffic only rarely would be marked with O_COLD. It is assumed that the filesystem would, if possible, place O_HOT files in the fastest part of the underlying device.
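To make the placement idea concrete, here is a toy sketch of an allocator that honors a hot/cold hint. Everything in it is hypothetical: the `Hint` enum, the `ToyDevice` class, and the assumption that lower block numbers are the fast end of the device. It is not the patch's implementation and does not use the real O_HOT/O_COLD flags.

```python
from enum import Enum


class Hint(Enum):
    """Hypothetical stand-ins for the proposed O_HOT / O_COLD flags."""
    HOT = "hot"
    COLD = "cold"


class ToyDevice:
    """A device with `size` blocks; block 0 is assumed to be the fastest."""

    def __init__(self, size):
        self.size = size
        self.free = [True] * size

    def allocate(self, nblocks, hint):
        # Hot files scan upward from the fast end, cold files scan
        # downward from the slow end, so the two kinds stay apart.
        if hint is Hint.COLD:
            order = range(self.size - 1, -1, -1)
        else:
            order = range(self.size)
        chosen = [b for b in order if self.free[b]][:nblocks]
        if len(chosen) < nblocks:
            raise OSError("no space left on toy device")
        for b in chosen:
            self.free[b] = False
        return sorted(chosen)
```

A real filesystem would have to weigh the hint against fragmentation and free-space constraints; the sketch only shows the direction of the placement decision.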




Al Viro's solution:

* switch to fget_light/fput_light where possible; it's not needed for the rest, but is useful anyway

* move the guts of filp_close() (everything it does prior to the fput() at the end) into a new helper, turning filp_close() into a couple of calls (inlined, at that). Equivalent transformation.

* fput() in its current form is renamed to fput_nodefer(); does the same thing fput() does now.

The effect is to turn risky fput() calls into an asynchronous operation running in a separate thread. No knowledge of locking rules is added to fput(); instead, the situation is avoided altogether whenever possible, and all remaining calls are done asynchronously. However, all of the asynchronous calls must complete before returning to user space so that the semantics of close() remain correct.




Paul Moore has , which is meant to make it easier for applications to take advantage of the packet-filter-based seccomp mode. That will lead to more secure applications that can permanently reduce their ability to make "unsafe" system calls, which can only be a good thing for Linux application security overall. Applications such as the Chromium browser need this capability.




The fallocate() system call can be used to increase the size of a file without actually writing to the new blocks. It is useful as a way to encourage the kernel to lay out the new blocks contiguously on disk, or just to ensure that sufficient space is available before beginning a complex operation. Filesystems implementing fallocate() take care to note that the new blocks have not actually been written; attempts to read those uninitialized blocks will normally just return zeroes. To do otherwise would be to risk disclosing information remaining in blocks recently freed from other files.
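As a rough illustration of the behavior described above, the following sketch preallocates a file and reads back part of the new range. It uses Python's `os.posix_fallocate()` wrapper rather than calling fallocate() directly; the helper names `preallocate` and `demo` are made up for this example, and the `ftruncate()` fallback is an assumption added so the sketch runs on filesystems that reject preallocation.

```python
import os
import tempfile


def preallocate(fd, length):
    """Reserve `length` bytes for `fd`, starting at offset 0."""
    try:
        os.posix_fallocate(fd, 0, length)
    except (AttributeError, OSError):
        # Fallback, not equivalent: ftruncate() extends the file size
        # but reserves no blocks, so the result is a sparse file.
        os.ftruncate(fd, length)


def demo(length=1 << 20):
    """Preallocate a temporary file and return (size, first 4096 bytes)."""
    with tempfile.TemporaryFile() as f:
        preallocate(f.fileno(), length)
        size = os.fstat(f.fileno()).st_size
        f.seek(0)
        head = f.read(4096)
        return size, head
```

Nothing was ever written to the file, yet the read returns zeroes rather than whatever the underlying blocks previously held.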

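The problem: with random writes, writing a small amount of data forces the entire uninitialized extent to be initialized, which hurts performance.

Zheng Liu recently implemented a patch adding a flag (called FALLOC_FL_NO_HIDE_STALE) that marks new blocks as being initialized, even though the filesystem has not actually written them; these blocks will thus contain random old data. This may, however, lead to security problems, and the patch's prospects are unclear.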




recently posted by Kay.

The patch does a few independent things - a cause for a bit of complaining on the mailing list. The first of these is to change the kernel's internal log buffer from a stream of characters into a series of records. Each message is stored into the buffer with a header containing its length, sequence number, facility number, and priority. In the end, Kay says, the space consumed by messages does not grow; indeed, it may shrink a bit.

The second change is to allow the addition of a facility number and a "dictionary" containing additional information that, most likely, will be of interest to automated parsers.

Finally, the patch changes the appearance of log messages when they reach user space.
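The record-oriented buffer described above can be sketched in a few lines. The header layout here (a 16-bit length, 64-bit sequence number, 8-bit facility, 8-bit priority) is an assumption chosen for illustration, not the layout used by the patch, and the sketch appends to a growing buffer instead of wrapping around like a real ring buffer.

```python
import struct

# Hypothetical header: total record length, sequence number,
# facility, priority (little-endian, no padding).
HEADER = struct.Struct("<HQBB")


class RecordLog:
    """Log buffer holding discrete records instead of a character stream."""

    def __init__(self):
        self.buf = bytearray()
        self.next_seq = 0

    def append(self, text, facility=0, priority=6):
        payload = text.encode("utf-8")
        total = HEADER.size + len(payload)
        self.buf += HEADER.pack(total, self.next_seq, facility, priority)
        self.buf += payload
        self.next_seq += 1

    def records(self):
        """Yield (seq, facility, priority, text) by walking the headers."""
        offset = 0
        while offset < len(self.buf):
            total, seq, facility, priority = HEADER.unpack_from(self.buf, offset)
            start = offset + HEADER.size
            text = self.buf[start:offset + total].decode("utf-8")
            yield seq, facility, priority, text
            offset += total
```

Because each record carries its own length, a reader can step from message to message without scanning for newlines, and the sequence numbers let it notice when messages have been lost.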



Patches accepted into the stable kernel must first be merged into the mainline.

The "mainline first" rule takes advantage of this network of users to ensure that fixes are applied for the long term and not just for a specific stable series. At the cost of (occasionally) making users wait a short while for a fix, it ensures that they will not need the same fix again in the future and helps to make the kernel less buggy in general.


4 LTTng 2.0: Tracing for power users and developers - part 2




2 A new security subsystem wiki

Kernel-related security development information can now be found on .


Juri has posted to restart the discussion. There is a new git repository, an application designed to test deadline scheduling, and .


In a system supporting containers, any globally-visible resource must be wrapped in a namespace layer that provides each container with its own view. There are many such resources on a Linux system: process IDs, filesystems, and network interfaces, for example. Even the system name and time can differ from one container to the next.


The latest piece is from Eric Biederman. The "user namespace" can be thought of as the encapsulation of user/group IDs and associated privilege; it allows the owner of a container to run as the root user within that container while isolating the rest of the system from the in-container users.
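One way to picture how root inside a container can be an unprivileged user outside it is an ID translation table. The sketch below assumes a three-column extent format (first ID inside the namespace, first ID outside, count), which mirrors the uid_map files found under /proc in kernels with user namespace support; the function names are invented for this example.

```python
def parse_uid_map(text):
    """Parse lines of 'inside_start outside_start count' into tuples."""
    extents = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(x) for x in line.split())
            extents.append((inside, outside, count))
    return extents


def to_host_uid(extents, uid):
    """Translate an in-container UID to a host UID, or None if unmapped."""
    for inside, outside, count in extents:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return None
```

With a map of `0 100000 65536`, UID 0 in the container corresponds to UID 100000 on the host, so in-container root holds no special privilege over files owned by real host users.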



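It has been said that "Linux is evolution, not intelligent design". As an illustration: in the case of Android's "timed gpio", it was discovered that the rumble support in the "input" subsystem already provides a similar framework.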


6 LTTng 2.0: Tracing for power users and developers - part 1

The Linux Trace Toolkit next generation (LTTng) 2.0 tracer is the result of a two-year development cycle involving a team of dedicated developers. It can be installed on a vanilla or distribution kernel without any patches.


LTTng provides an integrated interface for both kernel and user-space tracing. A "tracing" group allows non-root users to control tracing and read the generated traces. It is multi-user aware, and allows multiple concurrent tracing sessions.


LTTng allows access to tracepoints, function tracing, CPU PMU counters, kprobes, and kretprobes. It provides the ability to attach "context" information to events in the trace (e.g. any PMU counter, process and thread ID, container-aware virtual PIDs and TIDs, process name, etc). All the extra information fields to be collected with events are optional, specified on a per-tracing-session basis (except for timestamp and event id, which are mandatory). It works on mainline kernels (2.6.38 or higher) without any patches.



The Open Source Automation Development Lab has posted celebrating a full year's worth of testing of latencies on several systems running the realtime preemption kernel. OSADL is an industry consortium dedicated to encouraging the development and use of Linux in automated systems.



* The device mapper target has been merged. This target manages a read-only device where all blocks are checked against a cryptographic hash maintained elsewhere; it thus provides a certain degree of tampering detection. Details can be found in Documentation/device-mapper/verity.txt.

* Support for the has been merged into the kernel. Getting support into the compiler and the C library is an ongoing project, and the creation of distributions using this ABI will take even longer, but the foundation, at least, is now in place.

* The "high-speed synchronous serial interface" (HSI) framework has been merged. HSI is an interface that is mainly used to connect processors with cellular modem engines; it will be used for handset support in future kernel releases.

* The "common clock framework" unifies the handling of subsystem clocks, especially on the ARM architecture (though it is not limited to ARM). See Documentation/clk.txt for more information.
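The block-verification idea behind the device mapper target in the first item can be sketched as follows. This is a simplification under stated assumptions: it keeps a flat list of SHA-256 digests, one per 4096-byte block, whereas the real target organizes its hashes differently (see the documentation file named above). The names `build_hashes` and `read_block` are invented here.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def build_hashes(image):
    """Compute one digest per block; stored separately from the data."""
    return [hashlib.sha256(image[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(image), BLOCK_SIZE)]


def read_block(image, hashes, index):
    """Return block `index`, refusing it if the digest does not match."""
    block = image[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    if hashlib.sha256(block).digest() != hashes[index]:
        raise IOError(f"block {index} failed verification")
    return block
```

Checking happens at read time, so a modified block is detected only when something actually tries to use it.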




Ballooning for transparent huge pages

The question is how to preserve the number of huge pages available to a VM. Van Riel's solution requires that balloon pages become movable within the guest, which requires changes to both the balloon driver and potentially the hypervisor. Once that is established, it would also be nice to keep balloon pages within the same 2M regions.


Finding holes for mmap()

The problem is that of finding free virtual areas quickly during mmap() calls. Very simplistically, an mmap() requires a linear search of the virtual address space by virtual memory area (VMA) with some minor optimizations for caching holes and scan pointers. However, there are some workloads that use thousands of VMAs, so this scan becomes expensive. One possibility is to modify the RB tree.
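The cost of the linear search is easy to see in a simplified model. The function below is a first-fit scan over a sorted list of (start, end) ranges; it is a sketch of the general approach, not the kernel's code, and it assumes the ranges are non-overlapping, half-open, and lie within [low, high).

```python
def find_hole(vmas, length, low, high):
    """Return the lowest address in [low, high) with `length` free bytes.

    `vmas` is a sorted list of non-overlapping (start, end) ranges,
    each half-open. Returns None when no hole is large enough.
    """
    candidate = low
    for start, end in vmas:
        # The gap before this VMA runs from `candidate` up to `start`.
        if candidate + length <= min(start, high):
            return candidate
        candidate = max(candidate, end)
    # Finally, consider the gap after the last VMA.
    if candidate + length <= high:
        return candidate
    return None
```

Every call may visit every VMA, which is why a process with thousands of mappings pays for each mmap(). A tree that also records the largest gap beneath each node would allow whole subtrees to be skipped.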


Kernel interference

Christoph Lameter started by stating that each kernel upgrade resulted in slowdowns for his target applications (which are for high-speed trading).

One possible measure would be to isolate OS activities to a subset of CPUs, possibly including interrupt handling. That is, some CPUs would run nothing but user-space programs.
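The application's half of such an arrangement is pinning itself to the CPUs that have been set aside. The sketch below uses the scheduler affinity calls exposed by Python's `os` module on Linux; the helper name `pin_to_cpus` is made up, and keeping kernel work and interrupts off those CPUs is a separate administrative step that the sketch does not attempt.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to `cpus`; return the resulting set."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity control needs Linux")
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

Pinning alone only controls where the application runs; it does not stop the kernel from scheduling other work on the same CPUs, which is the interference the discussion was about.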


Copy offload

Copy offload is a method for allowing SCSI devices to copy ranges of blocks without involving the host operating system. It is designed to be a lot faster for large files because wire speed is no longer the limiting factor. In fact, in spite of the attention now, offloaded copy has been in SCSI standards in some form or other since the SCSI-1 days.



Flash media


James Bottomley asked if there were reasons that filesystems should start looking at storing long-lived and short-lived data separately and not mixing the two. Sprouse said that may eventually be needed. He said there is a trend toward hybrid architectures that have small amounts of high-endurance (i.e. can handle many more write cycles) flash and much larger amounts of low-endurance flash. Filesystems may want to take advantage of that by storing things like the journal in the high-endurance portion, and more stable OS files in the low-endurance area. Or storing hot data on high-endurance and cold data on low-endurance. How that will be specified is not determined, however.


Issues with mmap_sem

The worst mmap_sem hold times occur in cases such as when a mapped file is accessed and the atime must be updated, or when a threaded application is scanning files and hammering mmap_sem. The user-visible effects of this can be embarrassing. For example, ps can stall for long periods of time if a process is stalled on mmap_sem, which makes it difficult to debug a machine that is responding poorly. mmap_sem may eventually be removed.

