1
Unlike CMA, the big chunk allocator does not rely on setting aside memory
at boot time. Instead, it attempts to assemble a suitable chunk of memory
at allocation time by moving other pages around. The kernel's memory
compaction and page migration mechanisms have improved over time, and
memory sizes have grown, so this kind of large allocation is more
feasible than it once was.
2 A collection of tracing topics
Ingo insists: "We'll need to embark on this incremental path instead of a rewrite-the-world thing", that is, rather than developing an entirely new ABI.
Ingo to the concept of marking some tracepoints as stable
trace_printk() can be called like printk() (though without a logging
level), but its output does not go to the system log; instead,
everything printed via this path goes into the tracing stream as seen by
ftrace. When tracing is off, trace_printk() calls have no effect. When
tracing is enabled, trace_printk() data can be made available to a
developer with far less overhead than normal printk() output.
Unprivileged tracepoints
Access to tracepoints is currently limited to privileged users. Frederic Weisbecker has posted a patch which makes unprivileged access possible.
3 An alternative to suspend blockers
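A very good summary of the current Android PM mechanism; rich in material and worth reading.
Early wakelocks required the cooperation of user-space programmers.
The cpuidle mechanism alone is not sufficient.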
cpuidle-based system power management may not be sufficient to save as much energy as opportunistic suspend on the same system.
4 Ghosts of Unix past, part 4: High-maintenance designs
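A very good series of articles, worth reading.
The Bible describes the road to destruction as wide, while the road to life is narrow and hard to find.
These designs have no serious problems in themselves, but fitting them seamlessly into the rest of the system is very costly.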
"High maintenance" designs work perfectly well and do exactly what is
required. However they do not fit seamlessly into their surroundings
and, while they may not actually leave disaster in their wake, they do
impose a high cost on other parts of the system as a whole.
Four problems caused by setuid:
- The most obvious problem comes from the inherited environment. All libraries and all setuid programs need to be particularly suspicious of anything in the environment, and often need to explicitly ignore the environment when running setuid.
- An example of a more general conflict comes from the combination of setuid with executable shell scripts. This did not apply at the time that setuid was first invented.
- The signal delivery mechanism needs special handling for SIGCONT, simply because of the existence of setuid.
- When writing to a file, Linux (like various flavors of Unix) checks if the file is setuid and, if so, clears the setuid flag.
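The last point, the setuid flag being cleared on write, can be observed from user space. The following is a minimal sketch, assuming a POSIX system and an unprivileged user; the function name is made up for illustration. A privileged process (root, or one holding CAP_FSETID on Linux) may see the flag survive the write, so the result is reported rather than assumed.

```python
import os
import stat
import tempfile


def setuid_before_and_after_write():
    """Return (setuid set before the write, setuid set after the write)."""
    fd, path = tempfile.mkstemp()
    try:
        os.close(fd)
        # Mark our own scratch file setuid; the owner is allowed to do this.
        os.chmod(path, 0o4755)
        before = bool(os.stat(path).st_mode & stat.S_ISUID)
        # Modify the contents the way any ordinary writer would.
        with open(path, "ab") as f:
            f.write(b"tampered\n")
        after = bool(os.stat(path).st_mode & stat.S_ISUID)
        return before, after
    finally:
        os.unlink(path)


if __name__ == "__main__":
    before, after = setuid_before_and_after_write()
    print("setuid before write:", before)
    print("setuid after write: ", after)
```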
Filesystem capabilities do not fundamentally solve the problem, and the implementation and accompanying tools may not keep up.
The plan for Fedora 15 is to use filesystem capabilities instead of full
setuid. This isn't really a different mechanism, just a slightly
reworked form of the original. Setuid stores just one bit per file which
(together with the UID) determines the capabilities that the program
will have. In the case of setuid to root, this is an all or nothing
approach. Filesystem capabilities store more bits per file and allow
different capabilities to be individually selected, so a program that
does not need all of the capabilities of root will not be given them.
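The difference between one all-or-nothing bit and individually selectable capability bits can be sketched as a bitmask. This is only a model: the capability numbers are the ones defined in <linux/capability.h>, but the helper names and the 64-bit mask width are assumptions made for illustration, and nothing here touches real file attributes.

```python
# Capability numbers as defined in <linux/capability.h>.
CAP_NET_BIND_SERVICE = 10
CAP_NET_RAW = 13
CAP_SYS_ADMIN = 21

# Width of the modeled capability set (an assumption for this sketch;
# the real number of capabilities depends on the kernel version).
MODELED_CAPS = 64


def cap_mask(*caps):
    """Build a mask with only the listed capabilities selected."""
    mask = 0
    for cap in caps:
        mask |= 1 << cap
    return mask


def setuid_root_mask():
    """Model of classic setuid-root: every capability, all or nothing."""
    return (1 << MODELED_CAPS) - 1


def has_cap(mask, cap):
    """True if the capability is present in the mask."""
    return bool(mask & (1 << cap))


if __name__ == "__main__":
    # A program that only needs raw sockets gets just that one capability.
    raw_only = cap_mask(CAP_NET_RAW)
    print("raw sockets:", has_cap(raw_only, CAP_NET_RAW))
    print("sys admin:  ", has_cap(raw_only, CAP_SYS_ADMIN))
    print("setuid root has sys admin:", has_cap(setuid_root_mask(), CAP_SYS_ADMIN))
```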
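This part is about hard links, which are unsatisfactory in many ways. Plan 9 does not support them.
Tools such as tar and du must take special account of hard links.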
Anyone who can read a file can create a link to that file which the owner of the file may not be able to remove.
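As an illustration of the special care a du-like tool needs, the sketch below sums file sizes while counting each inode only once, so a file with several hard links is not counted several times. The function names are made up, and the sketch adds up apparent sizes (st_size) rather than allocated blocks.

```python
import os
import tempfile


def disk_usage(root):
    """Sum apparent file sizes under root, counting each inode once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another name for a file already counted
            seen.add(key)
            total += st.st_size
    return total


def demo():
    """Create a 1000-byte file plus a hard link to it and measure usage."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "data")
        with open(original, "wb") as f:
            f.write(b"x" * 1000)
        os.link(original, os.path.join(d, "alias"))
        return disk_usage(d)


if __name__ == "__main__":
    print(demo())
```

A naive walk that simply added up every directory entry would report 2000 bytes for the same tree.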
Editors need to take special care of linked files. It is generally safer
to create a new file and rename it over the original rather than to
update the file in place. When a file has multiple hard links it is not
possible to do this without breaking that linkage, which may not always
be desired.
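The broken linkage is easy to reproduce. The sketch below, with made-up helper names, saves a file using the write-new-then-rename approach and shows that a previously created hard link no longer refers to the same file afterwards.

```python
import os
import tempfile


def save_by_rename(path, data):
    """Write data to a temporary file, then rename it over path."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the directory entry


def demo():
    """Return (linked before save, linked after save, content seen via link)."""
    with tempfile.TemporaryDirectory() as d:
        doc = os.path.join(d, "doc")
        alias = os.path.join(d, "alias")
        with open(doc, "wb") as f:
            f.write(b"old")
        os.link(doc, alias)
        same_before = os.path.samefile(doc, alias)
        save_by_rename(doc, b"new")
        same_after = os.path.samefile(doc, alias)
        with open(alias, "rb") as f:
            alias_data = f.read()
        return same_before, same_after, alias_data


if __name__ == "__main__":
    print(demo())
```

After the save, the second name still points at the old inode with the old contents, which is exactly the linkage break described above.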
Hard links would also make it awkward to reason about any name-based access control approach.
Technology alone does not guarantee success.
Unfortunately, mere technical excellence is no guarantee of success. As Paul McKenney observed at the 2010 Kernel Summit, economic opportunity is at least an equal reason for success, and is much harder to come by.
The alternative is to live with our mistakes and attempt to minimize
their ongoing impact, deprecating that which cannot be discarded.
1
'trace' is our shot at improving the situation: it aims at providing a
simple to use and straightforward tracing tool based on the perf
infrastructure and on the well-known perf profiling workflow.
2 Simple user-space tracing
Ingo Molnar has posted a patch. It is currently implemented as an
extension to the prctl() system call which allows an application to
inject tracing data into the kernel.
3 Punching holes in files
The XFS and OCFS2 filesystems currently have the ability to "punch a
hole" in a file - a portion of the file can be marked as unwanted and the
associated storage released. Josef Bacik, noting that this capability
may be added to other filesystems in the near future, came to the
conclusion that the kernel should offer a standard interface for hole
punching. The result is a patch adding that ability.
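In mainline Linux, hole punching is available to user space through fallocate() with the FALLOC_FL_PUNCH_HOLE flag, combined with FALLOC_FL_KEEP_SIZE; these notes do not say whether the patch discussed here had exactly that form. The sketch below assumes a 64-bit Linux system and calls the C library through ctypes. The helper names punch_hole and demo are made up, and the demo returns None when the filesystem or platform does not support the operation.

```python
import ctypes
import os
import tempfile

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Release the storage behind [offset, offset + length) of an open file."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def demo():
    """Return (file size, hole reads as zeros), or None if unsupported."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.write(fd, b"x" * 65536)
        try:
            punch_hole(fd, 4096, 8192)
        except (OSError, AttributeError):
            return None  # filesystem or C library lacks hole punching
        size = os.fstat(fd).st_size
        hole_is_zero = os.pread(fd, 8192, 4096) == bytes(8192)
        return size, hole_is_zero


if __name__ == "__main__":
    print(demo())
```

Where the call succeeds, the file keeps its length and the punched range reads back as zeros.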
4 TTY-based group scheduling
There are quite a few comments; they may be worth reading.
Groups are thus a nice feature, but they have not seen heavy use since
they were merged for the 2.6.24 release. The reasons for that are clear:
groups require administrative work and root privileges to set up; most
users do not know how to tweak the knobs and would really rather not
learn. What has been missing all these years is a way to make group
scheduling "just work" for ordinary users. That is the goal of this
patch.
In short, this patch automatically creates a group attached to each TTY
in the system. All processes with a given TTY as their controlling
terminal will be placed in the appropriate group; the group scheduling
code can then share time between groups of processes as determined by
their controlling terminals.
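The effect of grouping by controlling terminal can be sketched with a little arithmetic. This is a toy model, not the scheduler's algorithm: it assumes every process is CPU-bound, that groups have equal weight, and that time is split evenly inside each group. The function name and the process list are made up.

```python
from collections import defaultdict
from fractions import Fraction


def cpu_shares(processes):
    """Map each pid to its CPU share, given (pid, tty) pairs."""
    groups = defaultdict(list)
    for pid, tty in processes:
        groups[tty].append(pid)
    if not groups:
        return {}
    per_group = Fraction(1, len(groups))  # time is first split between TTYs
    shares = {}
    for pids in groups.values():
        for pid in pids:
            shares[pid] = per_group / len(pids)  # then within each TTY
    return shares


if __name__ == "__main__":
    # One interactive task on tty1, a ten-process build on tty2.
    procs = [(1, "tty1")] + [(100 + i, "tty2") for i in range(10)]
    for pid, share in sorted(cpu_shares(procs).items()):
        print(pid, share)
```

Without groups, the single task on tty1 would compete as one of eleven processes; with per-TTY groups it keeps half of the CPU in this model.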
Opponents argue that this should be done in user space; I personally agree.
Lennart Poettering argued that "Binding something like this to TTYs is
just backwards"; he would rather see something which is based on
sessions. And, he said, all of this could better be done in user space.
5 The media controller subsystem
Contemporary video devices are not just frame grabbers anymore. That
complexity is revealing limitations in the kernel's device model,
prompting the proposal of a new "media controller" abstraction.
Video acquisition devices have never been entirely simple. Even a
minimal camera device will usually be a composite of at least three
distinct devices: a sensor, a DMA bridge to move frames between the
sensor and main memory, and an I2C bus dedicated to controlling the
sensor. Most devices coming onto the market now are more sophisticated
than that.
The components can operate independently, work together, or cooperate with devices from other subsystems. The current V4L2 system and the Linux device model do not account for this.
The patch creates a new media_device type which has the responsibility
of managing the various components which make up a media-related device.
These components are called "entities", and they can take many forms.
Sensors, DMA engines, video processing units, focus controllers, audio
devices, and more are all considered to be "entities" in this scheme.
Most entities will have at least one "pad," being a logical connection
point where data can flow into or out of the device. "Data" in this
sense can be multimedia data, but it might also be a control stream.
Pads are exclusively input ("sink") or output ("source") ports, and an
entity can have an arbitrary number of each. The final piece is called a
"link"; it is a directional connection from a source pad to a sink.
Links are created by the media device driver, but they can, in some
cases, be enabled or disabled from user space.
As an aside: entities also have a "group" number assigned to them;
groups are intended to indicate hardware which is meant to function
together. All of the units described above would probably be placed into
the same group by the driver.
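The relationship between entities, pads, links, and groups can be modeled in a few lines. This is a hypothetical data model written for these notes, not the kernel's media_device API; all class, field, and entity names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Pad:
    entity: "Entity"
    index: int
    direction: str  # "source" (output) or "sink" (input)


@dataclass(eq=False)
class Entity:
    name: str
    group: int  # hardware meant to function together shares a group
    pads: list = field(default_factory=list)

    def add_pad(self, direction):
        """Attach a new pad; a pad is exclusively a source or a sink."""
        if direction not in ("source", "sink"):
            raise ValueError("direction must be 'source' or 'sink'")
        pad = Pad(self, len(self.pads), direction)
        self.pads.append(pad)
        return pad


@dataclass(eq=False)
class Link:
    source: Pad
    sink: Pad
    enabled: bool = False  # some links may be toggled later

    def __post_init__(self):
        # A link is directional: it runs from a source pad to a sink pad.
        if self.source.direction != "source" or self.sink.direction != "sink":
            raise ValueError("a link must run from a source pad to a sink pad")


if __name__ == "__main__":
    sensor = Entity("sensor", group=1)
    bridge = Entity("dma-bridge", group=1)
    link = Link(sensor.add_pad("source"), bridge.add_pad("sink"))
    link.enabled = True
    print(link.source.entity.name, "->", link.sink.entity.name, link.enabled)
```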
The problem may be bigger than just media devices; other parts of the kernel have similar problems.
6 Making attacks a little harder
The goal is to make it harder for attackers to obtain information which
could be used to compromise the kernel.
One proposal is removing world-read access from /proc/kallsyms and
read-protecting System.map as well. However, an attacker does not need
to read /proc/kallsyms or System.map if the target system is running a
stock kernel; they need only dig up a package file containing the needed
information.
When the kernel exposes pointer values to user space, it gives
information to potential attackers. These values can be found in a
number of places, including the system log and numerous places in /proc.
Dan has posted a patch adding a new sysctl knob controlling access to
the syslog() system call.
The proposal is simple: as much of the kernel should be read-only as
possible, most especially function pointers and other execution control
points, which are the easiest target to exploit when an arbitrary kernel
memory write becomes available to an attacker.
Disallow automatic module loading.
It seems clear that a kernel which never allows users to trigger the
loading of modules is less likely to be affected by any vulnerability
which is found in a loadable module.
Develop tools that find vulnerabilities automatically.
One technique which can help in this regard is "fuzzing," the process of
passing random values into system calls and looking for unexpected
behavior.
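The idea behind fuzzing can be shown on a toy target. The sketch below does not call any real system call; fragile_parser is a made-up stand-in with a deliberate bug, and fuzz is a hypothetical helper that feeds it random integers and records any exception other than the documented one.

```python
import random


def fragile_parser(value):
    """Toy stand-in for a system call with a hidden out-of-bounds bug."""
    if value < 0:
        raise ValueError("negative length rejected")  # documented error
    buf = [0] * 16
    buf[value % 32] = 1  # bug: the index can run past the 16-slot buffer
    return 0


def fuzz(target, iterations=1000, seed=0, expected=(ValueError,)):
    """Call target with random values; collect unexpected exceptions."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    failures = []
    for _ in range(iterations):
        value = rng.randint(-2**31, 2**31 - 1)
        try:
            target(value)
        except expected:
            pass  # documented behavior, not a finding
        except Exception as exc:
            failures.append((value, type(exc).__name__))
    return failures


if __name__ == "__main__":
    found = fuzz(fragile_parser)
    print(len(found), "unexpected failures; first:", found[0] if found else None)
```

Each recorded value is an input that reproduces the bug, which is the main practical payoff of the technique.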
1 Embedded Linux Flag Version
A jointly coordinated kernel version for the embedded community.
2.6.35 will be the first embedded flag version, and it will be supported
by (at least) Sony, Google, MeeGo, and Linaro. "First, it should be
explained what having a flag version means. It means that suppliers and
vendors throughout the embedded industry will be encouraged to use a
particular version of the kernel for software development, integration
and testing. Also, industry and community developers agree to work
together to maintain a long-term stable branch of the flag version of
the kernel (until the next flag version is declared), in an effort to
share costs and improve stability and quality."
2 Linaro 10.11 released
3 Netoops
Netoops is a simple driver which will, in response to a kernel oops,
collect the most recent kernel logs and deliver them to a server across
the net.
4 Checkpoint/restart: it's complicated
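Too complex; it is unlikely to enter the mainline in the short term. There is also a user-space implementation project; each approach has its pros and cons.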
5 ELCE: Grant Likely on device trees
Some SoCs cannot enumerate their devices, and hard-coding them is clearly less effective than organizing them with a device tree. The question is: why do PPC and x86 support device trees as well?
A device tree represents the devices that are part of a particular
system, such that it can be passed to the kernel at boot time, and the
kernel can initialize and use those devices. For architectures that
don't use device trees, C code must be written to add all of the
different devices that are present in the hardware.
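The contrast between hard-coded device registration and a data-driven description can be sketched as follows. This is a conceptual model only: the board layout, the "compatible" strings, and the driver names are all hypothetical, and real device trees use their own source and binary formats rather than Python dictionaries.

```python
# Hypothetical board description: plain data, no code per device.
BOARD = {
    "name": "soc",
    "compatible": "example,soc-bus",
    "children": [
        {"name": "serial@1000", "compatible": "example,uart", "children": []},
        {
            "name": "i2c@2000",
            "compatible": "example,i2c",
            "children": [
                {"name": "sensor@48", "compatible": "example,temp-sensor",
                 "children": []},
            ],
        },
    ],
}

# Hypothetical table matching "compatible" strings to drivers.
DRIVERS = {
    "example,uart": "uart_driver",
    "example,i2c": "i2c_driver",
    "example,temp-sensor": "temp_driver",
}


def probe(node, drivers, path=""):
    """Walk the tree and return (device path, driver) for each match."""
    full = f"{path}/{node['name']}"
    bound = []
    driver = drivers.get(node["compatible"])
    if driver is not None:
        bound.append((full, driver))
    for child in node.get("children", []):
        bound.extend(probe(child, drivers, full))
    return bound


if __name__ == "__main__":
    for device, driver in probe(BOARD, DRIVERS):
        print(device, "->", driver)
```

Supporting a different board in this model means supplying different data; the probe code stays the same.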
"going data-driven to
describe our platforms is the right thing to do". There is proof that it
works in the x86 world as "that's how it's been done for a long time".
6 A more detailed look at kernel regressions
Wysocki handled the regression tracking himself, but it is now a
three-person operation, with Maciej Rutecki turning email regression
reports into kernel bugzilla entries, and Florian Mickler maintaining
the regression entries: marking those that have been fixed, working with
the reporters to determine which have been fixed, and so on.
Regressions for a
particular kernel release are tracked through the following two
development cycles. For example, when 2.6.36 was released, the tracking
of 2.6.34 regressions ended. That doesn't mean that any remaining
regressions have magically been fixed, of course, and they can still be
tracked using the meta-bug associated with a release.
Reported and pending regressions:
Kernel | # reports | # pending |
---|---|---|
2.6.26 | 180 | 1 |
2.6.27 | 144 | 4 |
2.6.28 | 160 | 10 |
2.6.29 | 136 | 12 |
2.6.30 | 177 | 21 |
2.6.31 | 146 | 20 |
2.6.32 | 133 | 28 |
2.6.33 | 116 | 18 |
2.6.34 | 119 | 15 |
2.6.35 | 63 | 28 |
Total | 1374 | 157 |
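The totals in the table of reported and pending regressions can be checked, and the share of still-pending reports per release computed, with a few lines of Python. The numbers are copied from the table; the function names are made up for illustration.

```python
# (reports, pending) per kernel release, copied from the table.
REGRESSIONS = {
    "2.6.26": (180, 1),
    "2.6.27": (144, 4),
    "2.6.28": (160, 10),
    "2.6.29": (136, 12),
    "2.6.30": (177, 21),
    "2.6.31": (146, 20),
    "2.6.32": (133, 28),
    "2.6.33": (116, 18),
    "2.6.34": (119, 15),
    "2.6.35": (63, 28),
}


def totals(table):
    """Return (total reports, total pending) over all releases."""
    reports = sum(r for r, _ in table.values())
    pending = sum(p for _, p in table.values())
    return reports, pending


def pending_share(table):
    """Return the fraction of reports still pending for each release."""
    return {kernel: p / r for kernel, (r, p) in table.items()}


if __name__ == "__main__":
    print("totals:", totals(REGRESSIONS))
    for kernel, share in pending_share(REGRESSIONS).items():
        print(f"{kernel}: {share:.1%} pending")
```

The sums match the Total row of the table (1374 reports, 157 pending).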
The lifetime of regressions: the average for the earlier kernels is 24.4
days, while the later kernels have an average of 32.3 days.
Reason: for the later kernels, regressions fixed within a week during the -rc releases are not counted.
Regressions by subsystem: those subsystems that are "closer" to the hardware tend to have more regressions.
A jab at the designers of IPv6:
i have theorized in the past that the problem we face is that an
insufficient number of axe murderers are attending those kinds of
research meetings.
-- on IPv6
1 The 2010 Kernel Summit
A large collection of topics.
a
Checkpoint/restart allows the state of a set of processes to be saved
to persistent storage, then restarted at some future time, possibly on a
different system.
b Linux at NASDAQ
Unexpected overhead.
The consensus in the room was that the biggest piece of wakeup overhead
is saving and restoring the floating-point unit status. The exchange
doesn't do floating-point, of course, but the FPU covers a lot more than
basic number crunching anymore. In particular, SSE instructions are used
to implement memcpy() in glibc, and SSE use will force a save/restore.
Asynchronous network I/O remains high on his list; 10G Ethernet cards
are out there, and 40G is not that far away. That kind of interface can
generate data rates that are seriously difficult for the system to keep
up with. So they are looking at a number of techniques adopted by the
InfiniBand industry: separating control and data paths, bypassing the
kernel for data streams, etc. There is a lot of pressure to be able to
keep up with these data rates; the kernel will have to do something to
reduce network stack overheads and make it possible.
2 The second half of the 2.6.37 merge window
- It is now possible to build a generally useful kernel without the BKL
- The ext4 filesystem now supports "lazy inode table initialization"
- The file_system_type structure has a new mount() function which is meant to replace get_sb().