1 CPUS*PIDS = mess
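2048 CPUs * 16 processes on each CPU = 32K, the default limit on process IDs.
Once the default kernel threads are counted as well, a great many PIDs are needed.
There is no effective solution at present.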
2 Suspend block
An API from Android, intended to allow
the system to automatically suspend itself when nothing is going on,
and allow code to say "something is going on" at both the kernel and
user-space levels.
3 Kernel Hacker's Bookshelf: Generating Realistic Impressions for File-System Benchmarking
"This paper describes Impressions, a tool for generating realistic,
reproducible file system images which can serve as the base of new file
system benchmarks."
4 Might 2.6.35 be BKL-free?
5 The cpuidle subsystem
On your editor's laptop, there are three idle states with the following characteristics:
| | C1 | C2 | C3 |
|---|---|---|---|
| Exit latency (µs) | 1 | 1 | 57 |
| Power consumption (mW) | 1000 | 500 | 100 |
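Each CPU has different characteristics, so the cpuidle code abstracts them into a driver layer that is specific to a given architecture. The decision about which idle state to enter, however, is platform-independent and is handled by the cpuidle "governors".
Every processor has different idle-state characteristics and different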
actions are required to enter and leave those states. The cpuidle code
abstracts that complexity into a separate driver layer; the drivers
themselves are often found in architecture-specific or ACPI code. On the
other hand, the decision as to which idle state makes sense in a given
situation is very much a policy issue. The cpuidle "governors" interface
allows the implementation of different policies for different needs.
A cpuidle_driver must be registered first. Once the driver exists, though, it can register a cpuidle "device" for each CPU in the system.
Each cpuidle "device" may offer several states to choose from:
struct cpuidle_state {
    /* ... */
    unsigned int flags;
    unsigned int exit_latency;     /* in US */
    unsigned int power_usage;      /* in mW */
    unsigned int target_residency; /* in US */
    /* ... */
    int (*enter) (struct cpuidle_device *dev, struct cpuidle_state *state);
};
The enter member performs the state transition and is invoked on behalf of the governor.
A call to enter() is a request from the current governor to put the CPU
associated with dev into the given state. Note that enter() is free to
choose a different state if there is a good reason to do so, but it
should store the actual state used in the device's last_state field.
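The contract described above can be sketched in a few lines of C. This is a simplified userspace model, not kernel code: the structures are cut-down stand-ins for the kernel's, and the deep_states_blocked flag and the demo_enter() name are hypothetical, invented only to give the callback a reason to pick a different state.

```c
#include <assert.h>
#include <stddef.h>

struct cpuidle_device;

/* Cut-down stand-in for the kernel structure of the same name. */
struct cpuidle_state {
    unsigned int exit_latency;  /* in us */
    unsigned int power_usage;   /* in mW */
    int (*enter)(struct cpuidle_device *dev, struct cpuidle_state *state);
};

/* Cut-down stand-in; states[0] is assumed to be the shallowest state. */
struct cpuidle_device {
    struct cpuidle_state *states;
    int state_count;
    struct cpuidle_state *last_state;  /* state actually entered */
    int deep_states_blocked;           /* hypothetical flag, not a kernel field */
};

/*
 * Hypothetical enter() callback. It normally honors the requested state,
 * but falls back to the shallowest one when deep states are blocked, and
 * always records the state it really used in dev->last_state.
 */
int demo_enter(struct cpuidle_device *dev, struct cpuidle_state *state)
{
    struct cpuidle_state *actual = state;

    assert(dev != NULL && state != NULL && dev->state_count > 0);

    if (dev->deep_states_blocked)
        actual = &dev->states[0];

    dev->last_state = actual;
    /* A real driver would idle the processor here; this model does not. */
    return 0;
}
```

Filling a three-entry states array with the laptop figures from the table and calling states[2].enter(&dev, &states[2]) leaves dev.last_state pointing at states[2], or at states[0] once deep_states_blocked is set.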
When a CPU has nothing to do, the cpuidle governor is called to choose a state.
Governors implement the policy side of cpuidle. The kernel allows the
existence of multiple governors at any given time, though only one will
be in control of a given CPU at any time. Governor code begins by
filling in a cpuidle_governor structure:
struct cpuidle_governor {
    char name[CPUIDLE_NAME_LEN];
    unsigned int rating;
    int (*enable) (struct cpuidle_device *dev);
    void (*disable) (struct cpuidle_device *dev);
    int (*select) (struct cpuidle_device *dev);
    void (*reflect) (struct cpuidle_device *dev);
    struct module *owner;
    /* ... */
};
The select() function is called whenever the CPU has nothing to do and
wishes the governor to pick the optimal way of getting that nothing
done.
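As a rough illustration of what a select() policy might look like, here is a minimal sketch in plain C. It is not the code of any in-kernel governor: the function takes its inputs as explicit arguments instead of a struct cpuidle_device, and the demo_select() name, the expected-idle prediction and the latency limit are all assumptions made for the example. The idea is simply to pick the lowest-power state whose exit latency is tolerable and whose target residency fits within the expected idle period.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct cpuidle_state. */
struct idle_state {
    unsigned int exit_latency;      /* in us */
    unsigned int power_usage;       /* in mW */
    unsigned int target_residency;  /* in us */
};

/*
 * Hypothetical policy: return the index of the lowest-power state that
 * (a) can be left within latency_limit_us and
 * (b) is worth entering for an idle period of expected_idle_us.
 * State 0 is treated as always acceptable.
 */
int demo_select(const struct idle_state *states, int count,
                unsigned int expected_idle_us,
                unsigned int latency_limit_us)
{
    int best = 0;
    int i;

    assert(states != NULL && count > 0);

    for (i = 1; i < count; i++) {
        if (states[i].exit_latency > latency_limit_us)
            continue;  /* waking up would take too long */
        if (states[i].target_residency > expected_idle_us)
            continue;  /* idle period too short to pay off */
        if (states[i].power_usage < states[best].power_usage)
            best = i;
    }
    return best;
}
```

With the exit latencies and power figures from the laptop table, and made-up target residencies of 1, 20 and 200 µs, a long expected idle period with a relaxed latency limit selects C3, while a tight latency limit or a very short idle period keeps the CPU in C2 or C1.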
1 Ceph: The Distributed File System Creature from the Object Lagoon (Linux Mag)
Linux Magazine has an article on Ceph, the distributed file system which was merged for 2.6.34.
2 Fixing the ondemand governor
The "cpufreq" subsystem has three governors.
The "performance" governor prioritizes throughput above all else, while
the "powersave" governor tries to keep power consumption to a minimum. The most
commonly-used governor, though, is "ondemand," which attempts to perform
a balancing act between power usage and throughput.
How it works: lower the frequency when the system is idle, raise it otherwise.
ondemand works like this: every so often the governor wakes up and
looks at how busy the CPU is. If the idle time falls below a threshold,
the CPU frequency will be bumped up; if, instead, there is too much idle
time, the frequency will be reduced. By default, on a system with
high-resolution timers, the minimum idle percentage is 5%; CPU frequency
will be reduced if idle time goes above 15%. The minimum percentage can
be adjusted in sysfs (under /sys/devices/system/cpu/cpuN/cpufreq/); the
maximum is wired at 10% above the minimum.
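The threshold logic in that description can be written down as a small decision function. This is only a model of the rule as stated here, not the governor's actual code; the names ondemand_decide() and IDLE_BAND_PERCENT are invented for the example, and how far the frequency moves on each decision is left out.

```c
#include <assert.h>

enum freq_action { FREQ_RAISE, FREQ_KEEP, FREQ_LOWER };

/* The maximum idle percentage sits 10 points above the minimum. */
#define IDLE_BAND_PERCENT 10

/*
 * Model of the ondemand rule: too little idle time means the CPU is
 * busy and should speed up; too much means it can slow down.
 */
enum freq_action ondemand_decide(unsigned int idle_percent,
                                 unsigned int min_idle_percent)
{
    assert(idle_percent <= 100);
    assert(min_idle_percent <= 100 - IDLE_BAND_PERCENT);

    if (idle_percent < min_idle_percent)
        return FREQ_RAISE;
    if (idle_percent > min_idle_percent + IDLE_BAND_PERCENT)
        return FREQ_LOWER;
    return FREQ_KEEP;
}
```

With the default minimum of 5%, a CPU that was idle 2% of the time gets a higher frequency, one that was idle 40% of the time gets a lower one, and anything between 5% and 15% is left alone.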
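A shortcoming of the current ondemand governor: workloads that switch frequently between I/O-intensive and CPU-intensive phases cause frequent CPU frequency changes, which hurts performance. The interim fix: the accounting of "idle time" is changed so that
time spent waiting for disk I/O no longer counts.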
3 DM and MD come a little closer
What the Linux kernel has, instead, is three different RAID implementations:
in the multiple device (MD) subsystem, in the device mapper (DM) code,
and in the Btrfs filesystem. It has often been said that unifying these
implementations would be a good thing, but, thus far, it has not happened.
4 ELC: Using LTTng
A brief discussion of LTTng and its future.
5 When writeback goes wrong
A very interesting topic.
There are two distinct ways in which writeback is done in contemporary
kernels. A series of kernel threads handles writeback to specific block
devices, attempting to keep each device busy as much of the time as
possible. But writeback also happens in the form of "direct reclaim,"
and that, it seems, is where much of the trouble is. Direct reclaim
happens when the core memory allocator is short of memory; rather than
cause memory allocations to fail, the memory management subsystem will
go casting around for pages to free.
Direct reclaim can easily overflow the kernel stack. Dave's answer was a patch
which disables the use of writeback in direct reclaim. Instead, the
direct reclaim path must content itself with kicking off the flusher
threads and grabbing any clean pages which it may find. There is another
advantage to avoiding writeback in direct reclaim. The per-device
flusher threads can accumulate adjacent disk blocks and attempt to write
data in a way which minimizes seeks, thus maximizing I/O throughput.
Direct reclaim is also where lumpy reclaim
is done. The lumpy reclaim algorithm attempts to free pages in
physically-contiguous (in RAM) chunks, minimizing memory fragmentation
and increasing the reliability of larger allocations. There is,
unfortunately, a tradeoff to be made here: the nature of virtual memory
is such that pages which are physically contiguous in RAM are likely to
be widely dispersed on the backing storage device. So lumpy reclaim, by
its nature, is likely to create seeky I/O patterns, but skipping lumpy
reclaim increases the likelihood of higher-order allocation failures.
Well worth reading.
1 Idle cycle injection
Idle cycle injection is the forced idling of the CPU to avoid
overheating; essentially, it is Google's way of running processors to
the very edge of their capability without going past that edge and
allowing the smoke to escape.
Salman Qazi's recently posted kidled patch shows the current form of this work.
The core idea is simple: through some new control files under
/proc/sys/kernel/kidled, the system administrator can set, on a per-CPU
basis, the percentage of time that the CPU should be idle and an
interval over which that percentage is calculated. If the end of an
interval draws near and the CPU has not been naturally idle for the
requisite time, kidled will force the processor to go idle for a while.
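The arithmetic behind that rule is simple enough to sketch. The function below is a hypothetical model, not code from the kidled patch: given the interval length, the configured idle percentage and the time the CPU has already spent naturally idle, it returns how much forced idle time is still owed before the interval ends.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many microseconds of forced idling are still
 * needed in this interval to reach the configured idle percentage.
 */
unsigned long long forced_idle_us(unsigned long long interval_us,
                                  unsigned int idle_percent,
                                  unsigned long long natural_idle_us)
{
    unsigned long long required_us;

    assert(idle_percent <= 100);

    required_us = interval_us * idle_percent / 100;
    if (natural_idle_us >= required_us)
        return 0;  /* the CPU was idle enough on its own */
    return required_us - natural_idle_us;
}
```

For a 100 ms interval with a 20% idle target, a CPU that was naturally idle for 5 ms would be forced idle for another 15 ms, while one that was already idle for 25 ms would be left alone.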
2 ELC: Status of embedded Linux
Promising patches:
A kbuild CROSS_COMPILE option, which will make it easier to build for multiple architectures.
Arnd Bergmann's patches that are geared towards making it easier to add new
architectures to the kernel without propagating the bugs and quirks from
existing ones.
- Boot speed: the Moblin effort really kickstarted that work.
Several new kernel features are available to help reduce boot time, including some
which allow parts of device initialization to run in parallel.
There is also scripts/bootgraph.pl to help visualize where boot time is
being spent. One further technique was also noted as a way to decrease boot times.
- Kernel size: to help embedded developers make better use of limited memory, there is an effort
that was funded by CELF. Various compression methods have been added to
compress the kernel image in different ways. LZMA can be up to 30%
better than gzip, and LZO is not as good at compression, but is much
faster. The ramzswap device (also known as compcache) allows in-memory compressed swap.
This issue is quite readable.
1 A "live mode" for perf
Next up would seem to be "live mode",
where perf no longer requires two steps: record the data, then analyze.
Live mode will allow perf trace record and perf trace report to operate
via a pipe, which allows instantaneous, as well as continuously
updating (a la top), output.
2 Bootmem may be on its way out
Yinghai has a change
on the horizon for 2.6.35: replacing the early_res code with the
"logical memory block" allocator currently used by some other
architectures. That change looks even more disruptive than the bootmem
elimination was.
3 Memory management for virtualization
Three stopgap technologies: high memory, huge pages, and balloon drivers.
Like high memory and transparent huge pages, balloon drivers may eventually
be consigned to the pile of failed technologies. Until something better
comes along, though, we'll still need them.
4 Receive flow steering
Packets are steered to the CPU on which the application is running; the details need further study.
The receive packet steering (RPS) patches provide a way to steer packets to
particular CPUs based on a hash of the packet's protocol data. Those
patches were applied to the network subsystem tree and are bound for
2.6.35, but now Herbert is back with an enhancement to RPS that will
attempt to steer packets to the CPU on which the receiving application
is running: receive flow steering (RFS).
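The difference between the two approaches can be illustrated with a toy model in C. Everything here is an assumption made for the sake of the example: the hash function is a throwaway rather than the one the kernel uses, and the flow table, its size and the function names are invented. RPS picks a CPU purely from a hash of the packet's protocol data; the RFS idea is modelled as a table, indexed by the same hash, that remembers the CPU on which the application last ran and falls back to the RPS choice when nothing is recorded.

```c
#include <assert.h>
#include <stdint.h>

#define FLOW_TABLE_SIZE 256  /* hypothetical size, must be a power of two */
#define NO_CPU (-1)

struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Toy hash over the protocol data; not the kernel's hash. */
uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = k->saddr;

    h = h * 31u + k->daddr;
    h = h * 31u + (((uint32_t)k->sport << 16) | k->dport);
    return h;
}

/* RPS model: the CPU is a pure function of the packet hash. */
int rps_cpu(const struct flow_key *k, int nr_cpus)
{
    assert(nr_cpus > 0);
    return (int)(flow_hash(k) % (uint32_t)nr_cpus);
}

/* RFS model: remember where the receiving application ran. */
struct flow_table {
    int cpu[FLOW_TABLE_SIZE];
};

void flow_table_init(struct flow_table *t)
{
    int i;

    for (i = 0; i < FLOW_TABLE_SIZE; i++)
        t->cpu[i] = NO_CPU;
}

/* Called (in this model) when the application reads from the flow. */
void rfs_record(struct flow_table *t, const struct flow_key *k, int app_cpu)
{
    t->cpu[flow_hash(k) & (FLOW_TABLE_SIZE - 1)] = app_cpu;
}

/* Steer to the recorded CPU if there is one, else fall back to RPS. */
int rfs_cpu(const struct flow_table *t, const struct flow_key *k, int nr_cpus)
{
    int cpu = t->cpu[flow_hash(k) & (FLOW_TABLE_SIZE - 1)];

    return cpu != NO_CPU ? cpu : rps_cpu(k, nr_cpus);
}
```

Before any rfs_record() call, rfs_cpu() returns the same CPU as rps_cpu(); once the application has been seen on, say, CPU 3, packets for that flow are steered there regardless of what the hash alone would have chosen.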
5 The padata parallel execution mechanism
In short: padata is a mechanism by which the kernel can farm work out to
be done in parallel on multiple CPUs while retaining the ordering of
tasks. It was developed for use with the IPsec code,