Category: LINUX
2011-10-19 16:22:07
1
Wireless networking hacker Luis Rodriguez
has put together aimed
at developers writing and supporting wireless drivers.
2
Integrating it with perf/ftrace ran into strong opposition from Peter Zijlstra and Thomas Gleixner, so it will be hard to get into mainline.
3
int sendmmsg(int fd, struct mmsghdr *mmsg, unsigned int vlen, unsigned int flags);
It is the counterpart to recvmmsg(), allowing a process to send multiple messages with a single system call.
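As a rough illustration of the batching idea, the sketch below sends two datagrams with a single sendmmsg() call from user space. It assumes a Linux system whose C library exposes the sendmmsg() wrapper; the socket pair, the payloads, and the function name send_two_datagrams are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends two datagrams over a local socket pair with one sendmmsg() call.
 * Returns the number of messages the kernel accepted, or -1 on error. */
int send_two_datagrams(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    char first[] = "hello";
    char second[] = "world";
    struct iovec iov[2] = {
        { .iov_base = first,  .iov_len = sizeof(first) },
        { .iov_base = second, .iov_len = sizeof(second) },
    };

    /* One mmsghdr per message; each wraps an ordinary msghdr. */
    struct mmsghdr msgs[2];
    memset(msgs, 0, sizeof(msgs));
    msgs[0].msg_hdr.msg_iov = &iov[0];
    msgs[0].msg_hdr.msg_iovlen = 1;
    msgs[1].msg_hdr.msg_iov = &iov[1];
    msgs[1].msg_hdr.msg_iovlen = 1;

    int sent = sendmmsg(sv[0], msgs, 2, 0);
    assert(sent <= 2); /* never more than the vlen that was submitted */

    close(sv[0]);
    close(sv[1]);
    return sent;
}
```

A caller would compare the return value against the number of messages it submitted, since the call may accept only some of them.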
int strtobool(const char *s, bool *res);
Anything starting with one of [yY1] is considered to be true, while strings starting with one of [nN0] are false; anything else gets an -EINVAL error.
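The rule just described can be sketched in user space as follows. This is not the kernel's source, only a minimal re-implementation of the stated behavior under the name sketch_strtobool, which is an assumption for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space sketch of the rule described above (not the kernel source):
 * a first character of y, Y or 1 means true; n, N or 0 means false. */
int sketch_strtobool(const char *s, bool *res)
{
    assert(res != NULL);
    if (s == NULL)
        return -EINVAL;

    switch (s[0]) {
    case 'y': case 'Y': case '1':
        *res = true;
        return 0;
    case 'n': case 'N': case '0':
        *res = false;
        return 0;
    default:
        return -EINVAL; /* anything else is rejected */
    }
}
```

Only the first character is inspected, so "yes", "Y" and "1abc" would all be read as true.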
int kstrtol_from_user(const char __user *s, size_t count, unsigned int base, long *res);
These functions take care of safely copying the string from user space and performing the integer conversion.
void *bsearch(const void *key, const void *base, size_t num, size_t size, int (*cmp)(const void *key, const void *elt));
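The prototype above matches the shape of the C library's bsearch(), so its use can be tried out in user space. The sketch below uses the libc function, not the kernel one; cmp_int and find_sorted are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback: the key comes first, the array element second. */
static int cmp_int(const void *key, const void *elt)
{
    int k = *(const int *)key;
    int e = *(const int *)elt;
    return (k > e) - (k < e);
}

/* Returns the index of value in a sorted array, or -1 if it is absent. */
long find_sorted(const int *base, size_t num, int value)
{
    assert(base != NULL || num == 0);
    const int *hit = bsearch(&value, base, num, sizeof(*base), cmp_int);
    return hit ? (long)(hit - base) : -1;
}
```

The array must already be sorted in the order the comparison callback defines.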
4
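Despite the objections raised last week, this work goes a step further in preventing kernel address leaks. From Dan's work:
The default values of the kptr_restrict and dmesg_restrict sysctls are set to
(1) when this is enabled, since hiding kernel pointers is necessary to preserve
the secrecy of the randomized base address.
Its prospects are unclear; personally, I think it is worth doing.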
5
Andi Kleen may have been the first to question this optimization when he last
September. Testing by Linus and Ingo showed that Andi was right.
the prefetch() calls have been removed from linked list, hlist, and sk_buff
list traversal operations - just like Andi Kleen tried to do in September.
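On modern processors, hardware optimization often does a better job than software optimization. Linked-list and hash-table traversal, for example, frequently prefetch a NULL pointer, which is pointless overhead (about 20 cycles). Prefetching does not necessarily come out ahead in the other cases either.
The conclusion is: prefetches are absolutely toxic, even if the NULL ones are excluded.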
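To make the pattern concrete, here is a user-space sketch of a list walk with and without an explicit prefetch of the next node. It is not the kernel's list code; the node type and function names are assumptions, and __builtin_prefetch() is the GCC/Clang builtin standing in for the kernel's prefetch().

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Plain traversal: the hardware prefetcher is left to do its job. */
long sum_plain(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Traversal with a software prefetch hint for the next node.
 * On the last node n->next is NULL, so the hint is wasted there. */
long sum_with_prefetch(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        __builtin_prefetch(n->next);
        sum += n->value;
    }
    return sum;
}

/* Sanity check: the hint may change timing, never the result. */
void check_variants_agree(const struct node *head)
{
    assert(sum_plain(head) == sum_with_prefetch(head));
}
```

In a short list, such as a typical hash bucket, the wasted NULL hint is a large share of all the hints issued, which is one way to read the cost described above.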
1 launches another new filesystem HAMMER2
2
For security reasons, a patch
was applied to censor kernel addresses appearing in /proc/kallsyms and
/proc/modules. The addresses are shown as zeros.
Anybody wanting pointer hiding should turn it on by setting the kernel.kptr_restrict knob to 1.
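A tool could check whether it is seeing censored output by testing for an all-zero address. The sketch below assumes the usual "address type name" layout of a /proc/kallsyms line; the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if a /proc/kallsyms-style line ("<hex address> <type> <name>")
 * carries an all-zero address, 0 if it shows a real one, -1 if unparsable. */
int kallsyms_line_is_censored(const char *line)
{
    unsigned long long addr;
    char type;
    char name[128];

    assert(line != NULL);
    if (sscanf(line, "%llx %c %127s", &addr, &type, name) != 3)
        return -1;
    return addr == 0;
}
```

A real tool would read the file line by line and treat a fully zeroed listing as a sign that pointer hiding is active for the current user.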
3 (important)
Current state:
Current kernels actually maintain five LRU lists. There are separate active and inactive lists for file-backed pages. There are separate active and inactive lists for anonymous pages - reclaim policy for those pages is different, and, if the system is running without swap, they may not be reclaimable at all. There is also a list for pages which are known not to be reclaimable - pages which have been locked into memory, for example. Oh, and it's only fair to say that one set of those lists exists for each memory zone.
What the memory control group implementation adds:
The memory controller adds a new page_cgroup structure for each page. When memory control groups are active, there is another complete set of LRU lists maintained for each group. The list_head structures needed to maintain these lists are kept in the page_cgroup structure.
Resulting problems: Global reclaim uses the global LRU as always, so it operates in complete ignorance of control groups. It will reclaim pages regardless of whether those pages belong to groups which are over their limits or not. Per-control-group reclaim, instead, can only work with one group at a time; as a result, it tends to hammer certain groups while leaving others untouched. The multiple LRU lists are not just complex, they are also expensive. A list_head structure is 16 bytes on a 64-bit system. If that system has 4GB of memory and 4KB pages, it has roughly 1,000,000 pages, so about 16 million bytes are dedicated just to the infrastructure for the per-group LRU lists.
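The memory cost quoted above can be reproduced with a few lines of arithmetic. The sketch assumes 4KB pages and one two-pointer list_head per page; the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Two pointers (next, prev), the same shape as the kernel's list_head. */
struct list_head_sketch {
    void *next;
    void *prev;
};

/* Bytes spent on one list_head per page for mem_bytes of RAM. */
uint64_t per_group_lru_overhead(uint64_t mem_bytes, uint64_t page_size,
                                uint64_t list_head_size)
{
    assert(page_size != 0);
    return (mem_bytes / page_size) * list_head_size;
}
```

With 4GB of memory this gives 1,048,576 pages and 16,777,216 bytes (16 MiB), which matches the rounded figures in the text.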
The latest patches:
The from Johannes Weiner represent an attempt to create that improvement by better integrating the memory controller with the rest of the virtual memory subsystem. At the core of this work is the elimination of the duplicated LRU lists. In particular, with this patch set, the global LRU no longer exists - all pages exist on exactly one per-group LRU list. Pages which have not been charged to a specific control group go onto the LRU list for the "root" group at the top of the hierarchy. In essence, per-group reclaim takes over the older global reclaim code; even a system with control groups disabled is treated like a system with exactly one control group containing all running processes.
Algorithms for memory reclaim necessarily change in this environment. The core algorithm now performs a depth-first traversal through the control group hierarchy, trying to reclaim some pages from each. There is no global aging of pages; each group has its oldest pages considered for reclaim regardless of what's happening in the other groups. Each group's hard and soft limits are considered, of course, when setting reclaim targets. The end result is that global reclaim naturally spreads the pain across all control groups, implementing each group's policy in the process. The implementation of control group soft limits has been integrated with this mechanism, so now soft limit enforcement is spread more fairly across all control groups in the system.
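The traversal can be pictured with a toy model. The sketch below is not the patch set's code: it walks a hypothetical group tree depth-first and takes up to a fixed number of pages from each group's own LRU, ignoring hard and soft limits, which the real algorithm does consider.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for a memory control group. */
struct group {
    unsigned long lru_pages;              /* pages on this group's own LRU */
    struct group *children[MAX_CHILDREN];
    size_t nr_children;
};

/* Visit the hierarchy depth-first, taking up to per_group pages from every
 * group's own LRU. Returns the total number of pages reclaimed. */
unsigned long reclaim_hierarchy(struct group *g, unsigned long per_group)
{
    assert(g != NULL);
    unsigned long taken = g->lru_pages < per_group ? g->lru_pages : per_group;
    g->lru_pages -= taken;

    for (size_t i = 0; i < g->nr_children; i++)
        taken += reclaim_hierarchy(g->children[i], per_group);
    return taken;
}
```

Every group gives up something, so no single group absorbs the whole reclaim pass; a group with few pages simply contributes less.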
Johannes's patch improves the situation while shrinking the code by over 400 lines; it also gets rid of the memory cost of the duplicated LRU lists.
4
A high-level overview that does not go into details.
Code organization
The situation for drivers:
“Not in the arch/arm directory!” Drivers should move to the appropriate
subdirectory of the top-level drivers tree.
The following three have become generic code and moved out of the ARM tree:
But what about non-driver code? Where should it live? It is helpful to look at several examples: (1) the struct clk code that Jeremy Kerr, Russell King, Thomas Gleixner, and many others have been working on, (2) the device-tree code that Grant Likely has been leading up, and (3) the generic interrupt chip implementation that Thomas Gleixner has been working on.
The struct clk code is motivated by the fact that many SoCs and boards have elaborate clock trees. These trees are needed, among other things, to allow the tradeoff between performance and energy efficiency to be set as needed for individual devices on that SoC or board. The struct clk code allows these trees to be represented with a common format while providing plugins to accommodate behavior specific to a given SoC or board. The has a similar role, but with respect to interrupt distribution rather than clock trees.
are intended to allow the hardware
configuration of a board to be represented via data rather than code, which
should ease the task of creating a single Linux kernel binary that boots on a
variety of ARM boards.
The struct clk code is already used by both the ARM and SH CPU architectures,
so it is not ARM-specific, but rather core Linux kernel code. Similarly, the
device-tree code is not ARM-specific; it is also used by the PowerPC,
Microblaze, and SPARC architectures, and even by . Device tree therefore is also Linux core kernel code.
The virtual-interrupt code goes even further, being common across all CPU
architectures.
ARM-specific code
There will of course need to be at least some ARM-specific code, but the end
goal is for that code to be limited to ARM core architecture code and ARM SoC
core architecture code. Furthermore, the ARM SoC core architecture code should
consist primarily of small plugins for core-Linux-kernel frameworks, which
should in turn greatly ease the development and maintenance of new ARM boards
and SoCs.
Git trees
There are currently two trees:
Nicolas's existing git tree is an that allows people to easily pull the latest and greatest ARM code
against the most recent mainline kernel version. In contrast, a contains patches that are to be
upstreamed, normally based on a more-recent mainline release candidate.
5
Concept:
the platform problem comes about when developers view the platform they are developing for as fixed and immutable. These developers feel that the component they are working on specifically (a device driver, say) is the only part that they have any control over. If the kernel somehow makes their job harder, the only alternatives are to avoid or work around it.
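With free software, we can change the modules that get in our way instead of adapting to them. For driver developers, though, this is hard to do in practice.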
reach across arbitrary module boundaries and fix problems encountered in other parts of the system. We don't have to put up with bugs or inadequate features in the code we use; we can make it work properly instead.
1
(formerly LinuxBIOS) is a free BIOS
implementation; AMD will support it on a range of its processors.
2
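The ABI debate again: a field that is no longer useful cannot be removed, because removing it would break powertop.
Steven Rostedt argues that tools should locate fields through metadata such as tracing/events/sched/sched_switch/format, rather than being dependent on the
binary format of the trace data exported from the kernel.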
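The metadata approach can be sketched as follows: rather than hard-coding where a field sits in the binary record, a tool looks the field up by name in the format description. The parser below assumes format lines of the shape "field:<declaration>; offset:<n>; size:<n>; ..."; the function name is hypothetical and the sample values used with it are illustrative only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses one "field:" line of a tracing format description and, if it
 * describes the field called name, stores that field's offset and size.
 * Returns 0 on a match, -1 otherwise. Array fields such as "comm[16]"
 * are not handled by this sketch. */
int lookup_field(const char *line, const char *name, int *offset, int *size)
{
    char decl[128];
    int off, sz;

    assert(line != NULL && name != NULL && offset != NULL && size != NULL);
    const char *p = strstr(line, "field:");
    if (p == NULL)
        return -1;
    if (sscanf(p, "field:%127[^;]; offset:%d; size:%d;", decl, &off, &sz) != 3)
        return -1;

    /* The field name is the last word of the declaration. */
    const char *last = strrchr(decl, ' ');
    last = last ? last + 1 : decl;
    if (strcmp(last, name) != 0)
        return -1;

    *offset = off;
    *size = sz;
    return 0;
}
```

A tool written this way keeps working when a field moves or an unused field is dropped, as long as the fields it asks for by name still exist.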
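Steven Rostedt and Ingo got into a debate: Ingo thinks ftrace should be dropped in favor of perf alone, while Steven insists on keeping it. Ingo is good at harvesting the fruit of other people's work.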
3 2.6.39 development
statistics
There have been just over 10,000 non-merge changesets merged for 2.6.39; with
the sole exception of 2.6.37 (11,446 changesets), that's the highest since
2.6.33. Those changes came from 1,236 developers; only 2.6.37 (with 1,276
developers) has ever exceeded that number. Those developers added 670,000 lines
of code while deleting 346,000 lines, for a net growth of 324,000 lines.
4 (important)
Background:
The writeback code, when it gets around to that page, will mark the page
read-only, set the "under writeback" page flag, and queue the I/O
operation. The write-protection of the page is not there to prevent changes to
the page; its purpose is to detect further writes which would require that
another writeback be done. Current kernels will, in most situations, allow a
process to modify a page while the writeback operation is in progress.
In the worst case, the second write to the page will happen before the first
writeback I/O operation begins; in that case, the more recently written data
will also be written to disk in the first I/O operation and a second, redundant
disk write will be queued later. Either way, the data gets to its backing store,
which is the real intent.
How the bug arises: Some
devices can perform , meaning that the data written to disk is checksummed by the
hardware and compared against a pre-write checksum provided by the kernel. If
the data changes after the kernel calculates its checksum, that check will
fail, causing a spurious write error. RAID has a similar problem.
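The failure can be modeled in a few lines of user-space code. This is only a toy: the additive checksum stands in for whatever the hardware really computes, and the function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simple additive checksum standing in for the device's integrity check;
 * real hardware uses a stronger function. */
uint32_t sketch_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Models one writeback: the kernel checksums the page, the page may be
 * modified while the I/O is in flight, then the device verifies the data.
 * Returns 1 if the device accepts the write, 0 on a checksum mismatch. */
int writeback_verifies(uint8_t *page, size_t len, int modify_in_flight)
{
    assert(page != NULL);
    uint32_t before = sketch_checksum(page, len);  /* computed by the kernel */
    if (modify_in_flight && len > 0)
        page[0] ^= 0xff;                           /* a late write to the page */
    return sketch_checksum(page, len) == before;   /* checked by the device */
}
```

Nothing is wrong with the data in the failing case; the write is rejected only because the page changed between the two checksum computations.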
Solution:
Implementation details:
a page will be marked read-only when it is
written back; there is also a page flag which indicates that writeback is in
progress. So all of the pieces are there to trap writes to pages under
writeback. To make it even easier, the VFS layer already has a callback
(page_mkwrite()) to notify filesystems that a read-only page is being made
writable;
1
Limiting dcache usage per control group; not yet in mainline.
Pavel Emelyanov's are a first attempt at limiting dentry use. This patch works by organizing dentries into "mobs," being groups of dentries all of which represent names in a specific subtree of the filesystem.
2
Background:
The performance monitoring unit (PMU) is normally associated with the CPU; each processor has its own PMU for monitoring its own specific events. Some newer processors (such as Intel's Nehalem series) also provide a PMU which is not tied to any CPU; in the Nehalem case it's part of the "uncore" which handles memory access at the package level.
Question: should raw events like the following (magic, CPU- and model-specific incantations) be supported?
perf stat -e r1b7:20ff -a sleep 1
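Conclusion: raw events will continue to be supported. “The kernel has no business telling users which perf events are interesting, or limiting them!” Whether "generalized events" should be implemented in user space or in the kernel is an interesting question.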
3
Sandboxing processes such that they cannot make "dangerous" system calls is an attractive feature that has already been implemented in a limited way for Linux with seccomp. Two years ago, we to allow more fine-grained control over which system calls would be allowed.
Seccomp (from "secure computing") is enabled via a prctl() call and, once enabled, restricts the process from making any further system calls beyond read(), write(), exit(), and sigreturn()—any other system call will abort the process.
Progress: a from Will Drewry, using to make the interface even more flexible by allowing filters to be applied to the system call arguments. Essentially, that would make for three choices for each system call: enabled, disabled, or filtered.
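The three-way choice can be modeled with a small rule table. The sketch below is a plain user-space illustration, not the proposed kernel interface: the rule structure, the predicate on the first argument, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum policy { SYSCALL_DISABLED = 0, SYSCALL_ENABLED, SYSCALL_FILTERED };

/* Hypothetical rule: a policy plus an optional check on the first argument. */
struct rule {
    int nr;                       /* system call number */
    enum policy policy;
    int (*arg_ok)(long arg0);     /* consulted only when FILTERED */
};

/* Returns 1 if the call is allowed under the rules, 0 otherwise.
 * Calls that have no rule are treated as disabled. */
int call_allowed(const struct rule *rules, size_t n, int nr, long arg0)
{
    assert(rules != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (rules[i].nr != nr)
            continue;
        if (rules[i].policy == SYSCALL_ENABLED)
            return 1;
        if (rules[i].policy == SYSCALL_FILTERED)
            return rules[i].arg_ok != NULL && rules[i].arg_ok(arg0);
        return 0;
    }
    return 0;
}

/* Example predicate: allow only file descriptors 0, 1 and 2. */
int only_std_fds(long fd)
{
    return fd >= 0 && fd <= 2;
}
```

Treating unlisted calls as disabled mirrors the deny-by-default behavior of the existing seccomp mode described above.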
4
The Clang static analyzer is powerful and can help find some bugs.
The vast majority of GCC extensions are supported by Clang. There are some GCC extensions that aren't implemented, however, including explicit register variables. A bigger problem is that Clang lacks support for variable-length arrays in structures (VLAIS).
Code generation and optimization problems
There are several code generation and optimization options for GCC that aren't supported by Clang. One of those is -mregparm that governs the number of registers used to pass integer arguments.
Also, -fcall-saved-reg is not supported by Clang, and that affects the uses of the ALTERNATIVE() macro in the kernel.
Another problem is with -pg, which enables instrumentation code for function calls in GCC, and is used when building Ftrace.
The final problem that Lelbach mentioned is the -fno-optimize-sibling-calls flag that is not supported by Clang. The flag disables tail call elimination, and the kernel introspection code (like Ftrace) assumes specific stack depths in various places.