lwn.net kernel news 2012/1 - baozhao - ChinaUnix Blog


Category: LINUX

2012-03-25 10:41:19

1 A /proc/PID/mem vulnerability

The main point is, once again, dissatisfaction with Linus's "silent security fixes".

Cause: opening /proc/pid/mem skips the usual VFS permission checks; the check is made at write() time instead.

How it is exploited: “So, we can open a fd to /proc/self/mem, lseek to the right place in memory for writing (more on that later), use dup2 to couple together stderr and the mem fd, and then exec to su $shellcode to write an shell spawner to the process memory, and then we have root. Really? Not so easy.” Detailed analysis: http://blog.zx2c4.com/749
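As a harmless illustration of the first two steps in the quote (opening /proc/self/mem and seeking to an address), the minimal Python sketch below reads a buffer back out of the process's own memory. It is not the exploit. The helper name read_own_memory and the buffer contents are invented for this example, and it assumes a Linux system where a process is allowed to read its own /proc/self/mem.

```python
import ctypes
import os


def read_own_memory(address: int, length: int) -> bytes:
    """Read `length` bytes at virtual address `address` of this process."""
    fd = os.open("/proc/self/mem", os.O_RDONLY)
    try:
        # The file offset of /proc/PID/mem is the virtual address.
        os.lseek(fd, address, os.SEEK_SET)
        return os.read(fd, length)
    finally:
        os.close(fd)


# A buffer whose address we know, so there is something safe to read back.
buf = ctypes.create_string_buffer(b"hello from my own address space")
data = read_own_memory(ctypes.addressof(buf), len(buf.value))
print(data)
```

The point of the sketch is only that the file offset doubles as a memory address; the vulnerability concerned who was allowed to write through such a descriptor after an exec.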

An allocator intended to make up for the shortcomings of slab; its prospects are unclear.

zsmalloc was designed to fulfill the needs of users where:

1) Memory is constrained, preventing contiguous page allocations larger than order 0 and

2) Allocations are all/commonly greater than half a page.

In a generic allocator, an allocation set like this would cause high fragmentation. The allocations can't span non-contiguous page boundaries; therefore, the part of the page unused by each allocation is wasted.

zsmalloc is a slab-based allocator that uses a non-standard malloc interface, requiring the user to map the allocation before accessing it. This allows allocations to span two non-contiguous pages using virtual memory mapping, greatly reducing fragmentation in the memory pool.
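The fragmentation argument above can be checked with a little arithmetic. The following is a simplified Python sketch, not zsmalloc's actual algorithm (which groups objects into size classes): it only compares how many pages are needed when objects may not cross a page boundary with how many are needed when they may. The 4096-byte page size, the function names, and the example object size are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_without_spanning(n_objects: int, obj_size: int,
                           page_size: int = PAGE_SIZE) -> int:
    """Pages needed when an object may not cross a page boundary."""
    per_page = page_size // obj_size
    if per_page == 0:
        raise ValueError("object does not fit in a single page")
    return -(-n_objects // per_page)  # ceiling division


def pages_with_spanning(n_objects: int, obj_size: int,
                        page_size: int = PAGE_SIZE) -> int:
    """Pages needed when objects are packed back to back across pages."""
    return -(-(n_objects * obj_size) // page_size)  # ceiling division


# 100 objects of 2560 bytes each, i.e. larger than half a page.
bound = pages_without_spanning(100, 2560)
packed = pages_with_spanning(100, 2560)
print(bound, packed)  # 100 63
```

With objects just over half a page, the page-bound layout fits one object per page and wastes the remainder, which is the case zsmalloc was designed for.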

XFS's past weakness: metadata writes were slow (for example, when unpacking a tarball).

Solutions:

• Delay the journal updates and combine changes to the same block into a single entry. Those wanting details on how it works should find more than they ever wanted in the kernel documentation tree.

• The log space reservation fast path is a very hot path in XFS; it is now lockless.

• The asynchronous metadata writeback code was creating badly scattered I/O, reducing performance considerably. Now metadata writeback is delayed and sorted prior to writing out. That means that the filesystem is, in Dave's words, doing the I/O scheduler's work.

• "Active log items" are a mechanism that improves the performance of the (large) sorted log item list by accumulating changes and applying them in batches.

• Metadata caching has also been moved out of the page cache, which had a tendency to reclaim pages at inopportune times. And so on.
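Two of the ideas in the list above, combining changes to the same block into one journal entry and sorting metadata writeback before issuing it, can be modelled in a few lines. This is a toy Python sketch under simplifying assumptions (a change is just a block number plus a payload, and a later change replaces an earlier one); the function names are invented and it does not reflect the real XFS data structures.

```python
def coalesce_journal(updates):
    """Keep a single pending entry per block: the most recent change."""
    latest = {}
    for block, data in updates:
        latest[block] = data  # a later change to the same block replaces the earlier one
    return latest


def writeback_order(dirty_blocks):
    """Issue writeback in ascending block order instead of arrival order."""
    return sorted(dirty_blocks)


updates = [(7, "a"), (3, "b"), (7, "c"), (9, "d"), (3, "e")]
journal = coalesce_journal(updates)
print(len(updates), "changes ->", len(journal), "journal entries")
print(writeback_order(journal))  # [3, 7, 9]
```

Five changes collapse into three entries here, and the writeback list comes out sorted, which is the kind of work the notes describe the filesystem taking over from the I/O scheduler.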

XFS's scalability is currently better than that of btrfs and ext4. Ext4, Dave said, is suffering from architectural deficiencies - using bitmaps for space tracking, in particular - that are typical of an 80's era filesystem. btrfs still has room for improvement.

XFS's future plans: reliability, including the ability to run a filesystem check and repair tool online. "Metadata validation" means making the metadata self-describing to protect the filesystem against writes that are misdirected by the storage layer. This requires changing the XFS on-disk format.

• The kernel has gained the ability to verify RSA digital signatures. The extended verification module (EVM) makes use of this capability.

• The slab allocator supports a new slab_max_order= boot parameter controlling the maximum size of a slab. Setting it to a larger number may increase memory efficiency at the cost of increasing the probability of allocation failures.

• The ALSA core has gained support for compressed audio on devices that are able to handle it.

• There have been some significant changes made to the memory compaction code to avoid the lengthy stalls experienced by some users when writing data to slow devices (USB keys, for example). This problem was described in [link missing], but the solution has evolved considerably. By making a number of changes to how compaction works, the memory management hackers (and Mel Gorman in particular) were able to avoid disabling synchronous compaction, which had the unfortunate effect of reducing huge page usage. See [link missing] for a lot of information on how this problem was addressed.

• There is a new "charger manager" subsystem intended for use with batteries that must be monitored occasionally, even when the system is suspended. The charger manager can partially resume the system as needed to poll the battery, then immediately re-suspend afterward. See Documentation/power/charger-manager.txt for more information.

• The [link missing] have been merged. These patches eliminate the double-tracking of memory and, thus, substantially reduce the overhead associated with the memory controller.

• The framebuffer device subsystem has a new FOURCC-based configuration API; see Documentation/fb/api.txt for details.

A combination of the two techniques may emerge, but the prospects are unclear.

no_new_privs is not intended to be a sandbox at all -- it's a way to make it safe for a task to manipulate itself in a way that would allow it to subvert its own children (or itself after execve). So ptrace isn't a problem at all -- PR_SET_NO_NEW_PRIVS + chroot + ptrace is exactly as unsafe as ptrace without PR_SET_NO_NEW_PRIVS. Neither one allows privilege escalation beyond what you started with.

If you want a sandbox, call PR_SET_NO_NEW_PRIVS, then enable seccomp (or whatever) to disable ptrace, evil file access, connections on unix sockets that authenticate via uid, etc.
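The first step of that recipe, setting PR_SET_NO_NEW_PRIVS, might look like this from user space. This is a minimal Python sketch that assumes a Linux kernel which already supports the flag (it was still a proposal when these notes were written) and a libc that exposes prctl(). The constant values 38 and 39 come from <linux/prctl.h>; the helper name no_new_privs_in_child is made up for this example. The flag is set in a forked child so that the calling process is left untouched, and the seccomp step is not shown.

```python
import ctypes
import os

PR_SET_NO_NEW_PRIVS = 38  # values from <linux/prctl.h>
PR_GET_NO_NEW_PRIVS = 39


def no_new_privs_in_child() -> int:
    """Fork, set no_new_privs in the child, and return the flag value it reports.

    Returns 255 if the child could not set or read the flag.
    """
    pid = os.fork()
    if pid == 0:
        status = 255
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0:
                status = libc.prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0)
        finally:
            # Leave the child immediately, without running parent cleanup code.
            os._exit(status & 0xFF)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)


print(no_new_privs_in_child())
```

On a kernel that supports the flag the helper should report 1. Once set, the flag cannot be cleared and is inherited across fork and execve, which is why the sketch confines it to a short-lived child.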

3 The future calculus of memory management

The article raises the question of how memory management could be implemented better. But this reminds me of differential calculus, where "dy" is performance and "dx" is RAM size. At every point in time, increasing dx past a certain size will have no corresponding increase in dy. Perhaps this suggests control theory more than calculus but the needed result is a true dynamic representation of "working set" size. Third, there is some cost for moving capacity efficiently; this cost (and impact on performance) must be somehow measured and taken into account as well.

But, in my opinion, this "calculus" is the future of memory management.
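The dy/dx picture can be made concrete with finite differences. The sketch below is only an illustration of the idea, not anything proposed in the article: given hypothetical (RAM size, performance) measurements, it returns the first RAM size at which the next increment buys less than a chosen threshold of performance per megabyte. The function name, the sample numbers, and the threshold are all invented.

```python
def working_set_size(samples, min_gain_per_mb):
    """Return the first RAM size after which adding RAM gains too little.

    `samples` is a list of (ram_mb, performance) pairs sorted by ram_mb.
    """
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        slope = (y1 - y0) / (x1 - x0)  # finite-difference estimate of dy/dx
        if slope < min_gain_per_mb:
            return x0
    return samples[-1][0]  # performance was still improving at the largest size


# Invented measurements: performance flattens out around 1 GB.
samples = [(256, 100), (512, 180), (1024, 220), (2048, 222), (4096, 222)]
print(working_set_size(samples, min_gain_per_mb=0.01))  # 1024
```

A real system would have to estimate this continuously and also account for the cost of moving capacity around, as the passage notes.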

1 No more system devices

The special device class for "system devices" in the Linux device model has been removed. All in-tree system device drivers have been fixed up to use regular devices instead. The process is relatively simple; it can be seen in, for example, this commit updating kernel/time/clocksource.c. In short, the embedded struct sys_device becomes a simple struct device instead. Attributes defined with SYSDEV_ATTR() are switched to DEVICE_ATTR(). The sysdev_class structure is turned into a nearly empty bus_type structure instead. That is about all that is required.

An old topic: the hope is to use such a mechanism as part of a sandboxing solution that would allow (for example) a web browser to run third-party code in a safer manner.

The new approach: rather than use the ftrace filter mechanism, he has repurposed the networking layer's packet filtering mechanism (BPF).

• The "team" network driver - a lightweight mechanism for bonding multiple interfaces together - has been merged. The [link missing] has the user-space code needed to operate this device.

• The network priority control group controller has been added. This controller allows the administrator to specify the priority with which members of each control group have access to the network interfaces available on the system. See [link missing] from the documentation directory for more information.

• Also added is the [link missing] which can be used to place limits on the amount of kernel memory used to hold TCP buffers.

• The [link missing] infrastructure has been added, enabling control over how much data can be queued for transmission over a network interface at any time.

• The virtual network switch has been merged.

• The ARM architecture has gained support for the "large physical address extension," allowing 32-bit processors to address more than 4GB of installed memory.

• The "adaptive RED" queue management algorithm is now supported by the networking layer.

• The beginnings of support have been added to the wireless networking subsystem.

• The ext4 filesystem has added support for online resizing via the EXT4_IOC_RESIZE_FS ioctl() command. This operation does not (yet) work with filesystems using the "bigalloc" or "meta_bg" features.

• The /proc filesystem has a new subdirectory for each process called map_files; it contains a symbolic link describing every file-backed mapping used by the relevant process. This feature is one of many needed to support the desired checkpoint/restart feature.

• /proc also supports a couple of new mount options. When mounted with hidepid=1, /proc will deny access to any process directories not owned by the requesting process. With hidepid=2, even the existence of other processes will be hidden. The default (hidepid=0) behavior is unchanged. The other new option (gid=N) provides an ID for a group that is allowed to access information for all processes regardless of the hidepid= setting.

• Quite a few VFS interfaces have been changed to use the umode_t type for file mode bits.

• Also in the VFS: most of the members of struct vfsmount have been moved elsewhere (to a containing struct mount) and hidden from filesystem code. A number of callbacks in struct super_operations (specifically: show_stats(), show_devname(), show_path() and show_options()) now take a pointer to struct dentry instead of struct vfsmount.
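The hidepid= and gid= rules described in the /proc item of the list above can be summarized as a small decision function. This Python model is a sketch of the behavior as stated in these notes only; it deliberately ignores root and capability checks, and the function name and the three result labels are invented for illustration.

```python
def proc_visibility(hidepid, viewer_uid, owner_uid, viewer_gids=(), exempt_gid=None):
    """Model what a viewer sees of another process's /proc/PID directory.

    Returns "full" (normal access), "listed" (the directory is visible but
    access is denied) or "hidden" (the directory does not appear at all).
    """
    if hidepid not in (0, 1, 2):
        raise ValueError("hidepid must be 0, 1 or 2")
    if hidepid == 0:
        return "full"  # default behavior, unchanged
    if viewer_uid == owner_uid:
        return "full"  # a user's own processes are never restricted
    if exempt_gid is not None and exempt_gid in viewer_gids:
        return "full"  # member of the gid=N group
    return "listed" if hidepid == 1 else "hidden"


print(proc_visibility(1, viewer_uid=1000, owner_uid=1001))  # listed
print(proc_visibility(2, viewer_uid=1000, owner_uid=1001))  # hidden
print(proc_visibility(2, 1000, 1001, viewer_gids=(4,), exempt_gid=4))  # full
```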

The power-related parts of the scheduler are overly complex and need to be cleaned up, but there is no consensus on how to go about it.

The scheduler exports a couple of tuning knobs under /sys/devices/system/cpu. The first, called sched_mc_power_savings, has three possible settings:

0: The scheduler will not consider power usage when distributing tasks; instead, tasks will be distributed across the system for maximum performance. This is the default value.
1: One core will be filled with tasks before tasks will be moved to other cores. The idea is to concentrate the running tasks on a relatively small number of cores, allowing the others to remain idle.
2: Like (1), but with the additional tweak that newly awakened tasks will be directed toward "semi-idle" cores rather than started on an idle core.

There is another knob, sched_smt_power_savings, that takes the same set of values, but applies the results to the threads of symmetric multithreading (SMT) processors instead.

The real problem seems to be the control knobs. The two knobs provide similar behavioral controls at two levels of the hierarchy. But, with three possible values for each, the result is nine different modes that the scheduler can run in. That seems like too much complexity for a situation where the real choice comes down to "run as fast as possible," or "use as little power as possible."
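The count of nine modes is simply the product of the two knobs' value ranges, which a short enumeration makes explicit. The snippet assumes the settings are numbered 0 through 2 for both knobs.

```python
from itertools import product

SETTINGS = (0, 1, 2)  # the three possible values of each knob

# Every (sched_mc_power_savings, sched_smt_power_savings) combination.
modes = list(product(SETTINGS, repeat=2))

for mc, smt in modes:
    print(f"sched_mc_power_savings={mc} sched_smt_power_savings={smt}")
print(len(modes))  # 9
```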

The core idea remains the same, though: this mechanism allows DMA buffers to be shared between drivers that might otherwise be unaware of each other. The initial target use is sharing buffers between producers and consumers of video streams; a camera device, for example, could acquire a stream of frames into a series of buffers that are shared with the graphics adapter, enabling the capture and display of the data with no copying in the kernel.

In the 3.3 sharing scheme, one driver will set itself up as an exporter of sharable buffers, using an anonymous file to represent each buffer. Sharing a buffer then requires obtaining a file descriptor for it and making that descriptor available to user space. On the other side, a driver wishing to share a DMA buffer has to go through a series of calls after obtaining the corresponding file descriptor.

Characteristics of SSD devices: the schedulers currently in use in Linux were designed with rotating storage in mind, with the result that they are concerned with avoiding disk seeks and tracking the number of bytes transferred. With solid-state devices, though, I/O locality is (nearly) irrelevant and the number of I/O operations performed is considered to be a better measurement of the amount of device capacity used.

Shaohua Li has taken a new approach with the posting of [link missing] that is optimized for solid-state devices. The patch set factors out and generalizes the CFQ code that tracks device usage, but then uses that code to implement a different scheduling algorithm. Avoiding seeks is no longer a concern; neither is the number of bytes transferred. Instead, this scheduler simply tracks the number of I/O operations submitted by each user, trying to equalize the number from each.
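The "equalize the number of I/O operations per user" policy can be sketched as follows. This is a toy Python model written for these notes, not code from the patch set: the class name and its methods are invented, request sizes and locality are ignored on purpose, and ties are broken by the order in which users first submitted requests.

```python
from collections import deque


class IopsFairScheduler:
    """Dispatch the next request from the user with the fewest dispatched operations."""

    def __init__(self):
        self.queues = {}      # user -> deque of pending requests
        self.dispatched = {}  # user -> number of operations dispatched so far

    def submit(self, user, request):
        self.queues.setdefault(user, deque()).append(request)
        self.dispatched.setdefault(user, 0)

    def dispatch(self):
        # Only users with pending requests compete for the device.
        candidates = [u for u, q in self.queues.items() if q]
        if not candidates:
            return None
        user = min(candidates, key=lambda u: self.dispatched[u])
        self.dispatched[user] += 1
        return user, self.queues[user].popleft()


sched = IopsFairScheduler()
for i in range(4):
    sched.submit("A", f"a{i}")
for i in range(2):
    sched.submit("B", f"b{i}")

order = [sched.dispatch()[0] for _ in range(6)]
print(order)  # ['A', 'B', 'A', 'B', 'A', 'A']
```

The two users alternate while both have work queued, regardless of how large each request is, which is the contrast with byte-based accounting that the paragraph draws.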

The security problem: a SCSI pass-through SG_IO ioctl() issued to a particular disk partition (e.g. /dev/sdb2) or LVM volume causes the SCSI command to be sent to the underlying block device (/dev/sdb). The command thus crosses partition boundaries and may affect another VM.

Bonzini posted patches to disallow most SCSI commands on partition-like devices. So, doing any of the "dangerous" SCSI commands would fail unless the ioctl() is being called on the underlying block device. For now, however, the change is considered insufficiently tested and may break existing programs.

3 Safe device assignment with VFIO

Background: Some high-performance applications want to talk to devices directly. Virtualized guests can also be thought of as a sort of user-space process. The kernel's UIO interface has been available for the implementation of user-space drivers for some years. UIO has some shortcomings, though, including a lack of support for direct memory access (DMA) operations. An IOMMU can solve this problem: only specific regions of memory are accessible to devices. Technologies like KVM support a "device assignment" mechanism that uses the hardware capabilities to hand a device to a guest, but device assignment is not without its shortcomings. Among other things, device assignment alone cannot guarantee the isolation of a specific device, and it involves a fair amount of complexity in the kernel.

Solution:

Alex Williamson's VFIO patch set is an attempt to come up with a better solution that allows the development of safe, high-performance user-space drivers. It provides interfaces allowing those drivers to work with DMA and interrupts while keeping overall control over how devices access the system's resources. The prospects of this patch set are unclear.

Android's logger is still difficult to get merged.
