Category: LINUX
2016-01-03 16:19:57
63 days passed between 3.19 and 4.0; what exactly happened in that time?
1. Prominent features
1.1. Btrfs: Automatic defragmentation, scrubbing, performance improvements
Automatic defragmentation
COW (copy-on-write) filesystems have many advantages, but they also have some disadvantages, for example fragmentation. Btrfs lays out the data sequentially when files are written to the disk for the first time, but a COW design implies that any subsequent modification to the file must not be written on top of the old data, but be placed in a free block, which will cause fragmentation (RPM databases are a common case of this problem). Additionally, it suffers the fragmentation problems common to all filesystems.
Btrfs already offers alternatives to fight this problem: First, it supports online defragmentation using the command "btrfs filesystem defragment". Second, it has a mount option, -o nodatacow, that disables COW for data. Now btrfs adds a third option, the -o autodefrag mount option. This mechanism detects small random writes into files and queues them up for an automatic defrag process, so the filesystem will defragment itself while it's used. It isn't suited to virtualization or big database workloads yet, but works well for smaller files such as rpm, SQLite or bdb databases.
Scrub
"Scrubbing" is the process of checking the integrity of the data in the filesystem. This initial implementation of scrubbing will check the checksums of all the extents in the filesystem. If an error occurs (checksum or IO error), a good copy is searched for. If one is found, the bad copy will be rewritten.
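The verify-and-repair loop described above can be sketched roughly as follows. This is only an illustration, not the kernel code: the function name `scrub`, the in-memory "copies", and the use of `zlib.crc32` as a stand-in checksum are all assumptions made for the example.

```python
import zlib


def scrub(copies, expected_csum):
    """Verify redundant copies of one extent and repair the bad ones.

    `copies` is a list of bytearrays (hypothetical mirrors of one extent).
    Returns the number of copies rewritten, or None if no good copy exists.
    zlib.crc32 is only a stand-in for the filesystem's real checksum.
    """
    good = next((c for c in copies if zlib.crc32(c) == expected_csum), None)
    if good is None:
        return None  # every copy fails verification: nothing to repair from
    repaired = 0
    for c in copies:
        if zlib.crc32(c) != expected_csum:
            c[:] = good  # rewrite the bad copy from the good one
            repaired += 1
    return repaired


data = bytearray(b"extent payload")
mirrors = [bytearray(data), bytearray(b"extent pay1oad")]  # second copy is corrupt
print(scrub(mirrors, zlib.crc32(data)), mirrors[1] == data)  # prints "1 True"
```

The sketch only repairs when at least one copy still matches the stored checksum, which mirrors the "if one is found" condition in the text.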
Other improvements
-File creation/deletion speedup: The performance of file creation and deletion on btrfs was very poor. The reason is that for each creation or deletion, btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. Now btrfs can delay some b+ tree insertions or deletions, which allows these modifications to be batched. Microbenchmarks of file creation have been sped up by ~15%, and file deletion by ~20%.
-Do not flush csum items of unchanged file data: speeds up fsync. A sysbench workload doing "random write + fsync" went from 112.75 requests/sec to 1216 requests/sec.
-Quasi-round-robin for space allocation in multidevice setups: the chunk allocator currently always allocates space on the devices in the same order. This leads to a very uneven distribution, especially with RAID1 or RAID10 and an uneven number of devices. Now Btrfs always sorts the devices before allocating, and allocates the stripes on the devices with the most available space.
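The "sort, then allocate on the devices with the most available space" idea from the last item can be sketched as follows. This is a simplified model under stated assumptions: `allocate_chunk`, the device names and the unit-sized stripes are hypothetical and do not correspond to the Btrfs implementation.

```python
def allocate_chunk(free_space, num_stripes, stripe_size):
    """Place one chunk on the `num_stripes` devices with the most free space.

    `free_space` maps a device name to its free space and is updated in place.
    Returns the chosen device names, or None if too few devices have room.
    """
    candidates = sorted(
        (dev for dev, free in free_space.items() if free >= stripe_size),
        key=lambda dev: free_space[dev],
        reverse=True,  # most available space first
    )
    if len(candidates) < num_stripes:
        return None
    chosen = candidates[:num_stripes]
    for dev in chosen:
        free_space[dev] -= stripe_size
    return chosen


# An uneven number of devices with two stripes per chunk (RAID1-like).
devices = {"sda": 10, "sdb": 10, "sdc": 10}
for _ in range(6):
    allocate_chunk(devices, num_stripes=2, stripe_size=1)
print(devices)  # {'sda': 6, 'sdb': 6, 'sdc': 6}
```

Allocating in a fixed device order instead would fill the first two devices and leave the third untouched, which is the uneven distribution the item describes.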
1.2. sendmmsg(): batching of sendmsg() calls
recvmsg() and sendmsg() are the syscalls used to receive and send data over the network. In 2.6.33, Linux added recvmmsg(), a syscall that receives in a single call data that would otherwise need multiple recvmsg() calls, improving throughput and latency in a number of scenarios. Now an equivalent sendmmsg() syscall has been added. A microbenchmark saw a 20% improvement in throughput on UDP send and 30% on raw socket send.
1.3. XEN dom0 support
Finally, Linux has got Xen dom0 support.
1.4. Cleancache
Cleancache is an optional feature that can potentially increase page cache performance. It could be described as a memcached-like system, but for cache memory pages. It provides memory storage not directly accessible or addressable by the kernel, and it does not guarantee that the data will not vanish. It can be used by virtualization software to improve memory handling for guests, but it can also be useful to implement things like a compressed cache.
1.5. Berkeley Packet Filter just-in-time filtering
The Berkeley Packet Filter filtering capabilities, used by tools like libpcap/tcpdump, are normally handled by an interpreter. This release adds a simple JIT that generates native code when a filter is loaded in memory (something already done by other OSes). Administrators need to enable this feature by writing "1" to /proc/sys/net/core/bpf_jit_enable.
1.6. Wake on WLAN support
Wake on Wireless is a feature to allow the system to go into a low-power state (e.g. ACPI S3 suspend) while the wireless NIC remains active and does varying things for the host, e.g. staying connected to an AP or searching for networks. The 802.11 stack has added support for it.
1.7. Unprivileged ICMP_ECHO messages
This release makes it possible to send ICMP_ECHO messages (ping) and receive the corresponding ICMP_ECHOREPLY messages without any special privileges, similar to what is already implemented elsewhere. In other words, the patch makes it possible to implement a /bin/ping that needs neither setuid nor CAP_NET_RAW. Initially this functionality was written for Linux 2.4.32, but unfortunately it was never made public. The new functionality is disabled by default, and is enabled at bootup by supporting Linux distributions, optionally with restriction to a group or a group range.
1.8. setns() syscall: better namespace handling
Linux supports different namespaces for many of the resources it handles; for example, lightweight forms of virtualization such as systemd-nspawn show the virtualized processes a virtual PID different from the real PID. The same thing can be done with the filesystem directory structure, network resources, IPC, etc. The only way to set different namespace configurations was using different flags in the clone() syscall, but that system did not allow one process to access another process' namespace. The setns() syscall solves that problem.
1.9. Alarm-timers
Alarm-timers are a hybrid style timer, similar to high-resolution timers, but when the system is suspended, the RTC device is set to fire and wake the system for when the soonest alarm-timer expires. The concept for Alarm-timers was inspired by the Android Alarm driver, and the interface to userland uses the POSIX clock and timers interface, using two new clockids: CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM.
2. Driver and architecture-specific changes
All the driver and architecture-specific changes are documented separately.
3. VFS
Cache xattr security drop check for write: benchmarking on btrfs showed that a major scaling bottleneck on large systems is currently the xattr lookup on every write, which causes an additional tree walk, hits some per-filesystem locks and scales quite badly. This is also a problem in ext4, where it hits the global mbcache lock. Caching this check solves the problem.
4. Process scheduler
Increase SCHED_LOAD_SCALE resolution: With this extra resolution, the scheduler can handle deeper cgroup hierarchies and do better shares distribution and load balancing on larger systems (especially for low weight task groups)
Move the second half of ttwu() to the remote CPU: avoids having to take rq->lock and doing the task enqueue remotely, saving lots on cacheline transfers. A semaphore benchmark goes from 647278 worker burns per second to 816715
Next buddy hint on sleep and preempt path: a worst-case benchmark consisting of 2 tbench client processes with 2 threads each running on a single CPU changed from 105.84 MB/sec to 112.42 MB/sec
5. Memory management
Make mmu_gather preemptible
Batch activate_page() calls to reduce zone->lru_lock contention
tmpfs: implement generic xattr support
Memory cgroup controller:
Add memory.numastat API for NUMA statistics
Add the pagefault count into memcg stats
Reclaim memory from nodes in round-robin order
Remove the deprecated noswapaccount kernel parameter
6. Networking
Allow setting the network namespace by fd
Wireless
Add the ability to advertise possible interface combinations
Add support for scheduled scans
Add userspace authentication flag to mesh setup
New notification to discover mesh peer candidates.
Allow ethtool to set interface in loopback mode.
Allow no-cache copy from user on transmit
ipset: SCTP, UDPLITE support added
sctp: implement socket option SCTP_GET_ASSOC_ID_LIST, implement event notification SCTP_SENDER_DRY_EVENT
bridge: allow creating bridge devices with netlink, allow creating/deleting fdb entries via netlink
batman-adv: multi vlan support for bridge loop detection
pkt_sched: QFQ - quick fair queue scheduler
RDMA: Add netlink infrastructure that allows for registration of RDMA clients
7. File systems
BLOCK LAYER
Submit discard bio in batches in blkdev_issue_discard() - makes discarding data faster
EXT4
Enable "punch hole" functionality
Add support for multiple mount protection
CIFS
Add support for mounting Windows 2008 DFS shares
Convert cifs_writepages to use async writes
Add rwpidforward mount option, which enables a mode in which CIFS forwards the pid of the process that opened a file to every read and write operation
OCFS2
SSD trimming support
Support for moving extents
NILFS2
Implement resize ioctl
XFS
Add online discard support
8. Crypto
caam - Add support for the Freescale SEC4/CAAM
padlock - Add SHA-1/256 module for VIA Nano
s390: add System z hardware support for CTR mode, add System z hardware support for GHASH, add System z hardware support for XTS mode
s5p-sss - add S5PV210 advanced crypto engine support
9. Virtualization
User-mode Linux: add earlyprintk support, add ucast Ethernet transport
xen: add blkback support
10. Security
Allow the application of capability limits to usermode helpers
SELinux
add /sys/fs/selinux mount point to put selinuxfs
Make SELinux cache VFS RCU walks safe (improves VFS performance)
11. Tracing/profiling
perf stat: Add -d -d and -d -d -d options to show more CPU events
perf stat: Add --sync/-S option
12. Various core changes
rcu: priority boosting for TREE_PREEMPT_RCU
ulimit: raise default hard ulimit on number of files to 4096
cgroups
remove the Namespace cgroup subsystem. It has been replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file
Make 'procs' file writable
kbuild: implement several W= levels
PM/Hibernate: Add sysfs knob to control size of memory for drivers
posix-timers: RCU conversion
coredump: add support for exe_file in core name
1. Prominent features
1.1. Arbitrary version change
This release increases the version to 4.0. This switch from 3.x to 4.0 version numbers is, however, entirely meaningless and it should not be associated with any important changes in the kernel. This release could have been 3.20, but Linus Torvalds just got tired of the old number and changed it. Yes, it is frivolous. The less you think about it, the better.
1.2. Live patching
This release introduces "livepatch", a feature for live patching the kernel code, aimed primarily at systems that want to get security updates without needing to reboot. This feature was born as a result of merging kgraft and kpatch, two attempts by SuSE and Red Hat that were started to replace the now proprietary ksplice. It's relatively simple and minimalistic, as it makes use of existing kernel infrastructure (namely ftrace) as much as possible. It's also self-contained and it doesn't hook itself into any other kernel subsystems.
In this release livepatch is not feature complete, yet it provides a basic infrastructure for function "live patching" (i.e. code redirection), including API for kernel modules containing the actual patches, and API/ABI for userspace to be able to operate on the patches (look up what patches are applied, enable/disable them, etc). Most CVEs should be safe to apply this way. Only the x86 architecture is supported in this release, others will follow.
1.3. DAX - Direct Access, for persistent memory storage
Before being read by programs, files are usually first copied from the disk to the kernel caches, kept in RAM. But the possible advent of persistent non-volatile memory that would also be used as disk radically changes the way the kernel deals with this process: the kernel cache would become unnecessary overhead.
Linux has had, in fact, support for this kind of setup. But the code wasn't maintained and only supported ext2. In this release, Linux adds DAX (Direct Access, the X is for eXciting). DAX removes the extra copy incurred by the buffer by performing reads and writes directly to the persistent-memory storage device. For file mappings, the storage device is mapped directly into userspace. Support for ext4 has been added.
1.4. kasan, kernel address sanitizer
Kernel Address sanitizer (KASan) is a dynamic memory error detector. It provides a fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. Linux already has the kmemcheck feature, but unlike kmemcheck, KASan uses compile-time instrumentation, which makes it significantly faster than kmemcheck.
The main idea of KASAN is to use shadow memory to record whether each byte of memory is safe to access or not, and use compiler's instrumentation to check the shadow memory on each memory access. Address sanitizer uses 1/8 of the memory addressable in kernel for shadow memory and uses direct mapping with a scale and offset to translate a memory address to its corresponding shadow address.
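The scale-and-offset translation described above can be illustrated with a small sketch. The helper names are hypothetical, and the offset constant is an assumed, architecture-specific value used here purely for illustration; only the 1/8 scale comes from the text.

```python
KASAN_SHADOW_SCALE_SHIFT = 3               # 1/8 scale: one shadow byte per 8 bytes
KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000   # assumed offset, for illustration only
MASK_64 = (1 << 64) - 1                    # emulate 64-bit wrap-around arithmetic


def mem_to_shadow(addr):
    """Translate a memory address to the address of its shadow byte."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK_64


def shadow_size(region_size):
    """Shadow memory needed to cover `region_size` bytes."""
    return region_size >> KASAN_SHADOW_SCALE_SHIFT


base = 0xFFFF880000000000  # hypothetical 8-byte-aligned kernel address
# All 8 bytes of one aligned granule share a single shadow byte...
assert mem_to_shadow(base) == mem_to_shadow(base + 7)
# ...and the next granule maps to the next shadow byte.
assert mem_to_shadow(base + 8) == mem_to_shadow(base) + 1
print(shadow_size(1 << 30) // (1 << 20), "MiB of shadow per GiB")  # 128 MiB of shadow per GiB
```

Because the mapping is a shift plus a constant, the instrumented check before each memory access stays cheap, which fits the speed claim made for KASan above.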
1.5. "lazytime" option for better update of file timestamps
Unix filesystems keep track of information about files, such as the last time a file was accessed or modified. Keeping track of this information is very expensive, especially the time when a file was accessed ("atime"), which encourages many people to disable it with the mount option "noatime". To alleviate this problem, the "relatime" mount option was added: with it, the atime is only updated if the previous value is earlier than the modification time, or if the file was last accessed more than 24 hours ago. This behaviour, however, breaks some programs that rely on accurate access time tracking to work, and it also goes against the POSIX standard.
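The relatime rule stated above can be written as a small predicate. This sketch follows the wording of the text only (it is not the kernel's implementation); the function name and the use of plain integer seconds are assumptions for the example.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_needs_update(atime, mtime, now):
    """Return True if relatime would update the access time on this read.

    All arguments are timestamps in seconds. The rule, as described in the
    text: update when the old atime is earlier than the modification time,
    or when the last access is more than 24 hours old.
    """
    return atime < mtime or now - atime > DAY


now = 1_000_000
print(relatime_needs_update(atime=now - 10, mtime=now - 5, now=now))       # True: modified since last access
print(relatime_needs_update(atime=now - 10, mtime=now - 100, now=now))     # False: recent access, no modification
print(relatime_needs_update(atime=now - 2 * DAY, mtime=now - 3 * DAY, now=now))  # True: access older than 24 hours
```

The second case is the one that saves disk writes, and also the one where programs depending on exact access times see a stale value.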
In this release, Linux adds another alternative: "lazytime". Lazytime causes access, modified and changed time updates to only be made in the cache. The times will only be written to the disk if the inode needs to be updated anyway for some non-time related change, if fsync(), syncfs() or sync() are called, or just before an undeleted inode is evicted from memory. This is POSIX compliant, while at the same time improving the performance.
1.6. Multiple lower layers in overlayfs
In overlayfs, multiple lower layers can now be given using the colon (":") as a separator character between the directory names. For example:
mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged
The specified lower directories will be stacked beginning from the rightmost one and going left. In the above example lower1 will be the top, lower2 the middle and lower3 the bottom layer. "upperdir=" and "workdir=" may be omitted; in that case the overlay will be read-only.
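The stacking order can be modelled with a short sketch. It is a toy model, assuming that a name is resolved by the topmost layer that contains it and ignoring whiteouts, opaque directories and directory merging; the function names and the `contents` mapping are hypothetical.

```python
def parse_lowerdir(option):
    """Split a lowerdir= value into layers ordered from top to bottom."""
    return option.split(":")


def resolve(name, layers, contents):
    """Return the topmost layer that provides `name`, or None.

    `contents` maps each layer directory to the set of names it holds.
    """
    for layer in layers:  # leftmost entry is the top layer
        if name in contents.get(layer, set()):
            return layer
    return None


layers = parse_lowerdir("/lower1:/lower2:/lower3")
contents = {
    "/lower1": {"a"},
    "/lower2": {"a", "b"},
    "/lower3": {"b", "c"},
}
print(layers[0], layers[-1])            # /lower1 /lower3  (top, bottom)
print(resolve("a", layers, contents))   # /lower1
print(resolve("b", layers, contents))   # /lower2
print(resolve("c", layers, contents))   # /lower3
```

In this model a file present in several layers is always served from the one listed furthest to the left, matching the "lower1 will be the top" statement.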
1.7. Support Parallel NFS server, default to NFS v4.2
Parallel NFS (pNFS) is a part of the NFS v4.1 standard that allows compute clients to access storage devices directly and in parallel. The pNFS architecture eliminates the scalability and performance issues associated with NFS servers deployed today. This is achieved by the separation of data and metadata, and moving the metadata server out of the data path.
This release adds support for pNFS server, and drivers for the block layout with XFS support to use XFS filesystems as a block layout target, and the flexfiles layout.
Also, in this release the NFS server defaults to NFS v4.2.
1.8. dm-crypt scalability improvements
This release significantly increases the dm-crypt CPU scalability performance thanks to changes that enable effective use of an unbound workqueue across all available CPUs. A large battery of tests was performed to validate these changes.
2. File systems
XFS
Adds support for sys_renameat2()
Remove deprecated sysctls xfsbufd_centisecs and age_buffer_centisecs
EXT4
Support "readonly" filesystem flag to mark a FS image as read-only, tunable with tune2fs. It prevents the kernel and e2fsprogs from changing the image
Btrfs
Add code to support file creation time
NFSv4.1
Allow parallel LOCK/LOCKU calls
Allow parallel OPEN/OPEN_DOWNGRADE/CLOSE
UBIFS
Add security.* XATTR support for the UBIFS
Add xattr support for symlinks
OCFS2
Add a mount option journal_async_commit on ocfs2 filesystem. When this feature is enabled, the journal commit block can be written to disk without waiting for descriptor blocks, which can improve journal commit performance. With the fs_mark benchmark, journal_async_commit shows a 50% improvement
Currently, in the case of an append O_DIRECT write (block not allocated yet), ocfs2 falls back to buffered I/O. This has some disadvantages. In this version, the direct I/O write no longer falls back to a buffered I/O write, because block allocation is now enabled in direct I/O
F2FS
Introduce a batched trim
Support "norecovery" mount option, which is mostly the same as "disable_roll_forward". The only difference is that "norecovery" should be activated together with the read-only mount option. This can be used when the user wants to check whether f2fs is mountable or not without any recovery process
Add F2FS_IOC_GETVERSION ioctl for getting i_generation from inode; after that, users can list a file's generation number by using "lsattr -v"
3. Block
Ported to blk-multiqueue
loop: Add blk-mq support, which greatly improves performance for sequential and random reads
Device-mapper
rbd
UBI
blk-multiqueue: Add support for tag allocation policies and make libata use this blk-mq tagging, instead of rolling their own
UBI: Implement UBI_METAONLY, a new open mode for UBI volumes, it indicates that only meta data is being changed
4. Core (various)
pstore: Add pmsg - user-space accessible pstore object
rcu: Optionally run grace-period kthreads at real-time priority. Recent testing has shown that under heavy load, running RCU's grace-period kthreads at real-time priority can improve performance and reduce the incidence of RCU CPU stall warnings
GDB scripts for debugging the kernel. If you load vmlinux into gdb with the option enabled, the helper scripts will be automatically imported by gdb as well, and additional functions are available to analyze a Linux kernel instance. See Documentation/gdb-kernel-debugging.txt for further details
Remove CONFIG_INIT_FALLBACK
5. Memory management
cgroups: Per memory cgroup slab shrinkers
slub: optimize memory alloc/free fastpath by removing preemption on/off
Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can detect zero_page in /proc/kpageflags, and then do memory analysis more accurately
Make /dev/mem an optional device
Add support for resetting peak RSS, which can be retrieved from the VmHWM field in /proc/pid/status, by writing "5" to /proc/pid/clear_refs
Show page size in /proc/pid/numa_maps as "kernelpagesize_kB" field to help identify the size of the pages that back memory areas mapped by a given task. This is especially useful to help differentiate between HUGE and GIGANTIC page backed VMAs
geneve: Add Geneve GRO support
zsmalloc: add statistics support
Incorporate read-only pages into transparent huge pages
memcontrol cgroup: Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. The old interface will be maintained, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually
Replace remap_file_pages() syscall with emulation
6. Virtualization
KVM: Add generic support for page modification logging, a new feature in Intel "Broadwell" Xeon CPUs that speeds up dirty page tracking
vfio: Add device request interface indicating that the device should be released
vmxnet3: Make Rx ring 2 size configurable by adjusting rx-jumbo parameter of ethtool -G
virtio_net: add software timestamp support
virtio_pci: modern driver, add an option to disable the legacy driver
7. Cryptography
aesni: Add support for 192 & 256 bit keys to AES-NI RFC4106
algif_rng: add random number generator support
octeon: add MD5 module
qat: add support for CBC(AES) ablkcipher
8. Security
SELinux: Add security hooks to the Android Binder that enable security modules such as SELinux to implement controls over Binder IPC. The security hooks include support for controlling what process can become the Binder context manager, invoke a binder transaction/IPC to another process, transfer a binder reference to another process, or transfer an open file to another process. These hooks have been included in the Android kernel trees since Android 4.3.
SMACK: secmark support for netfilter.
TPM 2.0 support.
Device class for TPM, sysfs files are moved from /sys/class/misc/tpmX/device/ to /sys/class/tpm/tpmX/device/.
9. Tracing & perf
perf mem: Enable sampling loads and stores simultaneously, it could only do one or the other before yet there was no hardware restriction preventing simultaneous collection
perf tools: Support parameterized and symbolic events
AMD range breakpoints support: breakpoints are extended to support address ranges through perf events, with initial backend support for AMD extended breakpoints. For example, to set a write breakpoint from 0x1000 to 0x1200 (0x1000 + 512): perf record -e mem:0x1000/512:w
10. Networking
TCP: Add the possibility to define a per route/destination congestion control algorithm. This opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics
Mitigate TCP "ACK loop" DoS scenarios by rate-limiting outgoing duplicate ACKs sent in response to incoming "out of window" segments.
udpv6: Add lockless sendmsg() support, thus allowing multiple threads to send to a single socket more efficiently
ipv4: Automatically bring up DSA master network devices, which allows DSA slave network devices to be used as valid interfaces for e.g: NFS root booting by allowing kernel IP auto-configuration to succeed on these interfaces
ipv6: Add sysctl entry(accept_ra_mtu) to disable MTU updates from router advertisements
vxlan: Implement support for the Group Policy extension to provide a lightweight and simple security label mechanism across network peers based on VXLAN. It allows further mapping to a SELinux context using SECMARK, to implement ACLs directly with nftables, iptables, OVS, tc, etc
vxlan: Add support for remote checksum offload in VXLAN.
net: openvswitch: Support masked set actions.
Infiniband: Add support for extensible query device capabilities verb to allow adding new features
Layer 2 Tunneling Protocol (l2tp): multicast notification to the registered listeners when the tunnels/sessions are created/modified/deleted
SUNRPC: Set SO_REUSEPORT socket option for TCP connections to bind multiple TCP connections to the same source address+port combination
tipc: involve namespace infrastructure
802.15.4: introduce support for cca settings
Wireless
Add new GCMP, GCMP-256, CCMP-256, BIP-GMAC-128, BIP-GMAC-256, and BIP-CMAC-256 cipher suites. These new cipher suites were defined in IEEE Std 802.11ac-2013
New NL80211_ATTR_NETNS_FD, which allows setting the namespace via nl80211 by fd
Support per-TID station statistics
Allow including station info in delete event
Allow usermode to query wiphy specific regdom
bridge
offload bridge port attributes to switch ASIC if feature flag set
Support for allowing userspace to pack multiple vlans and VLAN ranges in setlink and dellink requests for improved performance
Add ability to enable TSO
Near Field Communication (NFC)
HCI over NCI protocol support (Some secure elements only understand HCI and thus we need to send them HCI frames)
NCI NFCEE (NFC Execution Environment, typically an embedded or external secure element) discovery and enabling/disabling support
NFC_EVT_TRANSACTION userspace API addition, it is sent through netlink in order for a specific application running on a secure element to notify userspace of an event
Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload option SOF_TIMESTAMPING_OPT_TSONLY is set. A new sysctl (tstamp_allow_data) optionally drops looped timestamps with data. This only affects processes without CAP_NET_RAW
Bluetooth
Enable LE Data Length Extension feature from Bluetooth 4.2 specification
Expose information in debugfs: Secure Simple Pairing, debug keys usage setting, hardware error code, remote OOB information
HCI Read Stored Link Keys
HCI Delete Stored Link Key
Support static address when BR/EDR has been disabled
tc: add BPF-based action. This action provides a possibility to execute custom BPF code
net: sched: Introduce connmark action
Add Transparent Ethernet Bridging GRO support
netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads
netfilter: nft_compat: add ebtables support
network namespace: Add rtnl cmd to add and get peer netns ids. A user can define an id for a peer netns by providing a FD or a PID. These ids are local to the netns where they are added (i.e. valid only within this netns)
openvswitch: Add support for checksums on UDP tunnels.
openvswitch: Support VXLAN Group Policy extension