Category: LINUX

2016-01-03 16:19:57

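How different is the current Linux kernel from the Linux 2.6 kernel?

We are already at 4.x, yet reportedly neither the jump from 2.6 to 3.0 nor the one from 3.19 to 4.0 had any major reason behind it. So how different is the current kernel from the 2.6 era?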
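Zhihu user, former Red Hat employee, current Canonical employee
For the specific changes in each release, see here:

Let me start copying it over:
Releases 2.6.39 and 3.0 were 64 days apart. So what actually happened?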
1. Prominent features

1.1. Btrfs: Automatic defragmentation, scrubbing, performance improvements

Automatic defragmentation

COW (copy-on-write) filesystems have many advantages, but they also have some disadvantages, for example fragmentation. Btrfs lays out the data sequentially when files are written to the disk for the first time, but a COW design implies that any subsequent modification to the file must not be written on top of the old data, but be placed in a free block, which will cause fragmentation (RPM databases are a common case of this problem). Additionally, it suffers the fragmentation problems common to all filesystems.

Btrfs already offers alternatives to fight this problem: First, it supports online defragmentation using the command "btrfs filesystem defragment". Second, it has a mount option, -o nodatacow, that disables COW for data. Now btrfs adds a third option, the -o autodefrag mount option. This mechanism detects small random writes into files and queues them up for an automatic defrag process, so the filesystem will defragment itself while it's used. It isn't suited to virtualization or big database workloads yet, but works well for smaller files such as rpm, SQLite or bdb databases.

Scrub

"Scrubbing" is the process of checking the integrity of the data in the filesystem. This initial implementation of scrubbing will check the checksums of all the extents in the filesystem. If an error occurs (checksum or IO error), a good copy is searched for. If one is found, the bad copy will be rewritten.

Other improvements

-File creation/deletion speedup: The performance of file creation and deletion on btrfs was very poor. The reason is that for each creation or deletion, btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. Now btrfs can delay some b+ tree insertions or deletions, which allows these modifications to be batched. Microbenchmarks of file creation have been sped up by ~15%, and file deletion by ~20%.

-Do not flush csum items of unchanged file data: speeds up fsync. A sysbench workload doing "random write + fsync" went from 112.75 requests/sec to 1216 requests/sec.

-Quasi-round-robin for space allocation in multidevice setups: the chunk allocator currently always allocates space on the devices in the same order. This leads to a very uneven distribution, especially with RAID1 or RAID10 and an uneven number of devices. Now Btrfs always sorts the devices before allocating, and allocates the stripes on the devices with the most available space.
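
The effect of sorting devices by free space before each allocation can be illustrated with a toy model. This is only a sketch of the idea, not Btrfs code: the function name `allocate_chunk`, the device names and the sizes are all hypothetical, and the real chunk allocator handles many more constraints.

```python
def allocate_chunk(free_space, num_stripes, stripe_size):
    """Place one chunk on the num_stripes devices with the most free space.

    free_space maps a device name to its free space (arbitrary units) and is
    updated in place. Returns the list of devices that received a stripe.
    """
    # Sort devices so that the ones with the most available space come first.
    ranked = sorted(free_space, key=lambda dev: free_space[dev], reverse=True)
    chosen = ranked[:num_stripes]
    if len(chosen) < num_stripes or any(free_space[d] < stripe_size for d in chosen):
        raise RuntimeError("not enough free space for a new chunk")
    for dev in chosen:
        free_space[dev] -= stripe_size
    return chosen


if __name__ == "__main__":
    # RAID1-like case: two stripes per chunk on an odd number of devices.
    devices = {"sda": 10, "sdb": 10, "sdc": 10}
    for _ in range(3):
        print(allocate_chunk(devices, num_stripes=2, stripe_size=1), devices)
```

With a fixed device order the first two devices would fill up while the third stayed empty; re-sorting before every allocation keeps usage even across all three.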

1.2. sendmmsg(): batching of sendmsg() calls

recvmsg() and sendmsg() are the syscalls used to receive/send data to the network. In 2.6.33, Linux added recvmmsg(), a syscall that receives in a single call the data that would otherwise need multiple recvmsg() calls, improving throughput and latency for a number of scenarios. Now, an equivalent sendmmsg() syscall has been added. A microbenchmark saw a 20% improvement in throughput on UDP send and 30% on raw socket send.

1.3. XEN dom0 support

Finally, Linux has Xen dom0 support.

1.4. Cleancache

Cleancache is an optional feature that can potentially increase page cache performance. It could be described as a memcached-like system, but for cache memory pages. It provides memory storage not directly accessible or addressable by the kernel, and it does not guarantee that the data will not vanish. It can be used by virtualization software to improve memory handling for guests, but it can also be useful to implement things like a compressed cache.

1.5. Berkeley Packet Filter just-in-time filtering

The Berkeley Packet Filter filtering capabilities, used by tools like libpcap/tcpdump, are normally handled by an interpreter. This release adds a simple JIT that generates native code when a filter is loaded in memory (something other OSes already do). Administrators need to enable this feature by writing "1" to /proc/sys/net/core/bpf_jit_enable.
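
A minimal sketch of toggling this knob from a script, assuming Python and root privileges on a real system. The helper names `set_bpf_jit` and `bpf_jit_enabled` are made up for illustration; only the sysctl path comes from the text above. The `path` parameter exists so the helpers can be tried against an ordinary file.

```python
from pathlib import Path

# Path taken from the release notes; writing to it requires root.
BPF_JIT_SYSCTL = Path("/proc/sys/net/core/bpf_jit_enable")


def set_bpf_jit(enabled, path=BPF_JIT_SYSCTL):
    """Write "1" (enable) or "0" (disable) to the sysctl file."""
    Path(path).write_text("1" if enabled else "0")


def bpf_jit_enabled(path=BPF_JIT_SYSCTL):
    """Return True when the sysctl file holds a non-zero value."""
    return Path(path).read_text().strip() != "0"
```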

1.6. Wake on WLAN support

Wake on Wireless is a feature to allow the system to go into a low-power state (e.g. ACPI S3 suspend) while the wireless NIC remains active and does varying things for the host, e.g. staying connected to an AP or searching for networks. The 802.11 stack has added support for it.

1.7. Unprivileged ICMP_ECHO messages

This release makes it possible to send ICMP_ECHO messages (ping) and receive the corresponding ICMP_ECHOREPLY messages without any special privileges. In other words, the patch makes it possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. Initially this functionality was written for Linux 2.4.32, but unfortunately it was never made public. The new functionality is disabled by default, and is enabled at bootup by supporting Linux distributions, optionally with restriction to a group or a group range.
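
A hedged sketch of what a setuid-less ping could look like from userspace, in Python. It assumes the feature is exposed as a datagram socket with protocol IPPROTO_ICMP and that access is governed by the net.ipv4.ping_group_range sysctl; both are assumptions to check against your kernel's documentation. The helper names are hypothetical.

```python
import socket
import struct

ICMP_ECHO = 8  # ICMP type of an echo request


def icmp_checksum(data):
    """Internet checksum (RFC 1071) over data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident, seq, payload=b""):
    """Build an ICMP echo request: type, code, checksum, id, sequence, payload."""
    header = struct.pack("!BBHHH", ICMP_ECHO, 0, 0, ident, seq)
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", ICMP_ECHO, 0, csum, ident, seq) + payload


def open_ping_socket():
    """Open an unprivileged ICMP socket (assumed interface, see lead-in).

    Expect an OSError such as PermissionError when the feature is disabled
    or the caller's group is not allowed to use it.
    """
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_ICMP)
```

A caller would then use `sock.sendto(build_echo_request(0, 1, b"hi"), (host, 0))` and read the reply with `sock.recvfrom()`, with no raw socket involved.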

1.8. setns() syscall: better namespace handling

Linux supports different namespaces for many of the resources it handles; for example, lightweight forms of virtualization such as systemd-nspawn show the virtualized processes a virtual PID different from the real PID. The same thing can be done with the filesystem directory structure, network resources, IPC, etc. The only way to set different namespace configurations was using different flags in the clone() syscall, but that system did not allow one process to access another process's namespace. The setns() syscall solves that problem.
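
As a rough illustration, joining another process's namespace amounts to opening a file descriptor that identifies the namespace and handing it to setns(). The sketch below assumes Python 3.12 or newer (which provides `os.setns`), a /proc/<pid>/ns/ directory, and sufficient privileges (typically CAP_SYS_ADMIN); the helper names are hypothetical.

```python
import os


def namespace_path(pid, kind):
    """Return the /proc path whose file descriptor identifies a namespace."""
    return "/proc/%d/ns/%s" % (pid, kind)


def join_namespace(pid, kind="net"):
    """Move the calling thread into the `kind` namespace of process `pid`."""
    fd = os.open(namespace_path(pid, kind), os.O_RDONLY)
    try:
        # nstype is left at its default of 0: accept whatever type fd refers to.
        os.setns(fd)
    finally:
        os.close(fd)
```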

1.9. Alarm-timers

Alarm-timers are a hybrid style timer, similar to high-resolution timers, but when the system is suspended, the RTC device is set to fire and wake the system when the soonest alarm-timer expires. The concept for Alarm-timers was inspired by the Android Alarm driver, and the interface to userland uses the POSIX clock and timers interface, using two new clockids: CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM.

2. Driver and architecture-specific changes

All the driver and architecture-specific changes can be found in the

3. VFS
  • Cache xattr security drop check for write: benchmarking on btrfs showed that a major scaling bottleneck on large systems is currently the xattr lookup on every write, which causes an additional tree walk, hits some per-filesystem locks and scales quite badly. This is also a problem in ext4, where it hits the global mbcache lock. Caching this check solves the problem

4. Process scheduler
  • Increase SCHED_LOAD_SCALE resolution: With this extra resolution, the scheduler can handle deeper cgroup hierarchies and do better shares distribution and load balancing on larger systems (especially for low weight task groups)

  • Move the second half of ttwu() to the remote CPU: avoids having to take rq->lock and doing the task enqueue remotely, saving lots on cacheline transfers. A semaphore benchmark goes from 647278 worker burns per second to 816715

  • Next buddy hint on sleep and preempt path: a worst-case benchmark consisting of 2 tbench client processes with 2 threads each running on a single CPU changed from 105.84 MB/sec to 112.42 MB/sec

5. Memory management
  • Make mmu_gather preemptible

  • Batch activate_page() calls to reduce zone->lru_lock contention

  • tmpfs: implement generic xattr support

  • Memory cgroup controller:

    • Add memory.numastat API for NUMA statistics

    • Add the pagefault count into memcg stats

    • Reclaim memory from nodes in round-robin order

    • Remove the deprecated noswapaccount kernel parameter

6. Networking
  • Allow setting the network namespace by fd

  • Wireless

    • Add the ability to advertise possible interface combinations

    • Add support for scheduled scans

    • Add userspace authentication flag to mesh setup

    • New notification to discover mesh peer candidates.

  • Allow ethtool to set interface in loopback mode.

  • Allow no-cache copy from user on transmit

  • ipset: SCTP, UDPLITE support added

  • sctp: implement socket option SCTP_GET_ASSOC_ID_LIST, implement event notification SCTP_SENDER_DRY_EVENT

  • bridge: allow creating bridge devices with netlink, allow creating/deleting fdb entries via netlink

  • batman-adv: multi vlan support for bridge loop detection

  • pkt_sched: QFQ - quick fair queue scheduler

  • RDMA: Add netlink infrastructure that allows for registration of RDMA clients

7. File systems

BLOCK LAYER

  • Submit discard bio in batches in blkdev_issue_discard() - makes discarding data faster

EXT4

  • Enable "punch hole" functionality

  • Add support for multiple mount protection

CIFS

  • Add support for mounting Windows 2008 DFS shares

  • Convert cifs_writepages to use async writes

  • Add rwpidforward mount option that enables a mode in which CIFS forwards the pid of the process that opened a file to any read and write operation

OCFS2

  • SSD trimming support

  • Support for moving extents

NILFS2

  • Implement resize ioctl

XFS

  • Add online discard support

8. Crypto
  • caam - Add support for the Freescale SEC4/CAAM

  • padlock - Add SHA-1/256 module for VIA Nano

  • s390: add System z hardware support for CTR mode, add System z hardware support for GHASH, add System z hardware support for XTS mode

  • s5p-sss - add S5PV210 advanced crypto engine support

9. Virtualization
  • User-mode Linux: add earlyprintk support, add ucast Ethernet transport

  • xen: add blkback support

10. Security
  • Allow the application of capability limits to usermode helpers

  • SELinux

    • add /sys/fs/selinux mount point to put selinuxfs

    • Make SELinux cache VFS RCU walks safe (improves VFS performance)

11. Tracing/profiling
  • perf stat: Add -d -d and -d -d -d options to show more CPU events

  • perf stat: Add --sync/-S option

12. Various core changes
  • rcu: priority boosting for TREE_PREEMPT_RCU

  • ulimit: raise default hard ulimit on number of files to 4096

  • cgroups

    • remove the Namespace cgroup subsystem. It has been replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file

    • Make 'procs' file writable

  • kbuild: implement several W= levels

  • PM/Hibernate: Add sysfs knob to control size of memory for drivers

  • posix-timers: RCU conversion

  • coredump: add support for exe_file in core name

3.19 and 4.0 were released 63 days apart. What actually happened in between?
1. Prominent features

1.1. Arbitrary version change

This release increases the version to 4.0. This switch from 3.x to 4.0 version numbers is, however, entirely meaningless and it should not be associated with any important changes in the kernel. This release could have been 3.20, but Linus Torvalds just got tired of the old number and changed it. Yes, it is frivolous. The less you think about it, the better.

1.2. Live patching

This release introduces "livepatch", a feature for live patching the kernel code, aimed primarily at systems that need security updates without rebooting. This feature was born as a result of merging kgraft and kpatch, two attempts by SuSE and Red Hat that were started to replace the now proprietary ksplice. It's relatively simple and minimalistic, as it makes use of existing kernel infrastructure (namely ftrace) as much as possible. It's also self-contained and doesn't hook itself into any other kernel subsystems.

In this release livepatch is not feature complete, yet it provides a basic infrastructure for function "live patching" (i.e. code redirection), including API for kernel modules containing the actual patches, and API/ABI for userspace to be able to operate on the patches (look up what patches are applied, enable/disable them, etc). Most CVEs should be safe to apply this way. Only the x86 architecture is supported in this release, others will follow.
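
The userspace side (looking up which patches are applied and enabling or disabling them) can be sketched as a few file operations. This assumes the interface lives in sysfs under /sys/kernel/livepatch/<patch>/enabled, which should be verified against the kernel's ABI documentation; the helper names and the `root` parameter are hypothetical and only there so the code can be exercised against a scratch directory.

```python
from pathlib import Path

# Assumed sysfs location of the livepatch interface.
LIVEPATCH_ROOT = Path("/sys/kernel/livepatch")


def list_patches(root=LIVEPATCH_ROOT):
    """Map each loaded patch name to True (enabled) or False (disabled)."""
    root = Path(root)
    if not root.is_dir():
        return {}  # livepatch not available or no patch loaded
    return {
        entry.name: (entry / "enabled").read_text().strip() == "1"
        for entry in sorted(root.iterdir())
        if (entry / "enabled").is_file()
    }


def set_patch_enabled(name, enabled, root=LIVEPATCH_ROOT):
    """Enable or disable a loaded patch by writing to its 'enabled' file."""
    (Path(root) / name / "enabled").write_text("1" if enabled else "0")
```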

1.3. DAX - Direct Access, for persistent memory storage

Before being read by programs, files are usually first copied from the disk to the kernel caches, kept in RAM. But the possible advent of persistent non-volatile memory that would also be used as a disk radically changes the way the kernel deals with this process: the kernel cache would become unnecessary overhead.

Linux has, in fact, had support for this kind of setup. But the code wasn't maintained and only supported ext2. In this release, Linux adds DAX (Direct Access, the X is for eXciting). DAX removes the extra copy incurred by the buffer by performing reads and writes directly to the persistent-memory storage device. For file mappings, the storage device is mapped directly into userspace. Support for ext4 has been added.

1.4. kasan, kernel address sanitizer

Kernel Address sanitizer (KASan) is a dynamic memory error detector. It provides a fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. Linux already has the kmemcheck feature, but unlike kmemcheck, KASan uses compile-time instrumentation, which makes it significantly faster than kmemcheck.

The main idea of KASan is to use shadow memory to record whether each byte of memory is safe to access or not, and to use the compiler's instrumentation to check the shadow memory on each memory access. Address sanitizer uses 1/8 of the memory addressable in kernel for shadow memory and uses direct mapping with a scale and offset to translate a memory address to its corresponding shadow address.
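
The "scale and offset" mapping can be written down in a few lines. The sketch below models it in Python purely for illustration: the scale shift of 3 follows from the 1/8 ratio in the text, while the offset value is an assumed x86_64 constant and the shadow-byte encoding (0 means all 8 bytes are accessible, 1 to 7 means only the first N bytes are, anything else is poisoned) is stated from memory and should be checked against the kernel documentation.

```python
KASAN_SHADOW_SCALE_SHIFT = 3              # one shadow byte describes 2**3 = 8 bytes
KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000  # assumed x86_64 offset, check your config
ADDR_MASK = (1 << 64) - 1                 # emulate 64-bit wrap-around


def mem_to_shadow(addr):
    """Translate a kernel address to the address of its shadow byte."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & ADDR_MASK


def byte_accessible(shadow_value, addr):
    """Decide whether a 1-byte access at addr is valid given its shadow byte.

    shadow_value is interpreted as a signed byte (assumed encoding, see above).
    """
    if shadow_value == 0:
        return True                        # whole 8-byte granule is accessible
    if 0 < shadow_value < 8:
        return (addr & 7) < shadow_value   # only the first N bytes are accessible
    return False                           # negative values mark poisoned memory
```

Because of the shift by 3, eight consecutive addresses share one shadow byte, which is where the 1/8 memory overhead comes from.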

1.5. "lazytime" option for better update of file timestamps

Unix filesystems keep track of information about files, such as the last time a file was accessed or modified. Keeping track of this information is very expensive, especially the time when a file was accessed ("atime"), which encourages many people to disable it with the mount option "noatime". To alleviate this problem, the "relatime" mount option was added: with it, the atime is only updated if the previous value is earlier than the modification time, or if the file was last accessed more than 24 hours ago. This behaviour, however, breaks some programs that rely on accurate access time tracking to work, and it's also against the POSIX standard.

In this release, Linux adds another alternative: "lazytime". Lazytime causes access, modified and changed time updates to only be made in the cache. The times will only be written to the disk if the inode needs to be updated anyway for some non-time related change, if fsync(), syncfs() or sync() are called, or just before an undeleted inode is evicted from memory. This is POSIX compliant, while at the same time improving the performance.
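
The two policies can be contrasted with a small model. This is a sketch of the rules as described above, not kernel code: the function names and the event names are hypothetical, and the relatime check also compares against the change time and treats equal timestamps as stale, which is an assumption about the kernel's exact comparison.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_should_update(atime, mtime, ctime, now):
    """relatime: refresh atime only when it is not newer than mtime/ctime,
    or when it is at least 24 hours old. Arguments are Unix timestamps."""
    return atime <= mtime or atime <= ctime or now - atime >= DAY


# Hypothetical names for the situations listed in the text above.
LAZYTIME_FLUSH_EVENTS = frozenset(
    {"other_inode_change", "fsync", "syncfs", "sync", "inode_eviction"}
)


def lazytime_writes_timestamps(event):
    """lazytime: timestamps live in memory until one of these events occurs."""
    return event in LAZYTIME_FLUSH_EVENTS
```

Under relatime a plain read of a recently accessed file changes nothing at all; under lazytime the in-memory atime is still updated, it is just not written to disk until one of the listed events occurs.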

1.6. Multiple lower layers in overlayfs

In overlayfs, multiple lower layers can now be given using the colon (":") as a separator character between the directory names. For example:

  • mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged

The specified lower directories will be stacked beginning from the rightmost one and going left. In the above example lower1 will be the top, lower2 the middle and lower3 the bottom layer. "upperdir=" and "workdir=" may be omitted; in that case the overlay will be read-only.
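
A small helper makes the ordering rule explicit. It is a sketch in Python with hypothetical function names; it only builds the option string passed to `mount -o` and does not call mount itself. It also assumes that "upperdir=" and "workdir=" have to be given together.

```python
def overlay_options(lowerdirs, upperdir=None, workdir=None):
    """Build the -o string for an overlay mount.

    lowerdirs is ordered top-most layer first, exactly as it will appear in
    "lowerdir=". Omitting upperdir and workdir gives a read-only overlay.
    """
    if (upperdir is None) != (workdir is None):
        raise ValueError("upperdir and workdir must be given together")
    opts = ["lowerdir=" + ":".join(lowerdirs)]
    if upperdir is not None:
        opts += ["upperdir=" + upperdir, "workdir=" + workdir]
    return ",".join(opts)


def layers_top_to_bottom(lowerdirs, upperdir=None):
    """Return the layers in lookup order: upper layer (if any), then lowers."""
    return ([upperdir] if upperdir is not None else []) + list(lowerdirs)


if __name__ == "__main__":
    print(overlay_options(["/lower1", "/lower2", "/lower3"]))
```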

1.7. Support Parallel NFS server, default to NFS v4.2

Parallel NFS (pNFS) is a part of the NFS v4.1 standard that allows compute clients to access storage devices directly and in parallel. The pNFS architecture eliminates the scalability and performance issues associated with NFS servers deployed today. This is achieved by the separation of data and metadata, and moving the metadata server out of the data path.

This release adds support for a pNFS server, and drivers for the block layout (with XFS support, so XFS filesystems can be used as a block layout target) and the flexfiles layout.

Also, in this release the NFS server defaults to NFS v4.2.

1.8. dm-crypt scalability improvements

This release significantly increases the dm-crypt CPU scalability performance thanks to changes that enable effective use of an unbound workqueue across all available CPUs. A large battery of tests was performed to validate these changes; a summary of the results is available here

2. File systems
  • XFS

    • Adds support for sys_renameat2()

    • Remove deprecated sysctls xfsbufd_centisecs and age_buffer_centisecs

  • EXT4

    • Support "readonly" filesystem flag to mark a FS image as read-only, tunable with tune2fs. It prevents the kernel and e2fsprogs from changing the image

  • Btrfs

    • Add code to support file creation time

  • NFSv4.1

    • Allow parallel LOCK/LOCKU calls

    • Allow parallel OPEN/OPEN_DOWNGRADE/CLOSE

  • UBIFS

    • Add security.* XATTR support for the UBIFS

    • Add xattr support for symlinks

  • OCFS2

    • Add a mount option journal_async_commit on ocfs2 filesystem. When this feature is enabled, the journal commit block can be written to disk without waiting for descriptor blocks, which can improve journal commit performance. With the fs_mark benchmark, journal_async_commit shows a 50% improvement

    • Previously, an appending O_DIRECT write (block not allocated yet) made ocfs2 fall back to buffered I/O, which has some disadvantages. In this version, the direct I/O write no longer falls back to buffered I/O, because block allocation is now supported in direct I/O

  • F2FS

    • Introduce a batched trim

    • Support "norecovery" mount option, which is mostly the same as "disable_roll_forward". The only difference is that "norecovery" should be activated with the read-only mount option. This can be used when the user wants to check whether f2fs is mountable or not without any recovery process

    • Add F2FS_IOC_GETVERSION ioctl for getting i_generation from inode; after that, users can list a file's generation number by using "lsattr -v"

3. Block
  • Ported to blk-multiqueue

    • loop: Add blk-mq support, which greatly improves performance for sequential and random reads

    • Device-mapper

    • rbd

    • UBI

  • blk-multiqueue: Add support for tag allocation policies and make libata use this blk-mq tagging, instead of rolling its own

  • UBI: Implement UBI_METAONLY, a new open mode for UBI volumes, it indicates that only meta data is being changed

4. Core (various)
  • pstore: Add pmsg - user-space accessible pstore object

  • rcu: Optionally run grace-period kthreads at real-time priority. Recent testing has shown that under heavy load, running RCU's grace-period kthreads at real-time priority can improve performance and reduce the incidence of RCU CPU stall warnings

  • GDB scripts for debugging the kernel. If you load vmlinux into gdb with the option enabled, the helper scripts will be automatically imported by gdb as well, and additional functions are available to analyze a Linux kernel instance. See Documentation/gdb-kernel-debugging.txt for further details

  • Remove CONFIG_INIT_FALLBACK

5. Memory management
  • cgroups: Per memory cgroup slab shrinkers

  • slub: optimize memory alloc/free fastpath by removing preemption on/off

  • Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can detect zero_page in /proc/kpageflags, and then do memory analysis more accurately

  • Make /dev/mem an optional device

  • Add support for resetting peak RSS, which can be retrieved from the VmHWM field in /proc/pid/status, by writing "5" to /proc/pid/clear_refs

  • Show page size in /proc/<pid>/numa_maps as "kernelpagesize_kB" field to help identify the size of pages that are backing memory areas mapped by a given task. This is especially useful to help differentiate between HUGE and GIGANTIC page backed VMAs

  • geneve: Add Geneve GRO support

  • zsmalloc: add statistics support

  • Incorporate read-only pages into transparent huge pages

  • memcontrol cgroup: Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. The old interface will be maintained, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually

  • Replace remap_file_pages() syscall with emulation
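
The peak-RSS reset listed above (writing "5" to clear_refs and reading VmHWM back from status) can be sketched as follows. The helper names and the `proc_root` parameter are hypothetical; the file names and the value "5" come from the list item itself, and the parser assumes the usual "VmHWM: <number> kB" line format.

```python
import re


def parse_vm_hwm_kb(status_text):
    """Extract the VmHWM value (peak RSS, in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmHWM:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def reset_peak_rss(pid="self", proc_root="/proc"):
    """Reset the peak RSS counter by writing "5" to clear_refs."""
    with open("%s/%s/clear_refs" % (proc_root, pid), "w") as f:
        f.write("5")


def peak_rss_kb(pid="self", proc_root="/proc"):
    """Read the current peak RSS of a process in kB."""
    with open("%s/%s/status" % (proc_root, pid)) as f:
        return parse_vm_hwm_kb(f.read())
```

Calling `reset_peak_rss()` before a workload phase and `peak_rss_kb()` after it gives the peak for that phase alone rather than for the whole process lifetime.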

6. Virtualization
  • KVM: Add generic support for page modification logging, a new feature in Intel "Broadwell" Xeon CPUs that speeds up dirty page tracking

  • vfio: Add device request interface indicating that the device should be released

  • vmxnet3: Make Rx ring 2 size configurable by adjusting rx-jumbo parameter of ethtool -G

  • virtio_net: add software timestamp support

  • virtio_pci: modern driver, add an option to disable the legacy driver

7. Cryptography
  • aesni: Add support for 192 & 256 bit keys to AES-NI RFC4106

  • algif_rng: add random number generator support

  • octeon: add MD5 module

  • qat: add support for CBC(AES) ablkcipher

8. Security
  • SELinux: Add security hooks to the Android Binder that enable security modules such as SELinux to implement controls over Binder IPC. The security hooks include support for controlling what process can become the Binder context manager, invoke a binder transaction/IPC to another process, transfer a binder reference to another process, or transfer an open file to another process. These hooks have been included in the Android kernel trees since Android 4.3.

  • SMACK: secmark support for netfilter.

  • TPM 2.0 support.

  • Device class for TPM, sysfs files are moved from /sys/class/misc/tpmX/device/ to /sys/class/tpm/tpmX/device/.

9. Tracing & perf
  • perf mem: Enable sampling loads and stores simultaneously. Previously it could only do one or the other, even though no hardware restriction prevented simultaneous collection

  • perf tools: Support parameterized and symbolic events

  • AMD range breakpoints support: breakpoints are extended to support address ranges through perf event, with initial backend support for AMD extended breakpoints. For example, to set a write breakpoint from 0x1000 to 0x1200 (0x1000 + 512): perf record -e mem:0x1000/512:w

10. Networking
  • TCP: Add the possibility to define a per route/destination congestion control algorithm. This opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics

  • Mitigate TCP "ACK loop" DoS scenarios by rate-limiting outgoing duplicate ACKs sent in response to incoming "out of window" segments.

  • udpv6: Add lockless sendmsg() support, thus allowing multiple threads to send to a single socket more efficiently

  • ipv4: Automatically bring up DSA master network devices, which allows DSA slave network devices to be used as valid interfaces for e.g. NFS root booting by allowing kernel IP auto-configuration to succeed on these interfaces

  • ipv6: Add sysctl entry (accept_ra_mtu) to disable MTU updates from router advertisements

  • vxlan: Implement support for the Group Policy extension to provide a lightweight and simple security label mechanism across network peers based on VXLAN. It allows further mapping to a SELinux context using SECMARK, to implement ACLs directly with nftables, iptables, OVS, tc, etc

  • vxlan: Add support for remote checksum offload in VXLAN.

  • net: openvswitch: Support masked set actions.

  • Infiniband: Add support for extensible query device capabilities verb to allow adding new features

  • Layer 2 Tunneling Protocol (l2tp): multicast notification to the registered listeners when the tunnels/sessions are created/modified/deleted

  • SUNRPC: Set SO_REUSEPORT socket option for TCP connections to bind multiple TCP connections to the same source address+port combination

  • tipc: involve namespace infrastructure

  • 802.15.4: introduce support for cca settings

  • Wireless

    • Add new GCMP, GCMP-256, CCMP-256, BIP-GMAC-128, BIP-GMAC-256, and BIP-CMAC-256 cipher suites. These new cipher suites were defined in IEEE Std 802.11ac-2013

    • New NL80211_ATTR_NETNS_FD, which allows setting the namespace via nl80211 by fd

    • Support per-TID station statistics

    • Allow including station info in delete event

    • Allow usermode to query wiphy specific regdom

  • bridge

    • offload bridge port attributes to switch ASIC if feature flag set

    • Support for allowing userspace to pack multiple vlans and VLAN ranges in setlink and dellink requests for improved performance

    • Add ability to enable TSO

  • Near Field Communication (NFC)

    • HCI over NCI protocol support (Some secure elements only understand HCI and thus we need to send them HCI frames)

    • NCI NFCEE (NFC Execution Environment, typically an embedded or external secure element) discovery and enabling/disabling support

    • NFC_EVT_TRANSACTION userspace API addition, it is sent through netlink in order for a specific application running on a secure element to notify userspace of an event

    • Tx timestamps are looped onto the error queue on top of an skb. This mechanism leaks packet headers to processes unless the no-payload option SOF_TIMESTAMPING_OPT_TSONLY is set. A new sysctl (tstamp_allow_data) optionally drops looped timestamps with data. This only affects processes without CAP_NET_RAW

  • Bluetooth

    • Enable LE Data Length Extension feature from Bluetooth 4.2 specification

    • Expose information in debugfs: Secure Simple Pairing, debug keys usage setting, hardware error code, remote OOB information

    • HCI Read Stored Link Keys

    • HCI Delete Stored Link Key

    • Support static address when BR/EDR has been disabled

  • tc: add BPF-based action. This action provides a possibility to execute custom BPF code

  • net: sched: Introduce connmark action

  • Add Transparent Ethernet Bridging GRO support

  • netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature flag for switch device offloads

  • netfilter: nft_compat: add ebtables support

    • network namespace: Add rtnl cmd to add and get peer netns ids. A user can define an id for a peer netns by providing a FD or a PID. These ids are local to the netns where they are added (i.e. valid only within this netns)

  • openvswitch: Add support for checksums on UDP tunnels.

  • openvswitch: Support VXLAN Group Policy extension


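Answerer (an IT person working on TI):
The individual 2.6 releases also differ a great deal from one another.

Answerer (Coder~~):
Probably namespaces and cgroups; Docker's implementation is built on these.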