Interview: Jens Axboe

January 30, 2007 - 5:19am

Jens Axboe has been involved with Linux since 1993. 30 years old, he lives in Copenhagen, Denmark, and works as a Linux Kernel developer for Oracle. His block layer rewrite launched the 2.5 kernel development branch, a layer he continues to maintain and improve. Interested in most anything dealing with IO, he has introduced several new IO schedulers to the kernel, including the default CFQ, or Complete Fair Queuing scheduler.

In this interview, Jens talks about how he got interested in Linux, how he became the maintainer of the block layer and other block devices, and what's involved in being a maintainer. He describes his work on IO schedulers, offering an in-depth look at the design and current status of the CFQ scheduler, including a peek at what's in store for the future. He conveys his excitement about the new splice IO model, explaining how it came about and how it works. And he discusses the current 2.6 kernel development process, the impact of git, and why the GPL is important to him.


Background:
Jeremy Andrews: Please share a little about yourself and your background.

Jens Axboe: I live in Copenhagen, Denmark, with my wife and 2 year old son. I am 30 years old. I was recently hired by Oracle, who graciously allow me to spend all of my time on Linux kernel hacking. Before that I worked at SUSE Labs. My primary office is at home, though I do work from the local company office on occasion. I studied CS at the University of Copenhagen for a period, but never finished. I blame Linux for that!

JA: Does Oracle guide what features in the Linux kernel you work on?

Jens Axboe: Generally, no. It just so happens that the things that really interest me are also things that have great relevance to running Oracle on Linux. That, of course, influenced my decision on where to work - I think it's quite important to keep as much of the decision making out in the open as possible; it's how Linux got to be as great as it is. Outside of that, I should mention that Oracle has a really good policy for all the people on the kernel team: they all work on the Linus tree and not some hidden away Oracle-private kernel package.

JA: When did you get started with Linux?

Jens Axboe: I first learned of Linux through an article in a Danish paper, pitting Linux and OS/2 against each other. The paper was a sort of "Engineering Weekly" that my father received each Friday, and I usually leafed through it for interesting articles and the math challenges they had on the back. Back then I was programming C for fun with Turbo C in DOS and dealing with the memory segmentation issues, and Linux was immediately appealing to me, as a true 32-bit operating system was just a much cleaner development environment. But this was in 1993, and the only way to obtain Linux easily was over a Swedish BBS service. Needless to say, my father didn't find the thought of paying for days of telephone calls to Sweden appealing (I did it anyway, though), and I eventually found a second hand Yggdrasil CD locally and went with that. After a few weeks of experimentation, I eventually got it installed. After that, there was no way back.

JA: Do you remember what version the kernel was at when you first got it installed?

Jens Axboe: Roughly, it was version 0.99pl13, or some patch level close to that. The reason it took so long to get installed was that it kept panicking on me on startup because it didn't recognize my CD-ROM. So I first had to learn what a kernel was, what a panic was, and why this kernel thing kept getting panicky on me.

JA: What did you do with Linux back in those early days?

Jens Axboe: Programming, mainly. It was also my first exposure to a UNIX like system, so I spent quite a lot of time getting acquainted with that. It was a whole new dimension of computing to me. I remember cutting days of school just to get DOSEMU working, for instance :-). Or staying up the whole night downloading and building a new X system to get support for my ATI graphics card. I think the primary appeal to Linux was that everything was so transparent, there were really no limits to what you could do with the system. If Windows 3.x crashed on you, diagnosing and finding that problem was really hard. With Linux, everything was right there for you to look at.

Block Layer:
JA: You are listed as the maintainer of the Linux Kernel block layer. What is the block layer, and what is involved in maintaining it?

Jens Axboe: The block layer is the piece of software that sits between the block device drivers (managing your hard drives, cdroms, etc) and the file systems. It's a fairly broad term for a wide range of functionality and services. While it's both a layer between IO producer and consumer, it also offers a host of helper functionality to drivers helping to manage queuing and resources. It also includes vital functionality such as IO scheduling.

During the 2.5 development cycle I rewrote basically all of the block layer. The block layer had long been an object of ridicule in the Linux kernel at that point. It didn't scale well to more than a few CPUs, it didn't support IO to highmem pages, and IO queuing was limited to a single page at a time. The rewrite fixed all of these issues, and I became the block layer maintainer along the way. The block layer rewrite opened the 2.5 development cycle by being merged in 2.5.0-pre1.

Maintaining the block layer takes some effort. More and more functionality has been moving from drivers (mainly the SCSI subsystem) into the block layer as core functionality, greatly expanding the features available to other drivers along the way. The goal is to shrink the size and complexity of block device drivers and make complex drivers easier to write, in order to (hopefully) eliminate bugs in that area - less complex drivers are less buggy drivers. The chance of a bug in the block layer is a lot smaller than in drivers, as drivers are often written by people with less experience in Linux.

JA: Who maintained the block layer before you took over during the 2.5 development cycle?

Jens Axboe: The block layer had no real maintainer. Linus wrote the initial ll_rw_blk.c file way back when, and various people had been adding hacks here and there to do what they wanted. So it was all pretty messy and lacked a real design.

JA: How do advanced new file-systems such as Reiser4 or ZFS affect the block layer?

Jens Axboe: They don't really affect the block layer. The more advanced file systems (not sure about reiser4, and ZFS doesn't really exist for Linux yet) like XFS do take advantage of the large IO support, so they can queue a big chunk of data with the block layer and IO scheduler in a single IO unit. This reduces lock contention and runtime in the IO scheduler itself. SGI did a talk on page cache scalability at the 2006 OLS conference, and they didn't hit any block layer or SCSI layer problems even doing 10GiB/sec IO. So I'd say we are in pretty good shape in that area!

JA: You recently posted a series of patches to make the block layer use explicit plugging instead of implicit plugging. What is plugging, and how do these two types of plugging differ?

Jens Axboe: The basic design was done by Nick Piggin, who used to do IO scheduling work as well, but mainly does vm stuff these days. He'd like to clean up the ->sync_page() operation in the address_space structure, since it's a bit of a hack on the vm side to force unplugging of a device when a page is needed. Moving control of the plugging to the issuer of the IO gets rid of that hack, since the actual plugging and unplugging is then done explicitly by the caller. The current design is a little tricky, since the queue is plugged behind the back of the IO issuer, but it requires the issuer to explicitly unplug the device if he needs to wait for some of that IO to complete. That has been the source of some IO stall bugs in the past.

Plugging itself is a mechanism to slightly defer starting an IO stream until we have some work for the device to do. You can compare it to inserting the plug in the bathtub before you fill it with water, the water will not flow out of the tub until you remove the plug again. Plugging helps both the IO scheduler do request merging (thus reducing the total number of IO's we will send to the device), and also helps build a few requests before handing them to the device. The latter increases performance in devices that have an internal queue depth, like SCSI drives (with TCQ) or newer SATA drives with NCQ.

The new explicit plugging scheme plugs the process instead of the queue. So the plug state has moved from the device to the process doing IO, which brings another nice little optimization - you can now queue IO completely lockless, since the state resides in the process. Only when you unplug will the queued requests be sent to the device, so you reduce the number of lock/unlock operations on the device considerably. So far we've yet to demonstrate a big win by doing this, but I'd be happy enough if it doesn't actually cause any performance regressions.

JA: There was recently a long discussion on the LKML regarding the use of the O_DIRECT flag when opening files. What are your views on O_DIRECT's implementation and use cases?

Jens Axboe: Personally I like O_DIRECT, I think it's a nice way to do fast and non-cache polluting IO. The problem is the actual implementation. Basically it's a completely separate path from regular page cache buffered IO, which is never a good thing. Only when it reaches the block layer are we again dealing with the same units (a bio structure). So O_DIRECT tries to be faster by not doing page cache lookups for IO, and by mapping the user data into the kernel to avoid copying data around. The latter is clearly a good idea, I'm not sure there's much merit to the former. I don't think the implementation has to be as bad as it is, so hopefully someone will get it cleaned up and sanitized eventually. Who knows, perhaps splice can even help with that :-)
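
To make the O_DIRECT discussion above a bit more concrete, here is a minimal C sketch of a direct read. The file name "datafile" and the 4096-byte alignment are illustrative assumptions; the alignment actually required depends on the device and file system.

  /* Minimal O_DIRECT read: bypass the page cache and read straight into
   * a suitably aligned user buffer. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("datafile", O_RDONLY | O_DIRECT);
      if (fd < 0) {
          perror("open");
          return 1;
      }

      void *buf;
      size_t len = 4096;
      if (posix_memalign(&buf, 4096, len) != 0) {  /* aligned buffer required */
          fprintf(stderr, "posix_memalign failed\n");
          return 1;
      }

      ssize_t ret = read(fd, buf, len);            /* no page cache involved */
      if (ret < 0)
          perror("read");
      else
          printf("read %zd bytes without polluting the page cache\n", ret);

      free(buf);
      close(fd);
      return 0;
  }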

JA: You are also listed as the maintainer of the IDE/ATAPI CD-ROM driver, the SCSI CD-ROM driver, and the uniform CD-ROM driver. How did you become the maintainer of these drivers, and how much effort is involved?

Jens Axboe: Being CD-ROM maintainer was my first real Linux job. It came about because the previous maintainer had to step down for personal reasons, and he was looking for volunteers to take over. For me it was very much a learning experience. There's no better way to learn than just diving in, so I expanded the uniform CD-ROM layer (cdrom.c, the shared functionality between the various CD-ROM drivers) and added support for things like DVD, Mt Rainier, and DVD-RAM. Maintaining ATAPI CD-ROM is tricky - it's a generic ATAPI driver that adheres to the specifications, but there are so many vendors out there trying to make ATAPI hardware that even if a few of them have specification deviations, it quickly makes life really difficult. I recently passed on ATAPI CD-ROM maintainership to Alan Cox, as he's been doing lots of parallel ATA work both recently and in the past. These days I prefer to focus on different areas.

Schedulers:
JA: In September of 2002 you posted a patch for a "deadline" I/O scheduler to the Linux Kernel Mailing List which was quickly merged into the stable Linux kernel. How did the deadline scheduler differ from the previous scheduler?

Jens Axboe: The original 2.4 IO scheduler was a plain elevator, which has obvious starvation issues. After bouncing a few ideas back and forth, the "elevator linus" was born. It was based on ideas by Linus, hence I named it after him. "elevator linus" was still a basic elevator at heart, but it added starvation measures to prevent the worst case situations. The main issue with "elevator linus" was that it was hard to tune appropriately, so I designed the deadline IO scheduler to counter that. The deadline IO scheduler is a CSCAN based design, but with a double FIFO list. deadline assigns a time deadline to each queued request (hence the name), putting them on a FIFO list in expiry order. When a request expires, CSCAN delivery is restarted from the expiration point and continues in a batch from there until the situation occurs again. deadline serves requests in batches even if more expire, to avoid seek storms on a busy drive.

The major new introduction with deadline is a clean time based tuning. It's much more intuitive to the user to be dealing with a time unit, than some request segment numbers that a) don't make any sense to a user, and b) vary greatly depending on the target drive. Even with the more advanced schedulers available today, deadline is still a really good choice if individual request latency is the primary concern.
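
To illustrate the mechanics described above, here is a small toy model in C of the deadline dispatch decision. It is a deliberate simplification (no request merging, no batching of several requests after an expiry, made-up sector numbers and deadlines), but it shows the interplay between the CSCAN sweep and the FIFO expiry check:

  #include <stdbool.h>
  #include <stdio.h>

  struct request {
      long sector;    /* position on disk */
      long deadline;  /* expiry time (jiffies in the real scheduler) */
  };

  /* Pick the next request: the oldest expired request wins (FIFO check),
   * otherwise continue the one-way CSCAN sweep from the head position. */
  static int pick_next(const struct request *reqs, int nr, const bool *done,
                       long head, long now)
  {
      int expired = -1, best = -1;

      for (int i = 0; i < nr; i++) {
          if (done[i])
              continue;
          if (reqs[i].deadline <= now &&
              (expired < 0 || reqs[i].deadline < reqs[expired].deadline))
              expired = i;
          if (reqs[i].sector >= head &&
              (best < 0 || reqs[i].sector < reqs[best].sector))
              best = i;
      }
      if (expired >= 0)
          return expired;            /* restart the sweep at the expired request */
      if (best >= 0)
          return best;               /* keep sweeping toward higher sectors */
      for (int i = 0; i < nr; i++)   /* wrap around to the lowest sector */
          if (!done[i] && (best < 0 || reqs[i].sector < reqs[best].sector))
              best = i;
      return best;
  }

  int main(void)
  {
      struct request reqs[] = {
          { 100, 50 }, { 900, 60 }, { 120, 10 }, { 400, 70 },
      };
      bool done[4] = { false, false, false, false };
      long head = 0, now = 0;

      for (int served = 0; served < 4; served++, now += 20) {
          int i = pick_next(reqs, 4, done, head, now);
          printf("t=%2ld: dispatch sector %3ld (deadline %ld)\n",
                 now, reqs[i].sector, reqs[i].deadline);
          done[i] = true;
          head = reqs[i].sector;
      }
      return 0;
  }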

JA: Can you explain more about the original 2.4 IO scheduler? What is an "elevator", and what are the associated starvation issues?

Jens Axboe: An elevator is a classical IO scheduling algorithm. The name reflects how it works: it provides disk service in both directions (an elevator going up and down). So the 2.4 IO scheduler was exactly like that, a simple list of pending work and a few lines of code to handle where to insert new work.

The main starvation issue with strict elevators is that a request that is behind the head (in terms of the moving direction of the elevator) can get starved for a really long time if another process is doing streaming IO on some location before that. The classic IO scheduling algorithms you find in OS text books aren't really useful in a modern system.

JA: What got you interested in improving the scheduler logic?

Jens Axboe: Scheduling algorithms fascinate me in general, but I think it was obvious deficiencies in the original algorithm that really made me want to dive in and see if I could improve it. Coming from the CDROM code, I was also very interested in finding out what happened above the block device drivers. Finding something to fix is really the best motivation for getting your hands dirty.

JA: Thanks to another of your patches, it is possible to choose from several schedulers when booting the Linux kernel. How many schedulers are currently included in the mainline kernel, how do they differ, and why might you choose one over the other?

Jens Axboe: You can in fact change the IO scheduler in a running system now, while the drive is busy. So it's not just a global setting anymore, you can experiment with each of them without rebooting. The IO schedulers can even be configured as loadable kernel modules.

There are 4 schedulers included in the kernel. The most basic is "noop", which doesn't do anything. It just queues requests and hands them out in FIFO order. Its primary application is really intelligent hardware with deep queue depths, where even basic request reordering doesn't make any sense. The next is deadline, which I already described. Its main application is latency critical applications, where throughput is a secondary concern. Then there are two "intelligent" schedulers, "anticipatory" and "cfq". "anticipatory" is based on deadline, but adds IO anticipation to the mix. The basic concept is to allow dependent serialized reads to be processed quickly, by allowing the drive to idle briefly if we expect a nearby request from the same process shortly. So while "anticipatory" doesn't attempt to handle process fairness, it was the first to provide good disk throughput for shared IO workloads.

The noop IO scheduler should only be used for specialized hardware, and deadline has a limited scope as well. I would say that going with the default scheduler is by far the best choice for most applications, but if you have IO intensive applications, experimenting with the different schedulers and their tunables might prove beneficial.
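
For readers who want to experiment along those lines, the active scheduler for a drive is exposed in sysfs. Below is a minimal C sketch that reads and then switches the scheduler for a drive assumed to be named sda; the same thing is usually done from a shell with cat and echo on /sys/block/sda/queue/scheduler, and it requires root.

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      const char *path = "/sys/block/sda/queue/scheduler";
      char buf[256];

      /* Show the available schedulers; the active one is shown in brackets. */
      int fd = open(path, O_RDONLY);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      ssize_t n = read(fd, buf, sizeof(buf) - 1);
      if (n > 0) {
          buf[n] = '\0';
          printf("before: %s", buf);
      }
      close(fd);

      /* Ask the kernel to switch to deadline; the queue is drained and the
       * new scheduler takes over, as described in the interview. */
      fd = open(path, O_WRONLY);
      if (fd < 0 || write(fd, "deadline\n", 9) < 0) {
          perror("switch to deadline");
          return 1;
      }
      close(fd);
      return 0;
  }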

JA: Are there any risks involved with changing the active IO scheduler on a running system?

Jens Axboe: If there is, that would be a bug. When you change the IO scheduler on a running system, the block device queue is quiesced. The currently pending requests are drained while we don't allow new ones to enter the queue, and as soon as the queue has emptied, we set up the new scheduler structures and switch everything over. The code is a bit tricky and there have been some bugs in the earlier days, but it should be stable now.

CFQ Scheduler:
JA: In February of 2003 you posted two new I/O schedulers. The first was the Stochastic Fair Queuing (SFQ) scheduler. The second was the Complete Fair Queuing (CFQ) scheduler. What prompted you to try and improve upon your earlier deadline scheduler?

Jens Axboe: While deadline worked great from a latency and hard drive perspective, it had no concept of individual process fairness. SFQ is originally a network queuing algorithm, but the main concept applied equally well to IO queuing disciplines. The core design is dividing a flow of packets into a fixed number of buckets, based on the hash of the source. So if you have N number of buckets, the algorithm should be fair to N number of processes (barring hash collisions, hence the occasional hash function change). So the SFQ IO scheduler divides incoming process IO into a fixed number of buckets, and does request dispatch in a round robin fashion from these buckets.

JA: Your CFQ scheduler was merged into the 2.6.6 mainline Linux kernel in April of 2004. Can you describe the design of the CFQ scheduler, what makes it "completely fair", and why it replaced the SFQ scheduler?

Jens Axboe: The primary difference between SFQ and CFQ is that CFQ always has a per-process bucket. Hence it gets rid of the fixed number of buckets and source hashing, removing the potential hash collision issue there. So I needed a new acronym to differentiate it from SFQ, and I came up with CFQ to indicate the complete fairness of this variant of SFQ. If I recall correctly, there was only a day or two between the initial posting of SFQ and the later CFQ version, so there really wasn't that much difference between the two.

JA: What is the current status of the CFQ scheduler?

Jens Axboe: CFQ has undergone 3 revisions so far. The third incarnation (which was merged for the 2.6.13 kernel) shares almost nothing with the original version except that they both try to provide fairness across processes. CFQ now uses a time slice concept for disk sharing, similar to what the process scheduler does.

Classic work conserving IO schedulers tend to perform really poorly for shared workloads. A good example of that is trying to edit a file while some other process(es) are doing write back of dirty data. Reading even a small file often requires a series of dependent IO requests - reading file system meta data, looking up the file location, and finally reading in the file. Each read request is serialized, which means that a work conserving scheduler will immediately move on to sending more writes to the device after each consecutive read completes. Even with a fairly small latency of a few seconds between each read, getting at the file you wish to edit can take tens of seconds. On an unloaded system, the same operation would take perhaps 100 milliseconds at most. By allowing a process priority access to the disk for small slices of time, that same operation will often complete in a few hundred milliseconds instead.

A different example is having two or more processes reading file data. A work conserving scheduler will seek back and forth between the processes continually, reducing a sequential workload to a completely seek bound workload. The result is that, e.g., deadline would give you an aggregate throughput of perhaps 1MiB/sec, while CFQ will run at 95-98% of the disk potential, often doing 30-50MiB/sec depending on the drive characteristics.

CFQ is stable, but undergoing gradual changes to tweak performance on more complex hardware. It has also grown more advanced features such as IO priority support, allowing a user to define the IO priority of a process with ionice, similar to CPU scheduling. CFQ has been the default IO scheduler for the Linux kernel since 2.6.18, and distributions such as SUSE and Red Hat have defaulted to CFQ for much longer than that.

JA: Are there any known outstanding bugs with the CFQ scheduler?

Jens Axboe: Not that I'm aware of, it's perfectly bug free. Well maybe not, but at least I'm not aware of any pending issues I need to resolve :-)

JA: What types of performance improvements have you made recently to the CFQ scheduler?

Jens Axboe: Recently it's mainly been tweaks rather than larger changes. The current design is quite sane and stabilized and performs well for a large number of workloads and hardware. So lately it might be things like looking into why one of the other IO schedulers is doing a little bit better for some workload/hardware combination, finding out why, and getting that fixed in CFQ. It's quite hard to characterize a generic model of how a disk behaves, since the disk might not even be a disk - it could be a large array hidden behind an intelligent IO controller. And that behaves quite differently from a normal disk and may need more work in one area, and less in another.

JA: What future performance improvements do you have planned?

Jens Axboe: One larger improvement I have planned is doing a full analysis of how CFQ interacts with devices that do command queuing and making sure that CFQ takes the best advantage of that. I've already done a bit of work in this area, since command queuing is becoming a commodity even on the desktop these days with SATA-II hard drives. More still needs to be done, however.

I'd also like to put some more thought into handling interactive versus non-interactive tasks. And if you have two or more tasks working on the same area of the disk, CFQ could help them cooperate better. So there's definitely still work to be done!

JA: How might the scheduler help multiple tasks working in the same area of the disk cooperate better?

Jens Axboe: The primary reason that the anticipatory and CFQ IO schedulers do so well on tricky workloads is that they recognize the fact that synchronous dependent IO requests that are close on disk need a little help to perform well. If you have multiple tasks working in the same area on disk, it would make sense to let them cooperate on when they get disk time. A silly example of that is doing something like:

  $ find . -type f -exec grep somepattern '{}' \;

where you continually fork a new grep, which will do IO very close to the grep that just exited. Currently CFQ doesn't do so well for that, since when the grep process exits it won't ever idle the queue.

JA: Can you explain a little more about using ionice, and how the CFQ scheduler is able to support processes in different scheduling classes?

Jens Axboe: CFQ is really a hierarchy of queues, where each process has a link to a private queue and an async queue. The async queue is shared between all processes running at the same priority level, and handles things like dirty data write back. The private queue handles synchronous requests for the process. When a process is due for disk service, it gets assigned a time slice in which it has priority access to the disk. This time slice varies in length depending on the priority of the process owning the queue, and higher priority queues also get more frequent service than their lower equivalents.

So the above explains the priority levels within a class. Now, scheduling classes affect the way CFQ selects which process to service as well. Basically there are 3 classes available - idle, best-effort, and real time. The idle class only gets service when nobody else needs the drive and hasn't used it for a while. It's meant for background jobs where you don't really care how long they take to complete, as long as they complete eventually. Think jobs forked from cron and that type of thing. The best-effort class is the default class. Queues from that class are served in a round robin fashion, with the priority levels being serviced as explained above. The real time class works like the best-effort class, except that it gets priority access whenever it has work to do. The above is a high level overview of how CFQ handles priorities, so it's somewhat simplified. I'd invite anyone interested in learning more about this or even contributing to read the source.
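
As a rough illustration of what the ionice tool does with these classes, the sketch below calls the underlying ioprio_set() system call directly via syscall(), since glibc has traditionally not shipped a wrapper. The constants are copied to mirror the kernel's include/linux/ioprio.h and should be treated as an assumption rather than a public header.

  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* These mirror include/linux/ioprio.h; treat them as an assumption. */
  #define IOPRIO_WHO_PROCESS  1
  #define IOPRIO_CLASS_BE     2   /* best-effort, the default class */
  #define IOPRIO_CLASS_IDLE   3   /* only served when the disk is otherwise idle */
  #define IOPRIO_CLASS_SHIFT  13
  #define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))

  int main(void)
  {
      /* Put the calling process in the best-effort class at priority 7,
       * the lowest of the 8 levels (0 is highest); equivalent in spirit
       * to running "ionice -c2 -n7" on the current process. */
      if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                  IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7)) < 0) {
          perror("ioprio_set");
          return 1;
      }
      printf("now running as best-effort, priority 7\n");
      return 0;
  }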

JA: What sorts of tunables does the CFQ scheduler have?

Jens Axboe: It has a few tunables to control the length of time slices, set the expiration time of new requests in the FIFO, and modify the one-way disk scan slightly by allowing short backwards seeks (at a penalty). Generally there aren't that many of them, and tuning should not be required to get good performance; I'd consider it a bug if that were the case. That said, someone may wish to, e.g., give writes a higher priority than they currently have, or have other special requirements that don't apply in general, and they should have the means to tweak CFQ to get that behavior.

The tunables for the IO scheduler attached to a device can be found in sysfs. If you wish to muck around with the settings for sda for instance, you would look in /sys/block/sda/queue/iosched/.
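
Here is a small C sketch along those lines, which enumerates whatever tunables the active scheduler exposes for a drive assumed to be named sda and prints their current values; it is just a programmatic version of listing and cat-ing the files in that directory.

  #include <dirent.h>
  #include <stdio.h>

  int main(void)
  {
      const char *dirpath = "/sys/block/sda/queue/iosched";
      DIR *dir = opendir(dirpath);
      if (!dir) {
          perror("opendir");
          return 1;
      }

      struct dirent *de;
      while ((de = readdir(dir)) != NULL) {
          char path[512], value[128];

          if (de->d_name[0] == '.')      /* skip "." and ".." */
              continue;
          snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
          FILE *f = fopen(path, "r");
          if (!f)
              continue;
          if (fgets(value, sizeof(value), f))
              printf("%-20s %s", de->d_name, value);  /* value ends in '\n' */
          fclose(f);
      }
      closedir(dir);
      return 0;
  }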

Laptop Mode:
JA: Shortly after posting your original CFQ scheduler patch, you posted a patch introducing the "laptop_mode" sysctl that has also since been merged into the mainline Linux kernel. How does laptop mode improve battery life on laptops?

Jens Axboe: Laptop mode tries to keep the hard drive idle for as long as possible, increasing the amount of time you can keep the drive spun down. The standard Linux kernel writes out dirty data at fixed 5 second intervals, making spinning the drive down and saving power impossible. Laptop mode tries to defer this background write back to occasions when we need to spin the drive up anyway - normally because we need to read some data for the user. After these reads have completed, the existing dirty data is written back immediately and the drive can be spun down again.

Splice:
JA: What other kernel projects have you focused on?

Jens Axboe: Basically anything that relates to IO is interesting to me! The last major kernel project I got into was splice, a new IO model based on a paper by Larry McVoy. I had read the paper many years ago, and while the idea was innovative and appealing, I felt there was a piece missing to really tie it into the kernel model. Splice describes a way to allow applications to move data around inside the kernel, without copying it back and forth between the kernel and user space. Essentially, you splice together two ends and allow the data to travel between them. Linus provided the missing piece of the puzzle, by suggesting that the splice buffers be tied to pipes. Like most good ideas, it is directly obvious once you understand it! So once that was settled, I wrote the kernel implementation and the associated system calls. There's a system call (sys_splice) that splices data from a file descriptor to a pipe (or vice versa), a system call to duplicate the contents of one pipe to another (sys_tee), and a system call that maps a user buffer into a pipe (sys_vmsplice).
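
As a hedged example of what sys_splice looks like from user space, here is a minimal C sketch that moves a file (the name "input.dat" is made up) to standard output through a pipe, with the pipe acting as the in-kernel buffer. Run it with stdout redirected to a file or socket, since not every output file descriptor supports splicing.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("input.dat", O_RDONLY);
      if (fd < 0) {
          perror("open");
          return 1;
      }

      int p[2];
      if (pipe(p) < 0) {          /* the pipe is the in-kernel buffer */
          perror("pipe");
          return 1;
      }

      for (;;) {
          /* file -> pipe: page references are linked in, not copied */
          ssize_t n = splice(fd, NULL, p[1], NULL, 65536, SPLICE_F_MOVE);
          if (n <= 0)
              break;
          /* pipe -> stdout: drain exactly what we just queued */
          while (n > 0) {
              ssize_t m = splice(p[0], NULL, STDOUT_FILENO, NULL, n,
                                 SPLICE_F_MOVE);
              if (m <= 0)
                  return 1;
              n -= m;
          }
      }
      close(fd);
      return 0;
  }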

Splice has a host of applications. It can completely replace the bad hack that is sendfile(), which is an extremely limited zero copy interface for sending a file over the network. The neat thing about using pipes as the buffers is that you have a known interface to work with and a way to tie things together intuitively. A good and easy to understand example is a live TV setup, where you have a driver for your TV encoder (let's call that /dev/tvcapture) and a driver for your TV decoder (let's call that /dev/tvout). Say you want to watch live TV while storing the contents to a file for pausing or rewind purposes; you could describe that as easily as:

  $ splice-in /dev/tvcapture | splice-tee out.mpg | splice-out /dev/tvout

The first step will open /dev/tvcapture and splice that file descriptor to STDOUT. The second will duplicate the page references from the STDIN pipe, splicing the first to the output file and splicing the second to STDOUT. Finally, the last step will splice STDIN to a file descriptor for /dev/tvout. The data never needs to be copied around, we simply move page references around inside the kernel. It's like building with Lego blocks :-)

JA: So you're essentially allowing multiple simultaneous uses of the same file data?

Jens Axboe: File data or just data in general. splice allows you to move pages of data around inside the system, without doing any copying. Most of the time. So while you can splice file data to a pipe, you can also map user space data into a pipe using vmsplice. And then splice that pipe to a file, over the network, or whatever you would like to do.

JA: How does this differ from the userland utility, tee?

Jens Axboe: The result is the same, hence the syscall is named sys_tee to indicate that. How it works is quite different. While the userland tee copies data around, when you sys_tee one pipe to another, you are really just pointing the second pipe's buffer map at the contents of the first one while grabbing a reference on the pages in there. So they share the name and functionality (somewhat), but that's about it.

JA: Before splice existed, if you wanted to watch live TV while storing its contents to a file, how would it have worked?

Jens Axboe: You would first have to copy the data to userspace, then write that buffer to a file (causing the data to be copied back into the kernel, unless you were doing raw writes), and finally send the data to the output device driver (again, copying the data to the kernel). So you'd have to touch and copy the data at least two times, whereas with splice you don't have to do any copies.

JA: How is splice currently being used?

Jens Axboe: splice is still a relatively new concept, so it hasn't seen any widespread usage yet. I'm quite excited about the concept though, and I'm quite sure that people will start using it in interesting ways in the future. The networking parts of splice are still somewhat lacking and that is limiting people in what they can do with it, which is unfortunate given that networking applications are one of the areas where splice can really make a difference. Consider a typical case of a file serving application. If you need to send out some sort of header followed by real data, with the current API you'd need to first write out the header (copying it to the kernel), then either copy the file contents or use sendfile() to transmit it. With splice, you could vmsplice the header contents into a pipe, then splice the file data into that same pipe. Now it contains the stuff you want to send out; simply use splice to send the pipe's contents to a socket. So not only is splice more efficient for handling things like this, it's also a much nicer programming model. The latter is just as important.
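
A sketch of that header-plus-file pattern in C is shown below; sock_fd and file_fd are assumed to be an already-connected socket and an open file set up elsewhere, the HTTP-style header is made up, and error handling is trimmed to keep the example short.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <string.h>
  #include <sys/uio.h>
  #include <unistd.h>

  /* Queue a small header plus the contents of file_fd into a pipe and
   * splice it all out over sock_fd. The header pages stay referenced by
   * the pipe until it is drained, which happens before we return. */
  static int send_with_header(int sock_fd, int file_fd, size_t file_len)
  {
      int p[2];
      if (pipe(p) < 0)
          return -1;

      char header[] = "HTTP/1.0 200 OK\r\n\r\n";   /* made-up header */
      struct iovec iov = { .iov_base = header, .iov_len = strlen(header) };

      /* 1) map the header into the pipe without a read/write copy */
      ssize_t queued = vmsplice(p[1], &iov, 1, 0);
      if (queued < 0)
          return -1;

      size_t file_left = file_len;

      /* 2) alternate: push pipe contents to the socket, then link more
       *    file pages into the pipe, until everything has been sent */
      while (queued > 0) {
          ssize_t n = splice(p[0], NULL, sock_fd, NULL, queued,
                             file_left ? SPLICE_F_MORE : 0);
          if (n <= 0)
              return -1;
          queued -= n;

          if (queued == 0 && file_left > 0) {
              n = splice(file_fd, NULL, p[1], NULL, file_left, SPLICE_F_MORE);
              if (n < 0)
                  return -1;
              file_left -= (size_t)n;
              queued = n;
          }
      }
      close(p[0]);
      close(p[1]);
      return 0;
  }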

If you're curious about splice (and the related system calls), I would suggest cloning my splice tools git repository. It's just a small collection of sample usages of splice, along with some benchmark apps I did during development as well. The repository lives here:

  

Kernel Development Process:
JA: You've been involved with the Linux kernel for a long time. How have the personalities involved changed over the years?

Jens Axboe: The number of consistent major contributors has surely increased, probably mainly due to the fact that a much larger number of people are actually being paid to work on the Linux kernel. Back when I got into kernel development that wasn't the case. Not sure the personalities changed that much - we lost some people, gained some, and others just moved to different roles in the kernel community. If I were to make a generalization of some sort, I'd say that generally it's a more mature crowd these days.

JA: Does the kernel at any point become feature complete, or do you always see it evolving and improving?

Jens Axboe: It's been evolving and improving for many years now, and I don't see it slowing down. Quite the opposite. So I'm sure that there will still be innovation going on 5 years from now. Sometimes this is driven by evolving hardware, for instance. And there's always room for improvement.

JA: How has the usage of git to manage the Linux source code changed the development process for you?

Jens Axboe: Not by a huge amount, really. The biggest change was going from a pure patch based management approach to BitKeeper some years ago. That really helped not only development of new features, it also made it a lot easier to track down regressions by browsing through actual changesets instead of individual file level diffs. With that said, I do prefer working with git over BK. These days I have my various development branches inside the same git repository and it works really well. While I could have done something similar with BK, git is just a lot more usable to me. So while I do keep patchsets in git branches now and ask Linus to pull them instead of sending a bunch of mails with patches, I could have done the same with BK as well.

JA: Do you think we'll ever see a 2.7 or 3.0 Linux kernel?

Jens Axboe: It's not likely. Linus has essentially said that a 2.7 kernel will not happen unless someone pries control of the kernel out of his hands. And I don't see that happening anytime soon. The new development model works really well, in my opinion. So I don't see a need for a 2.7 kernel either. We are getting things done at a much faster rate than before, there's just no comparison.

JA: In your opinion, with the increased rate of development happening on the 2.6 kernel, has it remained stable and reliable?

Jens Axboe: I think so. With the new development model, we have essentially pushed a good part of the serious stabilization work to the distros. Back when I was working for SUSE, we spent a lot of time testing, benchmarking, and stabilizing a given 2.6 kernel for a new release. Then those patches got pushed back upstream and ended up in the Linus kernel anyway. In the 2.4 days we did the same thing, but the 2.4 distro trees were so huge and different from both each other and the mainline tree that it was hard to share the work we did there. So I think 2.6 has improved this quite a bit. With the 2.4 kernels we had to keep everything in vendor trees because it was impossible to get everything we wanted pushed into the conservative 2.4 branch. So while we do push more development work into the 2.6 kernels today, we also share a lot of the fixing work between us. I think the two offset each other quite nicely, and we get the benefit of much more unified 2.6 branches and the features we want.

JA: How important is the GPL to you when you contribute code to the Linux kernel?

Jens Axboe: I'm not really a political person by nature, but the GPL is important to me. The reason I got into open source in the first place was because of the knowledge sharing. I like giving away my ideas and code for free, but I'd also prefer if others do the same. Especially if they base projects or include some of my code, then I really don't want to allow them to run away with that and never give anything back. So the GPL appeals to me as a license, much more so than the BSD class of licenses.

JA: Speaking of the BSD license, have you looked at any of the BSD project's IO code and compared it with what is done in Linux?

Jens Axboe: No, I have not. I did notice at some point that we seem to share, e.g., the 'bio' name for a block IO unit, but apart from that I don't know what their code looks like. I do know that the BSDs don't have any advanced IO scheduler implementation, and that they based their IO units on virtual addresses and lengths, whereas Linux is completely scatter-gather based. Apart from that, I'm completely BSD ignorant.

JA: What advice would you offer to readers who are interested in getting involved in the development of the Linux Kernel?

Jens Axboe: Follow the Linux kernel mailing list and find some project that is interesting to you personally. It's important to have that personal itch to scratch, otherwise it can be hard to stay motivated. Getting into kernel development has a pretty steep learning curve, so you need all the motivation you can get. Then just dive in and try to fix or solve that problem. Arm yourself with your favorite editor and cscope for source code browsing as a help, and you are all set to become the next valued contributor!

JA: How do you spend your time when you're not hacking on the Linux kernel?

Jens Axboe: These days I'm a family man, so I like to spend as much time as I can with my son and my wife. If going out, I enjoy going to concerts. Or just enjoying some quiet leisure time with friends.

JA: Is there anything else you'd like to add?

Jens Axboe: Share the code!

JA: Thank you for all your time answering these questions, and for all your work on the Linux kernel!

Jens Axboe: You're very welcome, my pleasure.

