What exactly does the weird Linux kernel container_of()
macro do? This macro is defined as:
#define container_of(ptr, type, member) ({ \
const typeof( ((type *)0)->member ) *__mptr = (ptr);
(type *)( (char *)__mptr - offsetof(type,member) );})
To help explain pointer mess, let us go through the steps of what it is doing. In the include/linux/i2c.h
file, the to_i2c_driver()
macro is defined as:
container_of(d, struct i2c_driver, driver)
and is used in code as:
i2c_drv = to_i2c_driver(drv);
where dev
is a pointer to a struct device_driver
.
Replacing the above code with the first macro expansion produces:
i2c_drv = container_of(drv, struct i2c_driver, driver);
Then, the next level of expansion creates:
i2c_drv = ({
const typeof( ((struct i2c_driver *)0)->driver) *__mptr = drv;
(struct i2c_driver *)( (char *)__mptr - offsetof(struct i2c_driver, driver));
})
To simplify this, remember that driver
is a variable within the i2c_driver
function. The first line of the macro sets up a pointer that points to the struct device_driver
passed to the code. The second line of the macro finds the real location in memory of the struct i2c_driver
that we want to access.
So with the type of the driver
variable known, the macro can be reduced to:
i2c_drv = ({
const struct device driver *__mptr = drv;
(struct i2c_driver *)( (char *)__mptr - offsetof(struct i2c_driver, driver));
})
To show this using real numbers, say the i2c_driver structure looks like:
struct i2c_driver {
char name[32];
struct device_driver driver;
};
The location of the driver variable usually is 32 bytes into the
structure, depending on the packing rules of the compiler at the time.
So, if the drv
pointer is at location 0xc0101020
, the __mptr
variable is assigned that location in the first line of the macro.
Then, the offset of the driver variable is subtracted from this
address, giving us the value 0xc0101000
. This would be assigned to the variable on the left side of the original assignment, i2c_drv
.
Putting these values into the macro, it then is reduced to:
i2c_drv = ({
const struct device_driver *__mptr = drv;
(struct i2c_driver *)( (char *)__mptr - 0x20);
})
which is then evaluated at runtime by the processor as a simple subtraction on a pointer, which should be quite fast.
In order to do this kind of pointer manipulation, the code has
to know the type of pointer being passed to it. The driver core only
passes in the type of driver structure registered with it, so this type
of manipulation is safe. This also prevents other parts of the kernel
from modifying the unique fields of the structure used to control the
subsystem's code.
(2)likely() ,unlikely()
What are they ?
In Linux kernel code, one often find calls to likely() and unlikely(), in conditions, like :
bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx);
if (unlikely(!bvl)) {
mempool_free(bio, bio_pool);
bio = NULL;
goto out;
}
In fact, these functions are hints for the compiler that allows it
to correctly optimize the branch, by knowing which is the likeliest
one. The definitions of these macros, found in include/linux/compiler.h are the following :
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
The GCC documentation explains the role of __builtin_expect() :
-- Built-in Function: long __builtin_expect (long EXP, long C)
You may use `__builtin_expect' to provide the compiler with branch
prediction information. In general, you should prefer to use
actual profile feedback for this (`-fprofile-arcs'), as
programmers are notoriously bad at predicting how their programs
actually perform. However, there are applications in which this
data is hard to collect.
The return value is the value of EXP, which should be an integral
expression. The value of C must be a compile-time constant. The
semantics of the built-in are that it is expected that EXP == C.
For example:
if (__builtin_expect (x, 0))
foo ();
would indicate that we do not expect to call `foo', since we
expect `x' to be zero. Since you are limited to integral
expressions for EXP, you should use constructions such as
if (__builtin_expect (ptr != NULL, 1))
error ();
when testing pointer or floating-point values.
How does it optimize things ?
It optimizes things by ordering the generated assembly code
correctly, to optimize the usage of the processor pipeline. To do so,
they arrange the code so that the likeliest branch is executed without
performing any jmp instruction (which has the bad effect of flushing the processor pipeline).
To see how it works, let's compile the following simple C user space program with gcc -O2 :
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
int main(char *argv[], int argc)
{
int a;
/* Get the value from somewhere GCC can't optimize */
a = atoi (argv[1]);
if (unlikely (a == 2))
a++;
else
a--;
printf ("%d\n", a);
return 0;
}
Now, disassemble the resulting binary using objdump -S (comments added by me) :
080483b0 :
// Prologue
80483b0: 55 push %ebp
80483b1: 89 e5 mov %esp,%ebp
80483b3: 50 push %eax
80483b4: 50 push %eax
80483b5: 83 e4 f0 and $0xfffffff0,%esp
// Call atoi()
80483b8: 8b 45 08 mov 0x8(%ebp),%eax
80483bb: 83 ec 1c sub $0x1c,%esp
80483be: 8b 48 04 mov 0x4(%eax),%ecx
80483c1: 51 push %ecx
80483c2: e8 1d ff ff ff call 80482e4
80483c7: 83 c4 10 add $0x10,%esp
// Test the value
80483ca: 83 f8 02 cmp $0x2,%eax
// --------------------------------------------------------
// If 'a' equal to 2 (which is unlikely), then jump,
// otherwise continue directly, without jump, so that it
// doesn't flush the pipeline.
// --------------------------------------------------------
80483cd: 74 12 je 80483e1
80483cf: 48 dec %eax
// Call printf
80483d0: 52 push %edx
80483d1: 52 push %edx
80483d2: 50 push %eax
80483d3: 68 c8 84 04 08 push $0x80484c8
80483d8: e8 f7 fe ff ff call 80482d4
// Return 0 and go out.
80483dd: 31 c0 xor %eax,%eax
80483df: c9 leave
80483e0: c3 ret
Now, in the previous program, replace the unlikely() by a likely(), recompile it, and disassemble it again (again, comments added by me) :
080483b0 :
// Prologue
80483b0: 55 push %ebp
80483b1: 89 e5 mov %esp,%ebp
80483b3: 50 push %eax
80483b4: 50 push %eax
80483b5: 83 e4 f0 and $0xfffffff0,%esp
// Call atoi()
80483b8: 8b 45 08 mov 0x8(%ebp),%eax
80483bb: 83 ec 1c sub $0x1c,%esp
80483be: 8b 48 04 mov 0x4(%eax),%ecx
80483c1: 51 push %ecx
80483c2: e8 1d ff ff ff call 80482e4
80483c7: 83 c4 10 add $0x10,%esp
// --------------------------------------------------
// If 'a' equal 2 (which is likely), we will continue
// without branching, so without flusing the pipeline. The
// jump only occurs when a != 2, which is unlikely.
// ---------------------------------------------------
80483ca: 83 f8 02 cmp $0x2,%eax
80483cd: 75 13 jne 80483e2
// Here the a++ incrementation has been optimized by gcc
80483cf: b0 03 mov $0x3,%al
// Call printf()
80483d1: 52 push %edx
80483d2: 52 push %edx
80483d3: 50 push %eax
80483d4: 68 c8 84 04 08 push $0x80484c8
80483d9: e8 f6 fe ff ff call 80482d4
// Return 0 and go out.
80483de: 31 c0 xor %eax,%eax
80483e0: c9 leave
80483e1: c3 ret
How should I use it ?
You should use it only in cases when the likeliest branch is very
very very likely, or when the unlikeliest branch is very very very
unlikely.