-- linux爱好者,业余时间热衷于分析linux内核源码 -- 目前主要研究云计算和虚拟化相关的技术,主要包括libvirt/qemu,openstack,opennebula架构和源码分析。 -- 第五届云计算大会演讲嘉宾 微博:@Marshal-Liu
分类: LINUX
2010-03-26 21:31:48
is a library for doing static analysis on large C projects. It is interesting because unlike other tools around, it has actually been applied to large projects in the real world (such as the Linux kernel), it supports most gcc extensions, C99, and many subtleties of the C language. Another thing which makes it interesting is that it is distributed under a free license (specifically, BSD).
In this document I describe how I used it to analyze .
Static analysis is the process of loading the source code for a program into memory, then analyzing or transforming it in various ways. The most common programs around which do static analysis are compilers: they both analyze the program for warnings, and transform it (optimization, conversion to machine code).
Compilers such as gcc nowadays do sophisticated analyses on programs -- reachability, dataflow, pointer aliasing, and so on. However they are not customizable. You can't stop the compiler during its analysis phase and add your own constraints. Some of the things you might want to add are:
A()
and B()
are always called in that order, and never B(); A();
.
errno
whenever I
return an error indication from a library function.
versustesting
Static analysis is concerned with the program at compile time. Testing observes the program's behaviour at runtime. As such static analysis complements testing, because each can find bugs in the code that the other misses.
Because the rules you want to apply are likely to be quite specific to your C program or library, you will necessarily end up having to write up these rules in some sort of programming language. For the Stanford Checker, rules were written in a special language called "MC". For CIL you have to write them up in .
CIL does not support C++. There are other solutions which work for C++, including / .
Static analysis of C code is useful, but if you are thinking of starting a new project, choose a sensible language! Don't start new projects in C or C++, or dynamic languages like Perl and Python which leave errors in code for customers to find.
A leading commercial (and very proprietary) tool for static analysis of C programs is , which started life as the Stanford Checker. Early in 2001 postings started to appear on LKML detailing bugs in the Linux kernel which had been found through static analysis (for example: and ).
It is highly recommended that you read the first few chapters of the CIL documentation.
CIL reads C source code after it has gone through the C preprocessor. It performs some transformations on the code to make analysis simpler, and it turns the C source code into memory structures which describe the name, type and contents of every variable and function definition (and everything else) in the program.
CIL provides a drop-in compatible replacement for gcc
called cilly
which you can use in Makefiles
(eg. make CC=cilly
). It also provides library functions
that you can use to import preprocessed C files and iterate over their
structural contents.
CIL provides libraries for doing common transformations and analyses on programs. Some of the more important ones are:
Note two things: Firstly compilers like GCC already do these sorts of transformations and analyses, however not in a way that you can use or extend. Secondly these are only preludes to writing your own program-specific analysis programs. Typically you will use the library to load the source and do at least CFG analysis on it, before you start to do your own analysis.
For working code, please refer to the section at the top of this document.
Before embarking on libvirt, I wrote a few lines of test code in C,
split across a couple of files, and loaded those into CIL using the
library functions. I preprocessed them using /bin/cpp
first, as required by CIL. The OCaml code to load the C code is just:
let files = [ "test1.i"; "main.i" ]
let files = List.map (fun filename -> (Frontc.parse filename) ()) files
The two files are then merged together in memory (kind of like linking), and unused prototypes and definitions are removed. The unused stuff mainly comes from libc header files.
let file = Mergecil.merge files "test"
let () = Rmtmps.removeUnusedTemps file
Finally I call the CIL function to perform control flow graph (CFG) analysis which will later allow me to produce some pretty pictures of the functions:
let () = Cfg.computeFileCFG file
Now as a test I wrote some analysis code. This code simply iterates over the top level expressions in the program, finds function definitions, and prints out the control flow graphs of the functions as GraphViz dot-files. It ignores anything which isn't a function definition:
List.iter (
function
| GType _ -> () (* typedef *)
| GCompTag _ -> () (* struct/union *)
| GCompTagDecl _ -> () (* forward prototype of struct/union *)
| GEnumTag _ -> () (* enum *)
| GEnumTagDecl _ -> () (* forward prototype of enum *)
| GVarDecl _ -> () (* variable/function prototype *)
| GVar _ -> () (* variable definition *)
| GFun (fundec, loc) -> (* function definition *)
printf "%s:%d: %s has %d arg(s)\n"
loc.file loc.line fundec.svar.vname (List.length fundec.sformals);
(* Write control flow graph to a GraphViz dot-file: *)
Cfg.printCfgFilename (fundec.svar.vname ^ ".dot") fundec
| GAsm _ -> () (* global asm statement *)
| GPragma _ -> () (* toplevel #pragma *)
| GText _ -> () (* verbatim text, comments *)
) file.globals;
When you run this program (make run-ciltest1
), you see:
test1.c:11: test1_add has 2 arg(s)
test1.c:22: test1_inc has 1 arg(s)
test1.c:36: test1_print has 1 arg(s)
test1.c:48: test1_last_error has 0 arg(s)
test1.c:56: test1_print_last_error has 0 arg(s)
main.c:10: main has 0 arg(s)
and here are some of the GraphViz dot-files rendered next to their corresponding functions:
Control flow graph | Function |
---|---|
int |
|
void |
|
int
In the corresponding graph for this function we see
two |
For working code, please refer to the section at the top of this document.
Before we can perform any analysis, we first have to load the libvirt
C code into CIL. I found it simpler to avoid the
cilly
drop-in replacement for gcc, and instead just
compile libvirt like this:
make CC="gcc "
This builds libvirt in the ordinary way, but leaves around the
preprocessed C files (*.i
), and these can be directly
loaded into CIL using the library functions.
There are some fragments of shell script in the Makefile which
are just there to build libvirt with the special CC variable,
find the list of *.i
files, and get the list of
exported symbols (the entry points into libvirt). Interested
readers can consult the Makefile rule run-cilimportvirt
for details.
In any case, once these housekeeping tasks are done, the whole of
libvirt was loaded into CIL, barring one problem: CIL doesn't yet
implement , so I had to modify libvirt's build to add
-D_Bool=unsigned
. Still, a single change isn't bad
considering that libvirt is a large body of code written by a whole
variety of authors which uses C99 and gcc extensions throughout.
Control flow graphs of real functions in libvirt are considerably more
complicated than my test functions. For example on the right is the
CFG of function xenHypervisorMakeCapabilitiesXML
. The
Postscript file produced by GraphViz actually crashes most file
viewers.
As a simple example of a static analysis, I used CIL along with my list of exported function names to extract the argument type(s), return type and point of definition of each exported function. The output looks like this:
virterror.c:70: virGetLastError exported function with 0 arg(s)
virErrorPtr (void)
virterror.c:88: virCopyLastError exported function with 1 arg(s)
int (virErrorPtr to )
virterror.c:105: virResetError exported function with 1 arg(s)
void (virErrorPtr err )
virterror.c:126: virResetLastError exported function with 0 arg(s)
void (void)
[...]
libvirt.c:56: virInitialize exported function with 0 arg(s)
int (void)
libvirt.c:336: virGetVersion exported function with 3 arg(s)
int (unsigned long *libVer , char const *type , unsigned long *typeVer )
libvirt.c:443: virConnectOpen exported function with 1 arg(s)
virConnectPtr (char const *name )
libvirt.c:461: virConnectOpenReadOnly exported function with 1 arg(s)
virConnectPtr (char const *name )
libvirt.c:478: virConnectClose exported function with 1 arg(s)
int (virConnectPtr conn )
libvirt.c:504: virConnectGetType exported function with 1 arg(s)
char const *(virConnectPtr conn )
libvirt.c:534: virConnectGetVersion exported function with 2 arg(s)
int (virConnectPtr conn , unsigned long *hvVer )
libvirt.c:566: virConnectGetHostname exported function with 1 arg(s)
char *(virConnectPtr conn )