Documentation
Overview
Is this tool for you?
If you can answer yes to most of the following questions, then you might find this tool (or the general problem addressed by it) interesting.
- Have you ever developed a program that made use of some library?
- Have you ever had to update your program because of changes in the library you used?
- Have you ever been frustrated that the required changes you had to make in your program because of the mentioned library changes were not clearly documented?
- Have you ever felt annoyed that none of the many other people using the same library decided to document how they adapted their programs to the new version of the library?
The problem
The basic problem addessed by the spdiff tool is automatic inference of high level specifications of changes to structured data. We have mainly been concerned with C programs for the time being, but there seems to be no reason why the techniques described here could not also apply to other forms of structured data than C programs.
In our setting a set of programs and their updated versions is given as a set of pairs of programs
C =
{ (p_1,p_1'), ... , (p_n,p_n') }
The goal is now to find a specification of what changes were applied in all programs. This specification can serve as both a high level documentation of the common changes applied to the programs but also as the base for transforming other programs similarly.
For a more comprehensive description there is a paper to read: pdf. Eventually, more papers will appear in the "papers" section of this web page: go to papers section.
To get an intuition for what spdiff can
examples can be useful...
Examples
Suppose we have the following three C programs. The programs have been constructed to illustrate points not to be interesting or sensible programs.
| Program 1 | Program 2 | Program 3 |
|---|---|---|
int foo(void) {
int x;
f(x);
x = f(x);
x = g(117);
return x;
}
|
int bar(int y) {
int a;
a = f(a);
a = f(a)+g(y);
return a;
}
|
int smoz(int y) {
int z;
z = f(z);
z = f(z);
return z;
}
|
Now, further suppose that each program is updated into the following:
| Program 1' | Program 2' | Program 3' |
|---|---|---|
int foo(void) {
int x;
f(x);
x = f(x, 42);
x = h(g(117));
return x+x;
}
|
int bar(int y) {
int a;
a = f(a,42);
a = f(a)+g(y,y);
return a+a;
}
|
int smoz(int y) {
int z;
z = f(z);
z = f(z, 42);
return z+z;
}
|
The changes to each file can be shown using the diff
program:
diff p1.c p1-new.c |
diff p2.c p2-new.c |
diff p3.c p3-new.c |
|---|---|---|
int foo(void) {
int x;
f(x);
- x = f(x);
- x = g(117);
- return x;
+ x = f(x, 42);
+ x = h(g(117));
+ return x+x;
}
|
int bar(int y) {
int a;
- a = f(a);
- a = f(a)+g(y);
- return a;
+ a = f(a,42);
+ a = f(a)+g(y,y);
+ return a+a;
}
|
int smoz(int y) {
int z;
z = f(z);
- z = f(z);
- return z;
+ z = f(z, 42);
+ return z+z;
}
|
By close inspection of the diffs above we can see that some of the
changes made in each program look related. In particular some calls to
the function f were given a constant extra parameter
42 and in the return-statements the value is
added to itself before being returned.
Consider just the changes made to the return statements:
diff p1.c p1-new.c |
diff p2.c p2-new.c |
diff p3.c p3-new.c |
|---|---|---|
- return x; + return x+x; |
- return a; + return a+a; |
- return z; + return z+z; |
If diffs had an abstraction mechanism it would be
compelling to abstract away the concrete expression returned and
somehow use an abstract placeholder (metavariable) instead.
(see also the Coccinelle website for more details on abstract patches:
Coccinelle)
@@ signed int X0; @@ - return X0; + return X0 + X0;
Applying the above diff to any one of the (original) programs will
update the return-statement in the expected manner.
Now consider the rest of the changes:
diff p1.c p1-new.c |
diff p2.c p2-new.c |
diff p3.c p3-new.c |
|---|---|---|
- x = f(x); - x = g(117); + x = f(x, 42); + x = h(g(117)); |
- a = f(a); - a = f(a)+g(y); + a = f(a,42); + a = f(a)+g(y,y); |
z = f(z); - z = f(z); + z = f(z, 42); |
It looks like calls to f got an added constant
42 so one could suggest the following abstract
transformation:
@@ signed int X0; @@ - X0 = f(X0); + X0 = f(X0,42);
However, the above patch does not update all program in the expected manner!
When applying to the program p3.c both of its
calls to f(z) will be updated into f(z,42), but only
one of them should have been updated.
In order to construct a patch that will update all three programs
correctly, we need to find some common property that holds of all the
X0=f(X0)-statements that should be updated but not of any
other X0=f(X0)-statements.
One candidate for such a property is : "All of the statements
matching X0=f(X0) are always followed by a return
X0-statement."
Looking again at program p3.c we see that both of the
X0=f(X0)-statements have the above property, but if we
strengthen the property to additionally say "... and there can be no
X0=f(X0)-statement in between." we get a property that fits
the bill.
The way we encode the found property in a patch is by using
"..." (yes: dot dot dot) to denote such paths:
- X0=f(X0); + X0=f(X0,42); ... return X0;
Putting it all together
We have now found two patches that were common in all the changes made in each of the three programs. We just need to put the two patches together somehow.
One can see that the latter of the two common patches found rely on
finding a return-statement in order to apply. Now recall
that the first common patch we found applies to the
return-statements, so we should apply the first patch
after the latter.
However, we can also simply merge the two patches into one and get a patch that has the same effect as applying the two consecutively:
- X0=f(X0); ... + X0=f(X0,42); - return X0; + return X0 + X0;
Manual
Functionalities provided
spdiff has 4 modes of operation:
- Generic Patch Inference
- Finds common changes that can be expressed as a sequence of context-free term-rewrites.
- Semantic Patch Inference
- Finds common changes that can have certain context-sensitive (temporal) dependencies.
- Semantic Pattern Inference
- Finds common context-sensitive patterns in a given set of functions.
- Inference of abstract function definition
- Constructs an abstraction of a given set of functions.
Throughout the following sections we
will use the three programs shown previously. We assume that a file
named specfile is given with the following contents:
p1.c p1-new.c p2.c p2-new.c p3.c p3-new.c
Generic Patch Inference
This is the default mode of operation for
spdiff. In this mode, spdiff looks
for generic patches that occur threshold number of
times in the given set of functions.
What is a generic patch
Example usage
$ spdiff -specfile specfile
The resulting generic patch printed is then...
@@ signed int X0; @@ - return X0; + return X0 + X0;
The resulting generic patch tells us that in
all functions we can safely modify the
return-statement as shown.
Recall from the discussion above [see
example] that in program p3.c there
was a statement z = f(z); that did not change so
the suggestion to simply rewrite all such
statements into X0 = f(X0,42); was considering
unsafe.
We can specify to spdiff that a patch need
only be safe for a certain number of functions
before it is returned as part of the results using the
-threshold parameter:
$ spdiff -specfile specfile -threshold 2 -prune
@@ signed int X0; @@ - X0=f(X0); + X0=f(X0,42); @@ signed int X0; @@ - return X0; + return X0 + X0;
In the example we also specify the -prune
parameter. The effect is that spdiff tries not
to report several equivalent generic patches.
Semantic Patch Inference
To find semantic patches (see the Coccinelle pages
for more details on semantic patches: Coccinelle)
one should give the -spatch parameter to
spdiff.
Example usage
$spdiff -specfile specfile -spatch
[spatch:] - X0=f(X0); ... + X0=f(X0,42); - return X0; + return X0 + X0; [spatch:] - X0=f(X0); + X0=f(X0,42); ... return X0;
...
$ spdiff -specfile specfile -spatch -prune -filter_spatches
- X0=f(X0); ... + X0=f(X0,42); - return X0; + return X0 + X0;
Semantic Pattern Inference
Example usage
In the command line shown below, one can note that
-specfile is not used. When the
-specfile parameter is not given,
spdiff assumes that it should use a default
specfile with the name specfile.
$ spdiff -patterns -prune -filter_patterns
[Main] *Common* patterns found: [[[ return X0; ]]] [[[ X0=f(X0); ... return X0; ]]] [[[ signed int X1;; ... X1=f(X1); ]]] [[[ signed int X0;; ... return X0; ]]]
Inference of abstract function definition
Example usage
TODO: Need compelling example...
$ spdiff -specfile specfile_c -fun_common
Summary of Commandline parameters
The spdiff tool accepts a number of command line
parameters as shown below.
Usage: -specfile [filename] name of specification file (defaults to "specfile") -threshold [num] the minimum number of occurrences required -noif0_passing also parse if0 blocks -print_abs print abstract updates for each term pair -relax_safe consider non-application safe [experimental] -print_inline print types of identifiers inline -print_raw print the raw list of generated simple updates -print_uniq print the unique solutions before removing smaller ones -print_add print statement when adding a new solution -prune try to prune search space by various means -strip_eq use eq_classes for initial atomic patches -patterns look for common patterns in LHS files -spatch find semantic patches (not generic) -only_changes only look for patterns in changed functions -verbose print more intermediate results -filter_patterns only produce largest patterns -no_malign *DON'T* use the edit-dist subpatch relation definition -filter_spatches filter non-largest spatches -macro_file [filename] default macros -fun_common infer one abstraction of all functions given -print_support whether to also print the concrete matched functions for fun_common -cache only print the term pairs to stdout -read_generic input is given in gtree-format -max_embedding how deep to look inside terms when discovering nested patterns -tmp find embedded PATCHES also -show_support show the support of each sem. patch inferred -flow_support threshold required of flow-patterns -help Display this list of options --help Display this list of options
Notes about Unparsing
The pretty printing part of spdiff is
currently not very faithful to the original input
programs.
Programs are translated to an internal tree structured representation which has the following abstract syntax:
gt ::= A(type,value) | C(type, gt_list) gt_list ::= gt | gt, gt_list
A printing function for this simple structure is defined as follows (where type information is not completely used):
let pp_tree t = match t with | A(t,v) -> v | C(t,gts) -> t ^ "[" ^ concat "," (map pp_tree gts) ^ "]"
For some constructs (such as function calls) the printing function has special cases not shown above which makes the printed result look more like valid C code.
If you find a case where you think that the printing should have better printing, don't hesitate to send me an email: jespera@diku.dk.
Technical paper(s)
The following is a list of papers related to the spdiff tool.
- Generic Patch Inference
-
Abstract: A key issue in maintaining Linux device drivers is the need to update drivers in response to evolutions in Linux internal libraries. Currently, there is little tool support for performing and documenting such changes... [pdf, bibtex]