Documentation

Overview
Provides an introduction to the problem addressed by the spdiff tool.
Examples
Simple example illustrating generic patch inference as well as examples of real-world usage.
Manual
Explains how to use the spdiff tool.
Papers
List of papers of a more technical nature.

Overview

Is this tool for you?

If you can answer yes to most of the following questions, then you might find this tool (or the general problem addressed by it) interesting.

The problem

The basic problem addessed by the spdiff tool is automatic inference of high level specifications of changes to structured data. We have mainly been concerned with C programs for the time being, but there seems to be no reason why the techniques described here could not also apply to other forms of structured data than C programs.

In our setting a set of programs and their updated versions is given as a set of pairs of programs

C = { (p_1,p_1'), ... , (p_n,p_n') }

The goal is now to find a specification of what changes were applied in all programs. This specification can serve as both a high level documentation of the common changes applied to the programs but also as the base for transforming other programs similarly.

For a more comprehensive description there is a paper to read: pdf. Eventually, more papers will appear in the "papers" section of this web page: go to papers section.

To get an intuition for what spdiff can examples can be useful...

Examples

Suppose we have the following three C programs. The programs have been constructed to illustrate points not to be interesting or sensible programs.

Program 1 Program 2 Program 3
int foo(void) {
  int x;
  f(x);
  x = f(x);
  x = g(117);
  return x;
}
int bar(int y) {
  int a;
  a = f(a);
  a = f(a)+g(y);
  return a;
}
int smoz(int y) {
  int z;
  z = f(z);
  z = f(z);
  return z;
}

Now, further suppose that each program is updated into the following:

Program 1' Program 2' Program 3'
int foo(void) {
  int x;
  f(x);
  x = f(x, 42);
  x = h(g(117));
  return x+x;
}
int bar(int y) {
  int a;
  a = f(a,42);
  a = f(a)+g(y,y);
  return a+a;
}
int smoz(int y) {
  int z;
  z = f(z);
  z = f(z, 42);
  return z+z;
}

The changes to each file can be shown using the diff program:

diff p1.c p1-new.c diff p2.c p2-new.c diff p3.c p3-new.c
 int foo(void) {
   int x;
   f(x);
-  x = f(x);
-  x = g(117);
-  return x;
+  x = f(x, 42);
+  x = h(g(117));
+  return x+x;
 }
 int bar(int y) {
   int a;
-  a = f(a);
-  a = f(a)+g(y);
-  return a;
+  a = f(a,42);
+  a = f(a)+g(y,y);
+  return a+a;
 }
 int smoz(int y) {
   int z;
   z = f(z);
-  z = f(z);
-  return z;
+  z = f(z, 42);
+  return z+z;
 }

By close inspection of the diffs above we can see that some of the changes made in each program look related. In particular some calls to the function f were given a constant extra parameter 42 and in the return-statements the value is added to itself before being returned.

Consider just the changes made to the return statements:

diff p1.c p1-new.c diff p2.c p2-new.c diff p3.c p3-new.c
-  return x;
+  return x+x;
-  return a;
+  return a+a;
-  return z;
+  return z+z;

If diffs had an abstraction mechanism it would be compelling to abstract away the concrete expression returned and somehow use an abstract placeholder (metavariable) instead. (see also the Coccinelle website for more details on abstract patches: Coccinelle)

@@
signed int X0;
@@
- return X0;
+ return X0 + X0;

Applying the above diff to any one of the (original) programs will update the return-statement in the expected manner.

Now consider the rest of the changes:

diff p1.c p1-new.c diff p2.c p2-new.c diff p3.c p3-new.c
-  x = f(x);
-  x = g(117);
+  x = f(x, 42);
+  x = h(g(117));
-  a = f(a);
-  a = f(a)+g(y);
+  a = f(a,42);
+  a = f(a)+g(y,y);
   z = f(z);
-  z = f(z);
+  z = f(z, 42);

It looks like calls to f got an added constant 42 so one could suggest the following abstract transformation:

@@ 
signed int X0;
@@
- X0 = f(X0);
+ X0 = f(X0,42);

However, the above patch does not update all program in the expected manner!

When applying to the program p3.c both of its calls to f(z) will be updated into f(z,42), but only one of them should have been updated.

In order to construct a patch that will update all three programs correctly, we need to find some common property that holds of all the X0=f(X0)-statements that should be updated but not of any other X0=f(X0)-statements.

One candidate for such a property is : "All of the statements matching X0=f(X0) are always followed by a return X0-statement."

Looking again at program p3.c we see that both of the X0=f(X0)-statements have the above property, but if we strengthen the property to additionally say "... and there can be no X0=f(X0)-statement in between." we get a property that fits the bill.

The way we encode the found property in a patch is by using "..." (yes: dot dot dot) to denote such paths:

-	X0=f(X0);
+	X0=f(X0,42);
	...
	return X0;

Putting it all together

We have now found two patches that were common in all the changes made in each of the three programs. We just need to put the two patches together somehow.

One can see that the latter of the two common patches found rely on finding a return-statement in order to apply. Now recall that the first common patch we found applies to the return-statements, so we should apply the first patch after the latter.

However, we can also simply merge the two patches into one and get a patch that has the same effect as applying the two consecutively:

-	X0=f(X0);
	...
+	X0=f(X0,42);
-	return X0;
+	return X0 + X0;

Manual

Functionalities provided

spdiff has 4 modes of operation:

Generic Patch Inference
Finds common changes that can be expressed as a sequence of context-free term-rewrites.
Semantic Patch Inference
Finds common changes that can have certain context-sensitive (temporal) dependencies.
Semantic Pattern Inference
Finds common context-sensitive patterns in a given set of functions.
Inference of abstract function definition
Constructs an abstraction of a given set of functions.

Throughout the following sections we will use the three programs shown previously. We assume that a file named specfile is given with the following contents:

p1.c  p1-new.c
p2.c  p2-new.c
p3.c  p3-new.c

Generic Patch Inference

This is the default mode of operation for spdiff. In this mode, spdiff looks for generic patches that occur threshold number of times in the given set of functions.

What is a generic patch

Example usage

$ spdiff -specfile specfile

The resulting generic patch printed is then...

@@
signed int X0;
@@
- return X0;
+ return X0 + X0;

The resulting generic patch tells us that in all functions we can safely modify the return-statement as shown.

Recall from the discussion above [see example] that in program p3.c there was a statement z = f(z); that did not change so the suggestion to simply rewrite all such statements into X0 = f(X0,42); was considering unsafe.

We can specify to spdiff that a patch need only be safe for a certain number of functions before it is returned as part of the results using the -threshold parameter:

$ spdiff -specfile specfile -threshold 2 -prune
@@
signed int X0;
@@
- X0=f(X0);
+ X0=f(X0,42);
@@
signed int X0;
@@
- return X0;
+ return X0 + X0;

In the example we also specify the -prune parameter. The effect is that spdiff tries not to report several equivalent generic patches.

Semantic Patch Inference

To find semantic patches (see the Coccinelle pages for more details on semantic patches: Coccinelle) one should give the -spatch parameter to spdiff.

Example usage

$spdiff -specfile specfile -spatch
[spatch:]
-	X0=f(X0);
	...
+	X0=f(X0,42);
-	return X0;
+	return X0 + X0;
[spatch:]
-	X0=f(X0);
+	X0=f(X0,42);
	...
	return X0;

...

$ spdiff -specfile specfile -spatch -prune -filter_spatches
-	X0=f(X0);
	...
+	X0=f(X0,42);
-	return X0;
+	return X0 + X0;

Semantic Pattern Inference

Example usage

In the command line shown below, one can note that -specfile is not used. When the -specfile parameter is not given, spdiff assumes that it should use a default specfile with the name specfile.

$ spdiff -patterns -prune -filter_patterns
[Main] *Common* patterns found:
[[[
return X0;
]]]
[[[
X0=f(X0);
...
return X0;
]]]
[[[
signed int X1;;
...
X1=f(X1);
]]]
[[[
signed int X0;;
...
return X0;
]]]

Inference of abstract function definition

Example usage

TODO: Need compelling example...

$ spdiff -specfile specfile_c -fun_common

Summary of Commandline parameters

The spdiff tool accepts a number of command line parameters as shown below.

Usage: 
  -specfile [filename]    name of specification file (defaults to "specfile") 
  -threshold [num]        the minimum number of occurrences required
  -noif0_passing          also parse if0 blocks
  -print_abs              print abstract updates for each term pair
  -relax_safe             consider non-application safe [experimental]
  -print_inline           print types of identifiers inline
  -print_raw              print the raw list of generated simple updates
  -print_uniq             print the unique solutions before removing smaller ones
  -print_add              print statement when adding a new solution
  -prune                  try to prune search space by various means
  -strip_eq               use eq_classes for initial atomic patches
  -patterns               look for common patterns in LHS files
  -spatch                 find semantic patches (not generic)
  -only_changes           only look for patterns in changed functions
  -verbose                print more intermediate results
  -filter_patterns        only produce largest patterns
  -no_malign              *DON'T* use the edit-dist subpatch relation definition
  -filter_spatches        filter non-largest spatches
  -macro_file [filename]  default macros
  -fun_common             infer one abstraction of all functions given
  -print_support          whether to also print the concrete matched functions for fun_common
  -cache                  only print the term pairs to stdout
  -read_generic           input is given in gtree-format
  -max_embedding          how deep to look inside terms when discovering nested patterns
  -tmp                    find embedded PATCHES also
  -show_support           show the support of each sem. patch inferred
  -flow_support           threshold required of flow-patterns
  -help                   Display this list of options
  --help                  Display this list of options

Notes about Unparsing

The pretty printing part of spdiff is currently not very faithful to the original input programs.

Programs are translated to an internal tree structured representation which has the following abstract syntax:

gt      ::= A(type,value) | C(type, gt_list)
gt_list ::= gt | gt, gt_list

A printing function for this simple structure is defined as follows (where type information is not completely used):

let pp_tree t = match t with
  | A(t,v) -> v
  | C(t,gts) -> t ^ "[" ^ concat "," (map pp_tree gts) ^ "]"

For some constructs (such as function calls) the printing function has special cases not shown above which makes the printed result look more like valid C code.

If you find a case where you think that the printing should have better printing, don't hesitate to send me an email: jespera@diku.dk.

Technical paper(s)

The following is a list of papers related to the spdiff tool.

Generic Patch Inference

Abstract: A key issue in maintaining Linux device drivers is the need to update drivers in response to evolutions in Linux internal libraries. Currently, there is little tool support for performing and documenting such changes... [pdf, bibtex]