PERFORMANCE ENGINEERING

Jyrki Katajainen
Autumn 1998

Last modified 4.5.1999 18:00

Meetings
Credit points
1. Lecture: Introduction
2. Lecture: Programming style
3. Lecture: Software tools
4. Lecture: Reducing instruction count
5. Lecture: Utilizing word parallelism
Extra Lecture: Learning C in four hours
6. Lecture: Improving paging performance
7. Lecture: Improving cache behaviour
8. Lecture: Utilizing processor parallelism
9. Lecture: Utilizing instruction parallelism
10. Lecture: Utilizing computer parallelism

Meetings

7.	Sept	10.15 am 14.00 pm	sal E35B	Jagtvej 155B
14.	Sept	10.15 am 14.00 pm	sal E:3318	E-huset, LTH
21.	Sept	10.15 am 14.00 pm	sal E35B	Jagtvej 155B
28.	Sept	10.15 am 14.00 pm	sal E:3318	E-huset, LTH
5.	Oct	10.15 am 14.00 pm	sal E35B	Jagtvej 155B
12.	Oct	10.15 am 14.00 pm	sal E35B	Jagtvej 155B
19.	Oct	10.15 am 14.00 pm	sal E:3318	E-huset, LTH
26.	Oct	10.15 am 14.00 pm	sal E35B	Jagtvej 155B
2.	Nov	10.15 am 14.00 pm	sal E:1407	E-huset, LTH
9.	Nov	10.15 am 14.00 pm	sal E35B	Jagtvej 155B
16.	Nov	10.15 am 14.00 pm	sal E:1407	E-huset, LTH

Credit points

København

1 point: active participation

x/50 point(s): x% of all home exercises

1 point: both programming exercises

1 point per a lecture note

1 point per a presentation (1 hour)

Total: 04 Danish points

Lund

active participation

both programming exercises

one of the following:

> 50% of all home exercises

a lecture note

a presentation (1 hour)

Total: 5 Swedish points

1. Lecture: Introduction

My main sources of inspiration were the books by Bentley:

Jon Bentley, Programming Pearls, Addison-Wesley Publishing Company, Reading (1986)

Jon Bentley, More Programming Pearls: Confessions of a Coder, Addison-Wesley Publishing Company, Reading (1988)

Home exercises

1. Write a program that computes a minimum spanning tree for a undirected, weighted graph. Your program should be based on Kruskal's algorithm. The emphasis in this exercise is in the correctness and in the style of programming.

2. Keep a diary of the errors you make when solving Exercise 1. What was the most difficult error to find and correct? What did you learn from this error?

3. How would you classify the errors reported for Exercise 2?

4. What is the most difficult programming error you have ever made? Could you describe it briefly? How were you able to recover from it?

2. Lecture: Programming style

Beforehand reading

Rob Pike, Notes on programming in C, Worldwide Web Document (1989)

Home exercises

1. Write a short review of Knuth's implementation of Kruskal's algorithm given in [Donald E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley Publishing Company, Reading (1993), 460468].

2. Consider the string copying routines discussed in the lecture. Test the speed of these routines in your computer by using an optimizing compiler (gcc -O4 or higher). Take an assembler listing of the program (gcc -O4 -S) and try to explain why the slowest routine is slower than the fastest one.

3. Write a string-copying routine that copies a string into another string wordwise, not bytewise as the routines discussed in the lecture. Is your routine faster than the routine available in the library?

4. Given two arrays s and t, both of size n. What is the pure C cost [Jyrki Katajainen and Jesper Larsson Träff, A meticulous analysis of mergesort programs, Proceedings of the 3rd Italian Conference on Algorithms and Complexity, Lecture Notes in Computer Science 1203, Springer-Verlag, Berlin/Heidelberg (1997), 217228] of copying all elements from array t to array s? How much can you improve the program by using loop unrolling?

3. Lecture: Software tools

Beforehand reading

Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick, An execution profiler for modular programs, SoftwareÑPractice and Experience 13 (1983), 671685

Donald E. Knuth, Literate programming, The Computer Journal 27 (1984), 97111

Special guest star

Martin Bruce: Profiling

Home exercises

1. Evaluate briefly the research proposal that I will send to the Danish Natural Science Research Council. Could you think to take part in this project?

2. Consider a simple maximum-finding subroutine that scans through the input array and updates the current maximum when necessary. Assume that the input for this subroutine is a random permutation of the numbers 0, 1, ..., n-1. Let X denote the number of times the value of the current maximum (variable t in my program) is updated. What is the expected value of X? First guess the answer, then profile the program, and finally prove the result analytically. How was your guess? (Hint: To solve puzzles like this, it is often helpful to consult one of Knuth's books [Donald E. Knuth, The Art of Computer Programming, 3 volumes, Addison-Wesley Publishing Company, Reading (19681998)].)

3. Read Column 1 on profilers in [Jon Bentley, More Programming Pearls: Confessions of a Coder, Addison-Wesley Publishing Company, Reading (1988)] and answer to Exercise 2 therein.

4. In [Jon Bentley, More Programming Pearls: Confessions of a Coder, Addison-Wesley Publishing Company, Reading (1988), p. 184 ] Bentley writes: "For faster implementations of prime sieves, see the Communications of the ACM paper by Mairson (September 1977), Gries and Misra (December 1978), and Pritchard (January 1981), or Pritchard's 'Linear prime-number sieves: A family tree' in Science of Computer Programming, vol. 9, pp. 1735, 1987." Go to the library and tell what is your opinion on the words "faster implementations".

4. Lecture: Reducing instruction count

Beforehand reading

Donald E. Knuth, An empirical study of FORTRAN programs, SoftwareÑPractice and Experience 1 (1971), 105133

Home exercises

1. Consider the problem of sorting an array of four integers. Write a program that solves this problem as efficiently as possible. What is the pure C cost of your program? How many element comparisons do you need?

2. Many sorting routines are hybrid where the small inputs are handled by insertion sort. Usually the diversion thresholds between 10 and 50 work well. Write a tuned version of insertion sort and analyse the pure C cost of your routine. (My tuned version sorts n integers with 2n² + 8n + O(1) pure C operations in the worst case. Can you do better?)

3. Below you get two program segments that partition an array x[l..h] around the value v, i.e., after the partitioning all elements smaller than v appear in the same array before the elements larger than v. Tranform the programs to use the pointer notation instead of the array notation and remove all useless pointer artithmetic. Analyse both the worst case and the best case of these optimized programs in the pure C terms.

Partitioning by the wheel technique [Jon Bentley, Programming Pearls, Addison-Wesley Publishing Company, Reading (1986), p. 110]:

m = l - 1;

for (i = l; i <= h; i++) {

if (x[i] < v) {

m = m + 1;

temp = x[m]; x[m] = x[i]; x[i] = temp;

}

Partitioning with two meeting pointers [Robert Sedgewick, Algorithms, 2nd edition, Addison-Wesley Publishing Company, Inc. (1988), p. 118]:

i = l - 1; j = h + 1;

x[i] = v; x[j] = v;

do {

do { i = i + 1; } while (x[i] < v);

do { j = j - 1; } while (x[j] > v);

temp = x[i]; x[i] = x[j]; x[j] = temp;

} while (i < j);

temp = x[i]; x[i] = x[j]; x[j] = temp;

4. Sedgewick [Robert Sedgewick, Algorithms in C, Parts 14, 3rd edition, Addison-Wesley Publishing Company, Inc. (1998), p. 346] points out that by reversing the order of the second subarray and merging from the outside to the middle, two consecutive subarrays can be merged without checking to see if subarrays have been exhausted. When one of the two subarrays is fully consumed, its array index will be incremented off the end of the subarray and will point at the largest element in the other subarray. Since this element is larger than each of the other elements in the remaining subarray, it serves as a sentinel and will not be chosen until the final merge step.

Program this merge routine as efficiently as you can and verify that it is faster than the standard merge.

5. Lecture: Utilizing word parallelism

Beforehand reading

Torben Hagerup, Sorting and searching on the word RAM, Proceedings of the 15th Annual Symposium on Theoretical Aspects of Computer Science, Lecture Notes in Computer Science 1373, Springer-Verlag, Berlin/Heidelberg (1998), 366398

Special guest star

Morten Nicolaj Pedersen: Sorting on the word RAM

Home exercises

1. Prove analytically that \sum_{i=1}^{n} 1/i = ln n + \Theta(1) for an integer n >= 1. (Hint: If you have problems with this kind of basic calculus, the book [Ronald L. Graham, Donald E. Knuth, and Oren Patashnik, Concrete Mathematics: A Foundation for Computer Science, 2nd edition, Addison-Wesley Publishing Company, Reading (1994)] is a must for you.)

2. [Jon Bentley, Programming Pearls, Addison-Wesley Publishing Company, Reading (1986), Exercise 6, p. 90] How can sentinels be used in a program to find a maximum element in an array? How can sentinels decrease the search time in sets represented by linked lists, hash tables, and binary search trees?

3. Assume that a binary tree is being represented as follows:

template <class D>

struct Node {

D data;

Node* left;

Node* right;

Node(const D & v, Node* p, Node* q) : data(v), left(p), right(q) {

}

};

Remove the recursion from the following C++ program that is also available electronically with a small driver. Try to make your solution as space-efficient as possible.

void inorder(ostream& s, Node* root) {

if (root != 0) {

inorder(s, root -> left);

s << root -> data;

inorder(s, root -> right);

}

4. Assume that n is a positive integer that fits into one machine word. Consider the problem of computing the whole-number logarithm of n. This problem can be solved in constant time with the following float-conversion hack which is taken from [Arne Andersson and Mikkel Thorup, A pragmatic implementation of monotone priority queues, Worldwide Web Document (1997)].

typedef struct {

unsigned int signbit : 1;

unsigned int exponentbias1023 : 11;

/* unsigned int mantissa : 52; */

} doublemask;

typedef union {

double as_double;

doublemask as_mask;

} doubleandmask;

/* Takes a 32-bit word and finds the leftmost 1-bit */

inline unsigned int leftmost_one(unsigned int n) {

doubleandmask fm;

fm.as_double = (double) n;

return (fm.as_mask.exponentbias1023 - 1023);

}

Propose alternative solutions for the problem and time these solutions against the float-conversion hack.

Programming exercise I

Implement at least two variants of Kruskal's algorithm for finding a minimum spanning tree of a given undirected, weighted, and connected graph G = (V, E, w). Assume that the vertex set V of the input graph is a subset of {0,1,...,N}, where N is some non-negative integer. The vertex set is represented as a boolean array so that true in position i of this array means that vertex i is present, and false that it is not. The edge set E is represented as an array of edges. Furthermore, assume that edge weights given by the function ware non-negative integers that fit into one machine word, but note that the sum of two weights can already give an overflow.

In computing literature several improvements to Kruskal's algorithm have been proposed. Below I list some them; you are welcome to use any of them in your implementation, but you should also think if there are other possibilities.

1) Both in [Bernard M. E. Moret, and Henry D. Shapiro, An empirical analysis of algorithms for constructing a minimum spanning tree, Proceedings of the 2nd International Workshop on Data Structures and Algorithms, Lecture Notes in Computer Science 519, Springer-Verlag, Berlin/Heidelberg (1991), 400-411] and in [John R. Black Jr., Charles U. Martel, and Hongbin Qi, Graph and hashing algorithms for modern architectures: Design and performance, Proceedings of the 2nd Workshop on Algorithm Engineering, Technical Report MPI-I-98-1-019, Max-Planck-Institut für Informatik, Saarbrücken (1998), 3748] it was reported that the representation of the graph has a significant effect on the performance of many graph algorithms. Observe that in our case the edges are given in an array. How costly is it to create an adjacency-list representation, if such is used?

2) In [Donald E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley Publishing Company, Reading (1993), 460468] it was pointed out that in the case of integer weights sorting can be carried out efficiently by using radix sort. The question is whether this is still a good idea when the weights are as large as 2³².

3) In [J. J. Brennan, Minimal spanning trees and partial sorting, Information Processing Letters 1 (1982), 113116] it was observed that it is profitable to perform partial sorting in the hope that not all edges need be considered. A related technique is used in the LEDA library, for further details see the source code available at DIKU [/usr/local/projects2/algorithm/LEDA-3.4.1/src/graph_alg/min_spanning.c].

4) In [A. Kershenbaum and R. Van Slyke, Computing minimum spanning trees efficiently, Proceedings of the 1972 Annual Conference, ACM, New York (1972), 518527] the edges were maintained in a priority queue and, if the endpoints of the edge to be inserted into the queue were already in the same fragment, the edge was rejected immediately. This way the sorting of all edges is avoided.

5) The UNION-FIND part of the algorithm can actually be problematic since most UNION-FIND data structures use heavily random-access capabilities of computers. Due to caching random accesses are expensive. Can this part of the algorithm be implemented so that the memory accesses are more local?

I recommend that you write your programs in C++ and use the tools available in the STL library when possible. Your programs can be built upon the graph class written by me, but it is not obligatory to use it.

Compare the performance of your implementations for randomly generated graphs. When you are measuring actual running times, it might be a good idea to run the programs in at least two different kind of computers. Other relevant measures of goodness are the number of memory references, cache misses, etc.

The emphasis in this exercise is the speed of your programs. In your report you should briefly document your final programs and the results of your experiments. Do not forget to mention the unsuccessful attempts that did not give any speed-up. I would also like hear about the problems you encountered when writing the programs (compiler errors, library bugs, etc.)

Return your answer electronically to jyrki@di.ku.dk in a format I can read (LaTeX, PostScript, rtf, html, plain text). The deadline for this exercise is 18 October 1998 at 24.00 Danish time.

Extra Lecture: Learning C in four hours

Beforehand reading

Al Kelley and Ira Pohl, A Book on C, 4th edition, The Benjamin Cummings Publishing Company, Inc., Redwood City (1998)

Bjarne Stroustrup, The C++ Programming Language, 3rd edition, Addison-Wesley, Reading (1997)

Voluntary home exercises

1. How would you write a routine that checks whether a sorting program works properly or not?

2. It is generally accepted that inline functions are as fast as macros. Is this the case in your instalation?

3. Test the quality of the pseudo-random numbers produced by the class provided in [Bjarne Stroustrup, The C++ Programming Language, 3rd edition, Addison-Wesley, Reading (1997), p. 686].

4. Read through the example programs provided in the lecture notes. Which parts of the programs were difficult to understand and how would you improve them?

6. Lecture: Improving paging performance

Beforehand reading

Jeffrey S. Vitter, External memory algorithms, Worldwide Web Document (1998)

Home exercises

1. Given two words, show that the position of their first differing bit can be found in constant time.

2. Write a function that computes a table of size 2^b whose ith entry gives the number of 1-bits in the binary presentation of i, for each i, 0 <= i < 2^b. Explain how would you use such a table to compute the number of 1-bits in each 32-bit word in a given array of size n. How many pure C operations are needed for both of these computations?

3. Consider the simulation of a sequential access machine with multiple memories (multitape Turing machine) on a normal computer with a two-level paged memory. The standard page-replacement strategy LRU (least recently used) does not necessarily work in a satisfactory manner. Why not? Devise a memory management strategy with which the simulation can be carried out efficiently.

4. Complete the graph class written by me with a function that tests whether a given graph is a tree or not. Assume that your function is executed on a two-level paged memory for a graph with n vertices and m edges. How many page transfers will be performed as a function of n, m, and B, where B is the capacity of a page measured in words. Can you do better?

7. Lecture: Improving cache behaviour

Beforehand reading

Alvin R. Lebeck and David A. Wood, Cache profiling and the SPEC benchmarks: A case study, Computer 27,10 (1994), 1126

Special guest star

Maz H. Spork: Cache-concious design of algorithms

Home exercises

1. [Jon Bentley, Programming Pearls, Addison-Wesley Publishing Company, Reading (1986), Exercise 8, p. 90] Consider the searching problem where we are given a sorted array of n elements of type T and one more element x of the same type and the task is to determine whether the array contains the element x. If x is in the array, a reference to it is returned, otherwise a reference to a null element is returned. It has been claimed that, for n = 1 000, this problem can be solved more efficiently by hashing than by a tuned binary search. Implement a fast hashing routine and compare it to the tuned binary-search routine written by me.

2. An alternative to binary search is square-root(n) search. Here a new array is created from a given sorted array of size n by putting every square-root(n)'th element of the original array to the first locations of the new array and the remaining elements in sorted order after them. Explain how the creation of the new array can be done in-place with linear cost. By using the new array, an element searched for is found with O(square-root(n)) comparisons by scanning the first square-root(n) elements and then the square-root(n) elements in the subarray that may contain the element. Implement a square-root(n)-search routine and compare it experimentally to the tuned binary search routine written by me? Which of these two methods is faster when n = 1 000 000?

3. In [Alvin R. Lebeck and David A. Wood, Cache profiling and the SPEC benchmarks: A case study, Computer 27,10 (1994), 1126] it was pointed out that by reducing the size of a hash table from 69 001 to 5 003 it was possible to speed up a program using it, because the smaller hash table fits in the data cache. Test whether something similar happens on your computer.

4. In [John R. Black Jr., Charles U. Martel, and Hongbin Qi, Graph and hashing algorithms for modern architectures: Design and performance, Proceedings of the 2nd Workshop on Algorithm Engineering, Technical Report MPI-I-98-1-019, Max-Planck-Institut für Informatik, Saarbrücken (1998), 3748] the performance of arrays and linked lists were compared. It was reported that reading all elements of an array can be 10 times faster than the corresponding linked list scan when the space required by the elements exceeds the capacity of the data cache. Carry out experiments to see what happens on your computer.

8. Lecture: Utilizing processor parallelism

Beforehand reading

Cherri M. Pancake, Is parallelism for you?, IEEE Computational Science & Engineering 3,2 (1996), 1837

Home exercises

1. Compare the performance of the following three programs for multiplying matrices of size n x n when n is large. The last two of these programs are taken from [Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf, The cache performance and optimizations of blocked algorithms, Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, SIGPLAN Notices 26,4 (1991), 6374].

The standard matrix multiplication code:

unsigned int i, j, k;

for (i = 0; i < n; ++i)

for (j = 0; j < n; ++j)

for (k = 0; k < n; ++k)

z[i][j] += x[i][k] * y[k][j];

The standard matrix multiplication code after changing the order of the last two loops:

unsigned int i, j, k;

for (i = 0; i < n; ++i)

for (k = 0; k < n; ++k) {

r = x[i][k];

for (j = 0; j < n; ++j)

z[i][j] += r * y[k][j];

}

The matrix multiplication code blocked to reduce cache misses:

unsigned int kk, jj, i, j, k;

for (kk = 0; kk < n; kk += b)

for (jj = 0; jj < n; jj += b)

for (i = 0; i < n; ++i)

for (k = kk; k < min(kk + b, n); ++k) {

r = x[i][k];

for (j = jj; j < min(jj + b, n); ++j)

z[i][j] += r * y[k][j];

}

What is the optimal blocking factor b on your computer? How does the performance of these programs vary with small changes to the matrix size n?

2. Read Chapter 28 of the book [Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, The MIT Press, Cambridge (1990)] and solve Problem 28-2 therein. (Hint: Correct the error in the problem formulation before solving the problem. This error is not yet mentioned in the bug list of the book.)

3. Show that the computation of the whole-number logarithm of a positive integer n fitting into a machine word is an AC⁰ instruction, i.e., it is computable by an unbounded-fanin circuit of constant depth and polynomial size, where "polynomial" means polynomial in the word length. (Hint: Chapter 29 of the book [Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, The MIT Press, Cambridge (1990)] might be of some help here.)

4. Read Chapter 30 of the book [Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, The MIT Press, Cambridge (1990)] and solve Problem 30-1 therein.

9. Lecture: Utilizing instruction parallelism

Beforehand reading

Margaret Reid-Miller, List ranking and list scan on the Cray C90, Journal of the Computer and System Sciences 53 (1996), 344356

Home exercises

1. Show how the contents of two registers can be swapped without using any other registers. Prove also that your solution is correct.

2. Analyse the pure C cost of the trivial serial routine for list ranking. Do the same for the serial version of the parallel pointer-jumping routine. Test also experimentally how does the order of the list elements effect on the performance of the serial routine when the size of the input is larger than that of the cache of your computer.

3. Consider the block interchange problem studied in Column 2 of [Jon Bentley, Programming Pearls, Addison-Wesley Publishing Company, Reading (1986)]. In this problem we are given an array consisting of two consecutive blocks AB and the task is to reorder the elements of the array to the form BA. The standard in-place solution for this problem is first reverse A to get A^RB, reverse B to get A^RB^R, and then reverse the whole thing to get (A^RB^R)^R, which is exactly BA. Two other methods were described in Bentley's Column (see his solution to Exercise 4). How many page transfers are performed by the reversal method when the program is executed in a paged environment? Can you devise a method that would be more efficient with respect to the number of page transfers carried out?

4. Write a short evaluation of this course. What was good? What was not so good? What was missing altogether?

10. Lecture: Utilizing computer parallelism

Beforehand reading

David Culler, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Brent Chun, Steve Lumetta, Alan Mainwaring, Richard Martin, Chad Yoshikawa, Frederick Wong, Parallel computing on the Berkeley NOW, Presented at the 9th Joint Symposium on Parallel Processing, Kobe, Japan (1997)

Programming exercise II

Do A or B, and return your answer electronically to jyrki@di.ku.dk in a format I can read (LaTeX, PostScript, rtf, html, plain text). The deadline for this exercise is 29 November 1998 at 20.00.

Jesper Bojesen has created a library of heaps. In this exercise your task is to extend the library. When solving the exercise, you are welcome to use the source codes available in the Worldwide Web (provided that the copyright allows that). Jesper's programs are available at DIKU in directory /vol/www/forskningpPerformance-engineering/Jesper/heaplab/. In particular, note how Jesper has used the tools make, LaTeX and gnuplot, and the shell scripts. By imitating him you can probably increase your productivity in future programming projects. Other relevant resources are the STL heap implementation, the LEDA priority-queue implementations, and LaMarca's priority-queue implementations.

A. Compare the performance of the following four heap construction algorithms:

Williams' original algorithm where elements are inserted into the heap one at the time [J. W. J. Williams, Algorithm 232: HEAPSORT, Communications of the ACM 7 (1964), 347--348].
Floyd's improved algorithm that requires only a linear number of operations [Robert W. Floyd, Algorithm 245: TREESORT 3, Communications of the ACM 7 (1964), 701].
The modification of Floyd's algorithm proposed by Jesper in Listing 2.5 of his report.
The heap-construction algorithm given in Theorem 8 of [R. Fadel, K.V. Jakobsen, J. Katajainen, J. Teuhola, Heaps and heapsort on secondary storage, Theoretical Computer Science, to appear] that is optimal in a paged environment.

B. Implement (in C++) any priority-queue structure mentioned in any of the articles in my literature list. The data structure implemented should be STL compatible and dynamic, i.e., it should use \Theta(n) space for storing n elements with their associated priorities. In your report you should briefly describe the implemented data structure and compare its performance to the priority queue provided in the STL library and the dynamic heap structure described in Chapter 4 of Jesper's report.

PERFORMANCE ENGINEERING

Contents

Meetings

Credit points

København

Lund

1. Lecture: Introduction

Home exercises

2. Lecture: Programming style

Beforehand reading

Home exercises

3. Lecture: Software tools

Beforehand reading

Special guest star

Home exercises

4. Lecture: Reducing instruction count

Beforehand reading

Home exercises

5. Lecture: Utilizing word parallelism

Beforehand reading

Special guest star

Home exercises

Programming exercise I

Extra Lecture: Learning C in four hours

Beforehand reading

Voluntary home exercises

6. Lecture: Improving paging performance

Beforehand reading

Home exercises

7. Lecture: Improving cache behaviour

Beforehand reading

Special guest star

Home exercises

8. Lecture: Utilizing processor parallelism

Beforehand reading

Home exercises

9. Lecture: Utilizing instruction parallelism

Beforehand reading

Home exercises

10. Lecture: Utilizing computer parallelism

Beforehand reading

Programming exercise II