Date: 05 Jun 2014
From: jyrki@di.ku.dk
Subject: Lunch talk by Jesper

Lunch talk: Implementing a classic: zero-copy all-to-all communication 
with MPI datatypes
Speaker: Jesper Larsson Träff, TU Vienna
Time: Friday, 6 June 2014 at 12.30 - 13.00
Place: Room 3-1-25 (Universitetsparken 1)

Abstract:

We investigate the use of the derived datatype mechanism of MPI (the
Message-Passing Interface) in the implementation of the classic
all-to-all communication algorithm of Bruck et al. (1997).  Through a
series of improvements to the canonical implementation of the
algorithm, we gradually eliminate the initial and final process-local
data reorganizations, culminating in a zero-copy version that contains
no explicit, process-local data movement or copy operations: all
necessary data movements are implied by MPI derived datatypes, and
carried out as part of the communication operations.  We furthermore
show how the improved algorithm can be used to solve irregular
all-to-all communication problems (that are not too irregular). The
Bruck algorithm serves as a vehicle to demonstrate the descriptive and
performance advantages of MPI datatypes in the implementation of
complex algorithms, and to discuss shortcomings and inconveniences in the
current MPI datatype mechanism. In particular, we use and implement
three new derived datatypes (bounded vector, circular vector, and
bucket) that are not in MPI and might be useful in other contexts. We
also discuss the role of persistent collectives, which are currently
not found in MPI, in amortizing type creation (and other) overheads,
and implement a persistent variant of the MPI_Alltoall collective.
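
For readers unfamiliar with the algorithm, the following is a minimal
sketch of the canonical three-phase Bruck et al. scheme that the talk
takes as its starting point, simulated inside a single Python process
rather than with MPI. It is not the speaker's code: the function name
bruck_alltoall and the list-of-lists representation of the per-process
buffers are assumptions made purely for illustration. Phases 1 and 3
are the explicit process-local copies that the abstract describes as
being eliminated with derived datatypes.

```python
def bruck_alltoall(send):
    """Simulate the canonical Bruck et al. all-to-all for p = len(send).

    send[i][j] is the block that process i wants to deliver to process j.
    Returns (recv, rounds) where recv[i][j] == send[j][i] and rounds is
    the number of communication rounds that were simulated.
    """
    p = len(send)

    # Phase 1 (local copy): process i rotates its blocks so that
    # position j holds the block that must travel a distance of j
    # processes, i.e. the block destined for process (i + j) % p.
    buf = [[send[i][(i + j) % p] for j in range(p)] for i in range(p)]

    # Phase 2 (communication): in the round with distance k = 1, 2, 4, ...
    # every process sends the blocks whose position j has bit k set to
    # process (i + k) % p and receives the same positions from
    # process (i - k) % p.
    rounds = 0
    k = 1
    while k < p:
        # Snapshot what every process sends before anyone overwrites
        # its buffer, since all sends in a round happen simultaneously.
        outgoing = [
            {j: buf[i][j] for j in range(p) if j & k} for i in range(p)
        ]
        for i in range(p):
            source = (i - k) % p
            for j, block in outgoing[source].items():
                buf[i][j] = block
        k <<= 1
        rounds += 1

    # Phase 3 (local copy): position j on process i now holds the block
    # that originated at process (i - j) % p, so undo the rotation.
    recv = [[None] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            recv[i][(i - j) % p] = buf[i][j]
    return recv, rounds
```

With p processes the exchange completes in ceil(log2 p) communication
rounds, at the price of forwarding each block through up to that many
intermediate processes.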

On two small systems we experimentally compare the algorithmic
improvements to the Bruck et al. algorithm when implemented on top of
MPI, showing the zero-copy version to perform significantly better
than the initial, straightforward implementation. One of our variants
has also been implemented inside mvapich, and we show it to perform
better than the mvapich implementation of the Bruck et al. algorithm
for the range of processes and problem sizes where it is enabled. The
persistent version of MPI_Alltoall has no overhead and outperforms all
other variants, and in particular improves upon the standard
implementation by 50% to 15% across the full range of problem sizes
considered.
(Joint work with Rougier and Hunold.)

Jesper's home page: http://www.par.tuwien.ac.at/~traff/

PE-lab's home page: http://www.diku.dk/forskning/performance-engineering/