Single Program Multiple Data (SPMD) parallel processing is most often implemented using the Message Passing Interface (MPI), a C library with a Fortran binding. MPI sits on top of a Fortran "node compiler" that is unaware the program is SPMD, and images communicate by passing messages via library calls on both the sending and the receiving image. In contrast, Co-Array Fortran implements SPMD by adding language elements, e.g. co-arrays, to Fortran, and images communicate via direct memory-to-memory copies of co-arrays.
There is a significant speed difference between message passing and direct memory copies on machines with global hardware memory. The following plot shows the wall time for a halo exchange operation, which is often required by domain-decomposition SPMD codes. Each image exchanges a buffer of values with its four nearest neighbors (in a 2-D mesh topology). Halo exchange can be thought of as a low-level communication benchmark that is more realistic than the often-used ping-pong test. Wall times for small halo lengths measure communication latency, and wall times for large halo lengths measure communication bandwidth.
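The shape of such a curve can be understood with the standard latency-bandwidth (alpha-beta) cost model. The sketch below uses purely illustrative numbers (the latency and bandwidth values are assumptions, not measurements from the plot) to show why small halos measure latency and large halos measure bandwidth:

```python
# Alpha-beta cost model for a halo exchange (illustrative numbers only).
# time(n) = alpha + n/beta per message: alpha is the per-message latency,
# beta the bandwidth.  The constants below are assumptions, not measurements.
ALPHA = 5.0e-6        # per-message latency, seconds (assumed)
BETA = 5.0e7          # bandwidth, words per second (assumed)

def halo_time(n, neighbors=4):
    """Wall time to exchange an n-word buffer with each of `neighbors`
    nearest neighbors, one message per neighbor."""
    return neighbors * (ALPHA + n / BETA)

for n in (10, 100000):
    t = halo_time(n)
    share = 4 * ALPHA / t   # fraction of the wall time that is pure latency
    print(f"n={n:>6}: {t*1e6:9.1f} us, latency share {share:.0%}")
```

For a 10-word halo, essentially all of the wall time is latency; for a 100,000-word halo, essentially all of it is bandwidth. This is why a machine can win the small-halo end of the plot and lose the large-halo end, or vice versa.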
The results on several kinds of machines fall into three broad classes.
The Cray T3E has both the lowest latency and the highest bandwidth. This translates directly into better scalability to large numbers of nodes (up to 1,000) for a wide range of problems. However, the low latency is only available via Co-Array Fortran or SHMEM. In general, Co-Array Fortran provides a bigger "win" over alternative approaches as the number of images increases. There are relatively few 500+ node machines today, but they will be increasingly important in the future.
Direct memory access is also easier to use than message passing, because message passing requires both images to agree to communicate. For example, here is the simplest way to exchange an array with your north and south neighbors in MPI:
      COMMON/XCTILB4/ B(N,4)
      SAVE  /XCTILB4/
      CALL MPI_SEND(B(1,1),N,MPI_REAL,IMG_N, 9905,
     +              MPI_COMM_WORLD,          MPIERR)
      CALL MPI_SEND(B(1,2),N,MPI_REAL,IMG_S, 9906,
     +              MPI_COMM_WORLD,          MPIERR)
      CALL MPI_RECV(B(1,3),N,MPI_REAL,IMG_S, 9905,
     +              MPI_COMM_WORLD, MPISTAT, MPIERR)
      CALL MPI_RECV(B(1,4),N,MPI_REAL,IMG_N, 9906,
     +              MPI_COMM_WORLD, MPISTAT, MPIERR)

Each image first sends its buffers and then waits for the corresponding buffers to arrive from the neighboring images. This version works only because of internal buffering within the MPI library (which need not be available). Each call blocks until it is safe to reuse B, so without internal MPI buffering every image would block forever on its first send, waiting for its northern neighbor to acknowledge receipt of the message.
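The deadlock risk can be made concrete with a small simulation. The sketch below (Python, not MPI; the four-image ring and the schedules are illustrative) models sends as synchronous, i.e. with no internal buffering a send completes only when the partner's matching receive is the partner's current operation. With everyone sending first no receive is ever posted; reordering the calls by parity, a standard workaround, lets the exchange complete:

```python
# Illustrative model of synchronous (unbuffered) send/recv matching.
# schedules[i] is image i's ordered list of ('send'|'recv', partner) calls.
def completes(schedules):
    """Return True if every image can finish its call sequence when a send
    completes only together with the partner's matching current recv."""
    pos = [0] * len(schedules)
    progressed = True
    while progressed:
        progressed = False
        for i, sched in enumerate(schedules):
            if pos[i] >= len(sched):
                continue
            kind, j = sched[pos[i]]
            if kind == 'send' and pos[j] < len(schedules[j]):
                pkind, pi = schedules[j][pos[j]]
                if pkind == 'recv' and pi == i:
                    pos[i] += 1      # send and matching recv complete together
                    pos[j] += 1
                    progressed = True
    return all(pos[i] >= len(s) for i, s in enumerate(schedules))

P = 4                                # images on a 1-D periodic ring
north = lambda i: (i + 1) % P
south = lambda i: (i - 1) % P

# Everyone sends first (the pattern above): deadlocks without buffering.
naive = [[('send', north(i)), ('send', south(i)),
          ('recv', south(i)), ('recv', north(i))] for i in range(P)]

# Parity-ordered exchange: even images send while odd images receive.
ordered = [[('send', north(i)), ('recv', south(i)),
            ('send', south(i)), ('recv', north(i))] if i % 2 == 0 else
           [('recv', south(i)), ('send', north(i)),
            ('recv', north(i)), ('send', south(i))] for i in range(P)]

print(completes(naive))    # False: every image is stuck in its first send
print(completes(ordered))  # True: every send meets a posted receive
```

The parity ordering works here, but choosing and maintaining such orderings by hand is exactly the bookkeeping that the persistent-request version below, and Co-Array Fortran, avoid.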
MPI is a large library with many options, and the fastest way to perform a neighbor exchange varies from machine to machine. However, the following is rarely slow and is often the fastest MPI approach:
      COMMON/XCTILB4/ B(N,4)
      SAVE  /XCTILB4/, NFIRST
      DATA NFIRST / 0 /
C
      IF (NFIRST.EQ.0) THEN
C
C       PERSISTENT COMMUNICATION REQUESTS
C
        NFIRST = 1
        CALL MPI_SEND_INIT(
     +       B(1,1),N,MPI_REAL,IMG_N, 9905,
     +       MPI_COMM_WORLD, MPIREQ(1), MPIERR)
        CALL MPI_SEND_INIT(
     +       B(1,2),N,MPI_REAL,IMG_S, 9906,
     +       MPI_COMM_WORLD, MPIREQ(2), MPIERR)
        CALL MPI_RECV_INIT(
     +       B(1,3),N,MPI_REAL,IMG_S, 9905,
     +       MPI_COMM_WORLD, MPIREQ(3), MPIERR)
        CALL MPI_RECV_INIT(
     +       B(1,4),N,MPI_REAL,IMG_N, 9906,
     +       MPI_COMM_WORLD, MPIREQ(4), MPIERR)
      ENDIF
C
      CALL MPI_STARTALL(4, MPIREQ,          MPIERR)
      CALL MPI_WAITALL( 4, MPIREQ, MPISTAT, MPIERR)

On the first pass we create persistent non-blocking communication requests with the appropriate pattern. On every pass, we start all four requests at once and then wait for all four to complete. This requires that B be in COMMON, both because persistent requests must reference the same user-level buffers every time they are used and because non-blocking calls are otherwise not safe in Fortran (see below).
In Co-Array Fortran, this is simply:
      COMMON/XCTILB4/ B(N,4)[*]
      SAVE  /XCTILB4/
C
      CALL SYNC_ALL( WAIT=(/IMG_S,IMG_N/) )
      B(:,3) = B(:,1)[IMG_S]
      B(:,4) = B(:,2)[IMG_N]
      CALL SYNC_ALL( WAIT=(/IMG_S,IMG_N/) )

The first SYNC_ALL waits until the remote B(:,1:2) is ready to be copied, and the second waits until it is safe to overwrite the local B(:,1:2). Only nearest neighbors are involved in the sync.
MPI is a C library with a Fortran binding. This means that MPI routines do not have explicit interfaces, so copy-in/copy-out is almost certain to occur when an array section is passed through the argument list. For example:
      INTEGER IB(100)
      IB(1:100) = IMG
      IF (MOD(IMG,2).EQ.0) THEN
        CALL MPI_ISEND(IB(2:100:2),50,MPI_INTEGER,IMG+1, 9901,
     +                 MPI_COMM_WORLD, MPIREQ, MPIERR)
      ELSE
        CALL MPI_IRECV(IB(1:100:2),50,MPI_INTEGER,IMG-1, 9901,
     +                 MPI_COMM_WORLD, MPIREQ, MPIERR)
      ENDIF
      CALL MPI_WAIT(MPIREQ, MPISTAT, MPIERR)
      WRITE(6,*) 'IMG,IB(1:2) = ',IMG,IB(1:2)

We are sending the even elements of IB on even images to the odd elements of IB on odd images. Therefore, IB(1) should be IMG-1 on odd images, and this would always be the case if blocking MPI calls were used. However, here we are using non-blocking calls, and because copy-in/copy-out is required the above is equivalent to:
      INTEGER IB(100),ITMP(50)
      IB(1:100) = IMG
      IF (MOD(IMG,2).EQ.0) THEN
        ITMP(1:50) = IB(2:100:2)
        CALL MPI_ISEND(ITMP,50,MPI_INTEGER,IMG+1, 9901,
     +                 MPI_COMM_WORLD, MPIREQ, MPIERR)
        IB(2:100:2) = ITMP(1:50)
      ELSE
        ITMP(1:50) = IB(1:100:2)
        CALL MPI_IRECV(ITMP,50,MPI_INTEGER,IMG-1, 9901,
     +                 MPI_COMM_WORLD, MPIREQ, MPIERR)
        IB(1:100:2) = ITMP(1:50)
      ENDIF
      CALL MPI_WAIT(MPIREQ, MPISTAT, MPIERR)
      WRITE(6,*) 'IMG,IB(1:2) = ',IMG,IB(1:2)

Since it is the array-section copy, ITMP, that is passed to MPI_IRECV, the contents of IB(1:100:2) depend on when the receive completes. It might happen to complete before the subsequent assignment back into IB, but it need not be complete until after MPI_WAIT returns. OpenMP Fortran can have similar problems if a section of a shared array is passed through the argument list to a subroutine that does not have an explicit interface. Co-Array Fortran is probably the only SPMD programming model that has no incompatibilities with Fortran 90/95.
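The hazard is easy to reproduce outside MPI. The following Python sketch (purely illustrative; `irecv` is a made-up stand-in, not a real MPI call) mimics the copy-in/copy-out sequence around a non-blocking receive: the "receive" eventually fills a temporary copy of the array section, but the copy-out back into the array happens before completion, so the received data is lost:

```python
import threading
import time

def irecv(buf, value, delay=0.1):
    """Illustrative stand-in for MPI_IRECV: a background thread fills `buf`
    in place once the 'message' arrives, `delay` seconds from now."""
    def worker():
        time.sleep(delay)
        for k in range(len(buf)):
            buf[k] = value
    req = threading.Thread(target=worker)
    req.start()
    return req

ib = [0] * 100

# What copy-in/copy-out does behind the programmer's back:
itmp = ib[0::2]            # copy-in: a temporary copy of the array section
req = irecv(itmp, 99)      # the non-blocking receive targets the temporary
ib[0::2] = itmp            # copy-out runs NOW, before the receive completes
req.join()                 # "MPI_WAIT": the data finally lands -- in itmp

print(ib[0], itmp[0])      # 0 99 -- the received data never reached ib
```

The temporary is discarded (here, simply never read again) after the copy-out, so the message contents vanish, which is exactly what happens when a compiler passes a copy of a strided section to MPI_IRECV.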
Array sections are always a problem for non-blocking calls, but even contiguous arrays can be a problem if they are not in COMMON. For example:
      INTEGER IB(10)
      IB(1:10) = IMG
      IF (MOD(IMG,2).EQ.0) THEN
        CALL MPI_ISEND(IB,10,MPI_INTEGER,IMG+1, 9901,
     +                 MPI_COMM_WORLD, MPIREQ, MPIERR)
      ELSE
        CALL MPI_IRECV(IB,10,MPI_INTEGER,IMG-1, 9901,
     +                 MPI_COMM_WORLD, MPIREQ, MPIERR)
      ENDIF
      CALL MPI_WAIT(MPIREQ, MPISTAT, MPIERR)
      II = IB(1)
      WRITE(6,*) 'IMG,IB(1) = ',IMG,II

Since IB is a local array, the compiler "knows" that it cannot be changed by the call to MPI_WAIT, so the assignment to II can be moved ahead of that call. This means that with some compilers the write on odd images might print IMG instead of IMG-1. If IB is in COMMON, the compiler cannot be sure that it is unchanged by the call to MPI_WAIT, and hence the assignment to II cannot be hoisted above that call. In general, it is safest to place MPI buffers in COMMON whenever possible.
For more details on MPI's incompatibilities with Fortran, see the MPI-2 standard document.
In summary, Co-Array Fortran is faster than MPI on machines with shared memory (and probably no slower on shared-nothing machines); it avoids the many incompatibilities between MPI and Fortran 95; and it provides a much clearer expression of communication operations than any set of library calls. MPI's primary advantages are that it is the best of the many available message-passing libraries; its many incompatibilities with Fortran can usually be worked around; and it is widely available.