
Collective Communications Routines

Collective communication is defined as communication that involves a group or groups of processes. One of the key arguments in a call to a collective routine is a communicator that defines the group or groups of participating processes and provides a context for the operation. All processes in the group identified by the intracommunicator must call the collective routine.

The syntax and semantics of the collective operations are defined to be consistent with the syntax and semantics of the point-to-point operations. Thus, general datatypes are allowed and must match between sending and receiving processes. Several collective routines such as broadcast and gather have a single originating or receiving process. Such a process is called the root. Some arguments in the collective functions are specified as significant only at root, and are ignored for all participants except the root.

To understand how collective operations apply to intercommunicators, it is possible to view the MPI intracommunicator collective operations as fitting one of the following categories:

  • All-To-One, such as gathering (see Figure) or reducing (see Figure) data onto one process.
  • One-To-All, such as broadcasting (see Figure) data to all processes in a group.
  • All-To-All, such as executing a collective operation with every process in the group acting as a root (e.g., allgather).
  • Other

[Figure: MPI_COLLECTIVE]

In the following, the MPI collective communication routines are described by example.

A fundamental collective operation is the explicit synchronization of the processes in a group.

MPI_BARRIER(comm) If comm is an intracommunicator, MPI_BARRIER blocks the caller until all group members have called it. The call returns at any process only after all group members have entered the call.

[Figure: MPI_BARRIER]

int MPI_Barrier(MPI_Comm comm)
  • IN comm, communicator (handle)

The following example uses 8 processes.

[Interactive example: MPI BARRIER]
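
A minimal sketch of such a barrier program, assuming it is launched with 8 processes (the output order before the barrier is not deterministic):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Work done before the barrier may finish at different times on each rank. */
    printf("Process %d of %d reached the barrier\n", rank, size);

    /* No process continues past this point until all of them have called MPI_Barrier. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Process %d passed the barrier\n", rank);

    MPI_Finalize();
    return 0;
}
```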

Broadcast

MPI_BCAST(buffer, count, datatype, root, comm) If comm is an intracommunicator, MPI_BCAST broadcasts a message from the process with rank root to all processes of the group, itself included. It is called by all members of the group using the same arguments for comm and root. On return, the content of root's buffer is copied to all other processes.

[Figure: MPI_BCAST]

int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
  • INOUT buffer, starting address of buffer (choice)
  • IN count, number of entries in buffer (non-negative integer)
  • IN datatype, data type of buffer (handle)
  • IN root, rank of broadcast root (integer)
  • IN comm, communicator (handle)

The following example uses 8 processes.

[Interactive example: MPI BCAST]
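
A minimal sketch of a broadcast program; the small integer buffer and the choice of rank 0 as root are assumptions for illustration:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    int data[4] = {0, 0, 0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only the root fills the buffer; every rank passes the same count, type, and root. */
    if (rank == 0) {
        for (int i = 0; i < 4; i++)
            data[i] = i + 1;
    }

    /* After the call, every process holds a copy of root's buffer. */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d received: %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);

    MPI_Finalize();
    return 0;
}
```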

Why should we use collective operations for group communication?

MPI collective operations use optimized algorithms to carry out the communication among the processes in a group. For instance, the broadcast operation typically exploits a tree structure (as depicted in the Figure), which allows the individual transfers to proceed in parallel.

[Figure: MPI_BCAST tree]

Naturally, the benefit of this optimization grows with the number of processes involved in the communication. The following example compares the MPI_BCAST operation with an equivalent version built from MPI_Send and MPI_Recv. If you cannot see the advantage of using the broadcast operation, run this experiment with more processes.

[Interactive example: MPI BCAST COMPARE]
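
The sketch below illustrates the kind of comparison intended here: a hand-rolled broadcast built from MPI_Send/MPI_Recv timed against MPI_Bcast. The message size N and the barrier-based timing scheme are assumptions made for illustration:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000   /* number of ints per broadcast; an arbitrary size for illustration */

/* Naive broadcast: the root sends the whole buffer to every other rank, one by one. */
static void naive_bcast(int *buf, int count, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (int dest = 0; dest < size; dest++)
            if (dest != root)
                MPI_Send(buf, count, MPI_INT, dest, 0, comm);
    } else {
        MPI_Recv(buf, count, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char *argv[])
{
    int rank;
    double t0, t_naive, t_bcast;
    int *buf = malloc(N * sizeof(int));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        for (int i = 0; i < N; i++)
            buf[i] = i;

    /* Time the hand-rolled send/recv version. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    naive_bcast(buf, N, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    t_naive = MPI_Wtime() - t0;

    /* Time the collective version. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Bcast(buf, N, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    t_bcast = MPI_Wtime() - t0;

    if (rank == 0)
        printf("naive send/recv: %f s, MPI_Bcast: %f s\n", t_naive, t_bcast);

    free(buf);
    MPI_Finalize();
    return 0;
}
```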

Gather

MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) If comm is an intracommunicator, each process (root process included) sends the contents of its send buffer to the root process. The root process receives the messages and stores them in rank order. General, derived datatypes are allowed for both sendtype and recvtype. The type signature of sendcount, sendtype on each process must be equal to the type signature of recvcount, recvtype at the root. This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed.

All arguments to the function are significant on process root, while on other processes, only arguments sendbuf, sendcount, sendtype, root, and comm are significant. The arguments root and comm must have identical values on all processes. Note that the recvcount argument at the root indicates the number of items it receives from each process, not the total number of items it receives.

[Figure: MPI_GATHER]

int MPI_Gather(const void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
  • IN sendbuf, starting address of send buffer (choice)
  • IN sendcount, number of elements in send buffer (non-negative integer)
  • IN sendtype, data type of send buffer elements (handle)
  • OUT recvbuf, address of receive buffer (choice, significant only at root)
  • IN recvcount, number of elements for any single receive (non-negative integer, significant only at root)
  • IN recvtype, data type of recv buffer elements (significant only at root) (handle)
  • IN root, rank of receiving process (integer)
  • IN comm, communicator (handle)

The following example uses 3 processes.

[Interactive example: MPI GATHER]
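
A minimal sketch of a gather, where every process (root included) contributes its own rank and rank 0 collects the values in rank order; the fixed receive-buffer size is an assumption for illustration:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes one value: its own rank. */
    int sendval = rank;

    /* Only the root needs a receive buffer large enough for one value per rank. */
    int recvbuf[8] = {0};   /* assumes at most 8 processes for this sketch */

    /* recvcount is the number of items received from EACH process, not the total. */
    MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Root gathered:");
        for (int i = 0; i < size; i++)
            printf(" %d", recvbuf[i]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}
```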

MPI_GATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm) extends the functionality of MPI_GATHER by allowing a varying count of data from each process, since recvcounts is now an array. It also allows more flexibility as to where the data is placed on the root, by providing the new argument, displs. The data received from process j is placed into recvbuf of the root process beginning at offset displs[j] elements (in terms of the recvtype). The receive buffer is ignored for all non-root processes.

[Figure: MPI_GATHERV]

int MPI_Gatherv(const void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, const int recvcounts[], const int displs[], MPI_Datatype recvtype, int root, MPI_Comm comm)
  • IN sendbuf, starting address of send buffer (choice)
  • IN sendcount, number of elements in send buffer (non-negative integer)
  • IN sendtype, data type of send buffer elements (handle)
  • OUT recvbuf, address of receive buffer (choice, significant only at root)
  • IN recvcounts, non-negative integer array (of length group size) containing the number of elements that are received from each process (significant only at root)
  • IN displs, integer array (of length group size). Entry i specifies the displacement relative to recvbuf at which to place the incoming data from process i (significant only at root)
  • IN recvtype, data type of recv buffer elements (significant only at root) (handle)
  • IN root, rank of receiving process (integer)
  • IN comm, communicator (handle)

The following example uses 4 processes.

[Interactive example: MPI GATHERV]
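
A minimal sketch of a gatherv, where process i contributes i+1 elements and the root builds recvcounts and displs accordingly; the NPROCS_MAX bound on the group size is an assumption for illustration:

```c
#include <mpi.h>
#include <stdio.h>

#define NPROCS_MAX 16   /* upper bound on the group size assumed for this sketch */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Process i contributes i+1 values, so the contributions have different sizes. */
    int sendcount = rank + 1;
    int sendbuf[NPROCS_MAX];
    for (int i = 0; i < sendcount; i++)
        sendbuf[i] = rank;

    /* Only the root fills recvcounts and displs, one entry per process. */
    int recvcounts[NPROCS_MAX], displs[NPROCS_MAX];
    int recvbuf[NPROCS_MAX * NPROCS_MAX];
    if (rank == 0) {
        int offset = 0;
        for (int i = 0; i < size; i++) {
            recvcounts[i] = i + 1;     /* how many items come from rank i      */
            displs[i] = offset;        /* where rank i's data lands in recvbuf */
            offset += recvcounts[i];
        }
    }

    MPI_Gatherv(sendbuf, sendcount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int total = size * (size + 1) / 2;
        printf("Root gathered:");
        for (int i = 0; i < total; i++)
            printf(" %d", recvbuf[i]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}
```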

Scatter

MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) takes an array of elements and distributes the elements in the order of process rank. An alternative description is that the root sends a message with MPI_Send(sendbuf, sendcount x n, sendtype, ...). This message is split into n equal segments, the i-th segment is sent to the i-th process in the group, and each process receives this message. The send buffer is ignored for all non-root processes.

[Figure: MPI_SCATTER]

int MPI_Scatter(const void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
  • IN sendbuf, address of send buffer (choice, significant only at root)
  • IN sendcount, number of elements sent to each process (non-negative integer, significant only at root)
  • IN sendtype, data type of send buffer elements (significant only at root) (handle)
  • OUT recvbuf, address of receive buffer (choice)
  • IN recvcount, number of elements in receive buffer (non-negative integer)
  • IN recvtype, data type of receive buffer elements (handle)
  • IN root, rank of sending process (integer)
  • IN comm, communicator (handle)

The following example uses 3 processes.

[Interactive example: MPI SCATTER]
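
A minimal sketch of a scatter, where the root distributes one value to each process in rank order; the NPROCS_MAX bound and the sample values are assumptions for illustration:

```c
#include <mpi.h>
#include <stdio.h>

#define NPROCS_MAX 16   /* upper bound on the group size assumed for this sketch */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Only the root fills the send buffer: one value per process, in rank order. */
    int sendbuf[NPROCS_MAX];
    if (rank == 0)
        for (int i = 0; i < size; i++)
            sendbuf[i] = 100 + i;

    /* Each process, root included, receives exactly one element. */
    int recvval;
    MPI_Scatter(sendbuf, 1, MPI_INT, &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}
```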

MPI_SCATTERV(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm) is the inverse operation to MPI_GATHERV. MPI_SCATTERV extends the functionality of MPI_SCATTER by allowing a varying count of data to be sent to each process, since sendcounts is now an array. It also allows more flexibility as to where the data is taken from on the root, by providing an additional argument, displs.

int MPI_Scatterv(const void* sendbuf, const int sendcounts[], const int displs[], MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
  • IN sendbuf, address of send buffer (choice, significant only at root)
  • IN sendcounts, non-negative integer array (of length group size) specifying the number of elements to send to each rank
  • IN displs, integer array (of length group size). Entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i
  • IN sendtype, data type of send buffer elements (handle)
  • OUT recvbuf, address of receive buffer (choice)
  • IN recvcount, number of elements in receive buffer (non-negative integer)
  • IN recvtype, data type of receive buffer elements (handle)
  • IN root, rank of sending process (integer)
  • IN comm, communicator (handle)

The following example uses 10 processes.

[Interactive example: MPI SCATTERV]
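
A minimal sketch of a scatterv, the inverse of the gatherv sketch above: the root sends i+1 elements to process i, using sendcounts and displs to describe where each segment starts; the NPROCS_MAX bound is again an assumption for illustration:

```c
#include <mpi.h>
#include <stdio.h>

#define NPROCS_MAX 16   /* upper bound on the group size assumed for this sketch */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Only the root prepares the data and the per-rank counts and offsets. */
    int sendbuf[NPROCS_MAX * NPROCS_MAX];
    int sendcounts[NPROCS_MAX], displs[NPROCS_MAX];
    if (rank == 0) {
        int offset = 0;
        for (int i = 0; i < size; i++) {
            sendcounts[i] = i + 1;     /* rank i gets i+1 elements          */
            displs[i] = offset;        /* taken from sendbuf starting here  */
            for (int j = 0; j < sendcounts[i]; j++)
                sendbuf[offset + j] = i;
            offset += sendcounts[i];
        }
    }

    /* Every rank must know how much it will receive. */
    int recvcount = rank + 1;
    int recvbuf[NPROCS_MAX];

    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
                 recvbuf, recvcount, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d received %d element(s), first value %d\n",
           rank, recvcount, recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```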

Other collective operations

  • MPI_ALLGATHER, MPI_ALLGATHERV: A variation on Gather where all members of a group receive the result.
  • MPI_ALLTOALL, MPI_ALLTOALLV: Scatter/Gather data from all members to all members of a group (also called complete exchange).
  • MPI_ALLREDUCE, MPI_REDUCE: Global reduction operations such as sum, max, min, or user-defined functions, where the result is returned to all members of the group (MPI_ALLREDUCE) or to a single member only (MPI_REDUCE); a minimal MPI_ALLREDUCE sketch follows this list.
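
As an illustration of the reduction routines, a minimal MPI_Allreduce sketch in which every process contributes its rank and all processes receive the sum:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process contributes its rank; MPI_SUM combines the contributions. */
    int local = rank;
    int global_sum = 0;

    /* Unlike MPI_Reduce, every process gets the result, so there is no root argument. */
    MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Process %d: sum of all ranks is %d\n", rank, global_sum);

    MPI_Finalize();
    return 0;
}
```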

Nonblocking Collective Communication

As described in Nonblocking Communication, performance of many applications can be improved by overlapping communication and computation, and many systems enable this. Nonblocking collective operations combine the potential benefits of nonblocking point-to-point operations, to exploit overlap and to avoid synchronization, with the optimized implementation and message scheduling provided by collective operations. One way of doing this would be to perform a blocking collective operation in a separate thread. An alternative mechanism that often leads to better performance (e.g., avoids context switching, scheduler overheads, and thread management) is to use nonblocking collective communication.

The nonblocking collective communication model is similar to the model used for nonblocking point-to-point communication. A nonblocking call initiates a collective operation, which must be completed in a separate completion call. Once initiated, the operation may progress independently of any computation or other communication at participating processes. In this manner, nonblocking collective operations can mitigate possible synchronizing effects of collective operations by running them in the "background". In addition to enabling communication-computation overlap, nonblocking collective operations can perform collective operations on overlapping communicators, which would lead to deadlocks with blocking operations. Their semantic advantages can also be useful in combination with point-to-point communication.

As in the nonblocking point-to-point case, all calls are local and return immediately, irrespective of the status of other processes. The call initiates the operation, which indicates that the system may start to copy data out of the send buffer and into the receive buffer. Once initiated, all associated send buffers and buffers associated with input arguments (such as arrays of counts, displacements, or datatypes in the vector versions of the collectives) should not be modified, and all associated receive buffers should not be accessed, until the collective operation completes. The call returns a request handle, which must be passed to a completion call.

All completion calls (e.g., MPI_WAIT) are supported for nonblocking collective operations. Similarly to the blocking case, nonblocking collective operations are considered to be complete when the local part of the operation is finished, i.e., for the caller, the semantics of the operation are guaranteed and all buffers can be safely accessed and modified. Completion does not indicate that other processes have completed or even started the operation (unless otherwise implied by the description of the operation). Completion of a particular nonblocking collective operation also does not indicate completion of any other posted nonblocking collective (or send-receive) operations, whether they are posted before or after the completed operation.

These operations are named by adding the letter I (for immediate) to the name of the corresponding blocking operation. For instance, the nonblocking broadcast operation is named MPI_Ibcast.
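
A minimal sketch of a nonblocking broadcast that overlaps the communication with some independent local computation; the amount and kind of overlapped work are arbitrary choices for illustration:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    int data[4] = {0, 0, 0, 0};
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        for (int i = 0; i < 4; i++)
            data[i] = i + 1;

    /* Initiate the broadcast; the call returns immediately with a request handle. */
    MPI_Ibcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD, &request);

    /* Independent work can overlap the broadcast, as long as it does not
       touch the buffer passed to MPI_Ibcast. */
    double local_work = 0.0;
    for (int i = 1; i <= 1000000; i++)
        local_work += 1.0 / i;

    /* Complete the collective; only now is it safe to read the buffer. */
    MPI_Wait(&request, MPI_STATUS_IGNORE);

    printf("Process %d: data[3] = %d, local work = %f\n", rank, data[3], local_work);

    MPI_Finalize();
    return 0;
}
```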
