MPI calculations with Cerberus

Cerberus can be used to run certain solvers MPI-parallelized across multiple computing nodes.

Required installation

Cerberus uses the mpi4py Python package for MPI functionality. It can be installed with pip.

Notice, however, that mpi4py is compiled against whichever MPI libraries are available during the pip installation, which means that it is best to run the pip installation in an environment where the correct MPI libraries have been loaded.
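
A minimal sketch of such an installation on a cluster that uses environment modules (the module name follows the example later on this page; use whichever module provides the MPI that your solver modules were built with):

module load mpi/openmpi-x86_64
pip install mpi4py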

Running calculations

The MPI libraries used to build both the solver modules and the mpi4py package for Cerberus need to be loaded.

Any Solvers that have -mpi or --mpi among their command line arguments will be spawned with MPI.Comm.Spawn instead of being started as a subprocess.
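
For reference, spawning through mpi4py looks roughly like the following sketch (the solver executable name, its arguments and the task count are placeholder values, not the actual Cerberus internals):

from mpi4py import MPI

# Spawn four MPI tasks of the solver; Spawn returns an intercommunicator
# that connects the spawning side to the spawned solver tasks.
comm = MPI.COMM_SELF.Spawn('./solver', args=['--mpi'], maxprocs=4)

# ... exchange data with the solver through comm ...

comm.Disconnect()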

Cerberus can then be run with a setup similar to the one shown below:

#!/bin/bash
#SBATCH --cpus-per-task=20
#SBATCH --ntasks=4
#SBATCH --partition=core40
#SBATCH -o output.txt

module load mpi/openmpi-x86_64

mpirun --report-bindings -np 4 --bind-to none -oversubscribe python input.py > cerberus_output.txt

The idea here is the following:

  • Cerberus itself does not use any resources while a Solver is running, so we can -oversubscribe and let the MPI-parallelized Solver also use the resources allocated to Cerberus.
  • To use -oversubscribe successfully, the binding of tasks to sockets or cores needs to be disabled with --bind-to none.
  • Only one Cerberus task communicates with the Solver, and it communicates with only one Solver task.
    • All but one Cerberus task exit upon loading the Cerberus package at cerberus.__init__.py (see the sketch after this list).
    • It can be somewhat tricky to get the remaining Cerberus task and the communicating Solver task (task 0 with Serpent and SuperFINIX) on the same node.
    • --report-bindings with OpenMPI helps the user to see which node each task is spawned on.
    • The rank of the task that continues at cerberus.__init__.py can be adjusted if needed.
    • In the future, a separate --cerberus-host <hostname> argument may be added to Solvers to allow them to connect to Cerberus across nodes.
      • Socket communication across nodes is significantly slower than on the same node.
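
The rank filtering mentioned in the list above can be illustrated with a short sketch (an illustration of the mechanism, not the actual cerberus.__init__.py source; CERBERUS_RANK is a hypothetical name for the adjustable rank):

import sys
from mpi4py import MPI

# Hypothetical constant naming the rank that continues running Cerberus.
CERBERUS_RANK = 0

# All other ranks stop here when the package is imported.
if MPI.COMM_WORLD.Get_rank() != CERBERUS_RANK:
    sys.exit(0)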