Writing checkpoints & restarting
Large simulations often cannot be completed in a single run. For these cases, Entity can save a so-called "checkpoint" (essentially, a snapshot) of the simulation at a specific timestep, which can later be used to continue the simulation from where it left off.
Writing checkpoints
Checkpoint writing relies on the ADIOS2 library, so the code has to be compiled with the `-D output=ON` flag to enable checkpointing. Checkpoint writing is configured in the `.toml` input file under the `[checkpoint]` block (see also the input file documentation). The following parameters control how often a checkpoint is written and how many snapshots are preserved as the simulation runs.
parameter | description | default |
---|---|---|
`interval` | # of timesteps between checkpoints | `1000` |
`interval_time` | code-unit time between checkpoints (overrides `interval` unless `interval_time < 0`) | `-1.0` |
`keep` | # of checkpoints to keep (e.g., `2` will keep the latest and the one before it, removing older ones; `-1` = keep all, `0` = disable checkpoint writing) | `-1` |
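The interplay between `interval` and `interval_time` can be illustrated with a minimal Python sketch. This is not Entity's implementation: the function name `checkpoint_due`, the bookkeeping variable `last_checkpoint_time`, the choice to skip step 0, and the exact comparison used in the time-based mode are all assumptions made for illustration. Only the override rule itself comes from the table above.

```python
def checkpoint_due(step, time, last_checkpoint_time,
                   interval=1000, interval_time=-1.0):
    """Return True if a checkpoint would be written at this step.

    Illustrative only; the names and the exact comparisons are assumptions.
    """
    if interval_time < 0:
        # Step-based mode: one checkpoint every `interval` timesteps.
        # (Skipping step 0 is an assumption of this sketch.)
        return step > 0 and step % interval == 0
    # Time-based mode: `interval_time` overrides `interval`.
    # (Comparing against the time of the last checkpoint is an assumption.)
    return time - last_checkpoint_time >= interval_time
```

With the defaults, the step-based branch is active. Setting `interval_time` to a non-negative value switches to the time-based branch regardless of `interval`.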
Saving space
Since snapshots can become quite large, it is recommended to set the `keep` parameter to a small value to save storage space. It is nevertheless useful to have at least one backup checkpoint (i.e., `keep >= 2`), as the simulation may crash (e.g., by hitting a time limit on a cluster) while a checkpoint is being written, in which case the latest checkpoint might be corrupted.
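The effect of `keep` on what remains on disk can be sketched as follows. This is a hedged illustration of the semantics described above, not Entity's code: the function `retained_checkpoints` and the representation of checkpoints as a list of timestep numbers are hypothetical.

```python
def retained_checkpoints(written_steps, keep=-1):
    """Return the checkpoint steps that would remain on disk.

    Illustrative only; checkpoints are represented by their timestep.
    """
    ordered = sorted(written_steps)
    if keep < 0:
        # keep = -1: nothing is ever removed.
        return ordered
    if keep == 0:
        # keep = 0: checkpoint writing is disabled, so nothing is on disk.
        # (Handled explicitly, since ordered[-0:] would return everything.)
        return []
    # keep = n > 0: only the n most recent checkpoints survive.
    return ordered[-keep:]
```

For example, with checkpoints written at steps 1000, 2000, and 3000 and `keep = 2`, the sketch retains steps 2000 and 3000. If the write at step 3000 were interrupted, the checkpoint at step 2000 would still be available as a backup.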
With checkpointing enabled, the simulation produces checkpoints in the BP5 format, written to the `checkpoints/` directory. Together with the data, checkpoints also store all the parameters of the simulation in the corresponding `.toml` file.
Continuing (restarting) from a checkpoint
To restart the simulation from the latest checkpoint, simply rerun the executable `./entity.xc ... <ARG>`, where `<ARG>` is one of the following command-line arguments: `-continue`, `-restart`, `-resume`, or `-checkpoint` (all of these are equivalent). The simulation will then automatically find the newest checkpoint and continue from it.
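When jobs are resubmitted automatically (e.g., on a cluster with a wall-time limit), a launcher script may want to add the restart flag only when checkpoints already exist. The following Python sketch shows one way to do that. The helper `restart_args` is hypothetical and not part of Entity; it only checks that the `checkpoints/` directory exists and is non-empty, and it assumes nothing about the names or layout of the files inside.

```python
from pathlib import Path


def restart_args(checkpoint_dir="checkpoints"):
    """Return ["-continue"] if the checkpoint directory has any content.

    Hypothetical launcher helper; it does not validate the checkpoints.
    """
    path = Path(checkpoint_dir)
    # A missing or empty directory means there is nothing to resume from.
    has_checkpoints = path.is_dir() and any(path.iterdir())
    return ["-continue"] if has_checkpoints else []
```

The returned list can be appended to whatever arguments the executable is normally launched with. Since the check is purely about directory contents, a corrupted checkpoint would not be detected by this sketch.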
While most of the simulation parameters will be read from the checkpoint itself, you may also provide an input file with updated parameters (e.g., if you wish to adjust the value of some parameters). Note, however, that not all parameters can be changed when restarting the simulation. In particular, anything related to the metric, the box extent, the resolution, or units (i.e., `ppc0`, `larmor0`, `skindepth0`) cannot be altered. If these immutable parameters are changed in the new input file, the new values will simply be ignored and overridden by those read from the checkpoint data. Likewise, Entity currently does not support changing the domain decomposition for multi-domain (i.e., MPI) simulations when resuming from a checkpoint.
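The override rule for immutable parameters can be pictured with a short Python sketch. It is an illustration under stated assumptions, not Entity's actual logic: parameters are modeled as a flat dictionary (the real input file is organized in blocks), the function `merge_on_restart` is hypothetical, and only the three unit parameters named above are listed as immutable, whereas in practice the metric, box extent, and resolution are immutable as well.

```python
# Subset of immutable parameters named in the text (flat keys are an assumption).
IMMUTABLE = {"ppc0", "larmor0", "skindepth0"}


def merge_on_restart(checkpoint_params, new_params, immutable=IMMUTABLE):
    """Combine checkpoint parameters with an updated input file.

    Returns the merged parameters and the list of ignored changes.
    """
    merged = dict(checkpoint_params)
    ignored = []
    for key, value in new_params.items():
        if key in immutable:
            # Immutable: the checkpoint value wins; record attempted changes.
            if value != checkpoint_params.get(key):
                ignored.append(key)
            continue
        # Mutable: the value from the new input file is used.
        merged[key] = value
    return merged, sorted(ignored)
```

For instance, if the checkpoint stores `ppc0 = 8.0` and the new input file sets `ppc0 = 16.0` along with a different checkpoint `interval`, the sketch keeps `ppc0 = 8.0`, reports `ppc0` as ignored, and adopts the new `interval` (treating `interval` as changeable is an assumption of this example).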