F.A.Q.¶
tl;dr
Here we collect the most frequent questions that might occur. Please, make sure to inspect this section before filing a GitHub issue.
Code usage¶
I want to have a custom boundary/injection/driving/distribution function/output.
All of that can be done via the tools provided by the problem generator. Please inspect carefully the section dedicated to that. Also have a look at the set of officially supported problem generators some of which might implement a variation of what your original intent is.
Technical¶
Running in a docker container with an AMD card
AMD has a vary brief documentation on the topic. In theory the docker containers that come with the code should work. Just make sure you have the proper groups (render and video) defined and added to the current user. If it complains about access to /dev/kfd, You might have to run docker as a root.
Compilation errors
Before merging with the released stable version, the code is tested on CUDA and HIP GPU compilers, as well as few version of CPU compilers (GCC 9...11, and LLVM 13...17). If you are encountering compiler errors on GPUs, first thing to check is whether the compilers are set up properly (i.e., whether CMake indeed captures the right compilers). Here are a few tips:
-
CUDA @ NVIDIA GPUs: make sure you have a version of
gccwhich is supported by the version of CUDA you are using; check out this unofficial compatibility matrix. In particular, Intel compilers are not very compatible with CUDA, and it is recommended to usegccinstead (you won't gain much by using Intel anyway, since CUDA will be doing the heavy-lifting). -
HIP/ROCm @ AMD GPUs: ROCm library is a headache. The documentation is even more so. Below is our best attempt to outline how to configure things on an AMD card.
- Make sure you have the ROCm library loaded: e.g., run
rocminfo; -
Sometimes the environment variables are not properly set up, so make sure you have the following variables properly defined:
-
CMAKE_PREFIX_PATH=/opt/rocm(or wherever ROCm is installed), CC=hipcc&CXX=hipcc,-
in rare occasions, you might have to also explicitly pass
-D CMAKE_CXX_COMPILER=clang++to cmake during the configuration stage; -
Compile the code with proper Kokkos flags; i.e., for MI250x GPUs you would use:
-D Kokkos_ENABLE_HIP=ONand-D Kokkos_ARCH_AMD_GFX90A=ON.
Now running is a bit trickier and the exact instruction might vary from machine to machine (part of it is because ROCm is much less streamlined than CUDA, but also system administrators on clusters are often more negligent towards AMD GPUs).
-
If you are running this on a cluster -- the first thing to do is to inspect the documentation of the cluster. There you might find the proper
slurmcommand for requesting GPU nodes and binding each GPU to respective CPUs. -
On personal machines figuring this out is a bit easier. First, inspect the output of
rocminfoandrocm-smi. From there, you should be able to find the ID of the GPU you want to use. If you see more than one device -- that means you either have an additional AMD CPU, or an integrated GPU installed as well; ignore them. You will need to override two environment variables: -
HSA_OVERRIDE_GFX_VERSIONset to GFX version that you used to compile the code (if you usedGFX1100Kokkos flag, that would be11.0.0); HIP_VISIBLE_DEVICES, andROCR_VISIBLE_DEVICESboth need to be set to your device ID (usually, it's just a number from 0 to the number of devices that support HIP).
For example, the output of
rocminfo | grep -A 5 "Agent "may look like this:In this case, the required GPU is theAgent 1 ******* Name: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics Uuid: CPU-XX Marketing Name: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics Vendor Name: CPU -- Agent 2 ******* Name: gfx1100 Uuid: GPU-XX Marketing Name: AMD Radeon™ RX 7700S Vendor Name: AMD -- Agent 3 ******* Name: gfx1100 Uuid: GPU-XX Marketing Name: AMD Radeon Graphics Vendor Name: AMDAgent 2, which supports GFX1100.rocm-smiwill look something like this:so the GPU we need has============================================ ROCm System Management Interface ============================================ ====================================================== Concise Info ====================================================== Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% (DID, GUID) (Edge) (Avg) (Mem, Compute, ID) ========================================================================================================================== 0 1 0x7480, 19047 35.0°C 0.0W N/A, N/A, 0 0Mhz 96Mhz 29.8% auto 100.0W 0% 0% 1 2 0x15bf, 17218 48.0°C 19.111W N/A, N/A, 0 None 1000Mhz 0% auto Unsupported 82% 5% ========================================================================================================================== ================================================== End of ROCm SMI Log ===================================================DeviceID of0(since it's the dedicated GPU, it might automatically turn off when idle to save power on laptops; hencePower = 0.0W). Now we can run the code with:HSA_OVERRIDE_GFX_VERSION=11.0.0 HIP_VISIBLE_DEVICES=0 ROCR_VISIBLE_DEVICES=0 ./executable ... - Make sure you have the ROCm library loaded: e.g., run
If the code gives an error, how do I know whether the problem is with the Entity itself or with the other libraries it depends on (e.g., Kokkos, ADIOS2, MPI)?
One good way of narrowing the problem down, is to run the so-called minimal examples, provided in the directory called minimal/ in the source. It has detailed instructions on how to compile these examples, and should hopefully be able to verify whether all the installed dependencies work as expected, before looking for an issue in the Entity itself.
Another good way is to compile the code with -D TESTS=ON flag, which will compile all the unit tests, and you can run them one-by-one. You may also compile the tests also with various flags, e.g., -D mpi=ON, -D output=ON, -D precision=double.