I am running a custom offline DQL training (which uses PyTorch and CUDA). However, I have encountered a segmentation fault several times. It does not happen on every run. Here are two screenshots of the output logs:
I don’t know whether it happens during a training step or an evaluation, as I am using the parameter evaluation_parallel_to_training.
I have tried to run my script with GDB as follows:
gdb -ex "set print thread-events off" -ex "set confirm off" -ex finish -ex run -ex "info frame" -ex bt -ex quit --args python MY_PYTHON_COMMANDS
Unfortunately, it does not print anything more. I have even tried adding -q -X faulthandler to the Python command, but the output is still the same.
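For reference, the in-script equivalent of -X faulthandler that I could also try looks roughly like the sketch below. It writes the traceback to a file, in case the terminal output is being lost. The helper name enable_crash_traceback and the log path are hypothetical; they are not part of my actual training script.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump Python tracebacks of all threads to log_path on a fatal signal.

    Hypothetical helper for illustration. The returned file object must stay
    open for as long as faulthandler is enabled.
    """
    log_file = open(log_path, "w")
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, the Python-level
    # traceback of every thread is written to log_file.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # "segfault_traceback.log" is an assumed file name.
    crash_log = enable_crash_traceback("segfault_traceback.log")
    # ... the training would start here ...
```

This only reports Python frames, so it would show which Python call was active at the time of the crash, not the native frames inside the CUDA or C++ code.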
I am using a Miniconda environment, which does not offer a Python debugging build.
Does anyone have any hints on how to debug such faults?