[Ray Core] How to debug a segmentation fault

I am running a custom offline DQL training (which uses Pytorch and CUDA). However, I have encountered several times a segmentation fault. It does not happen systematically. Here are two screenshots of the output logs:

I don’t know if it is happening during a training step or an evaluation as I am using the parameter evaluation_parallel_to_training.

I have I tried to run my script with GDB as follows:

gdb -ex "set print thread-events off" -ex "set confirm off" -ex finish -ex run -ex "info frame" -ex bt -ex quit --args python MY_PYTHON_COMMANDS

Unfortunately, it does not print anything more. I have even tried to add -q -X faulthandler, still the same.

I am using an miniconda environment which does not offer a Python debugging build.

Does anyone has any hints about how to debug such faults ?

Relabeled the question as Ray Core. Hopefully, someone there has an idea or some hints on how to debug these.

@sangcho ?

Doesn’t look like a memory leak. If you think it is, though, let me know @loicsacre . I have a tool for RLlib in the PR pipeline that is good at analyzing those using tracemalloc.

It is hard to say… the stacktrace looks like this happens from the library, not ray itself.

cc @mwtian do you know what are good ways to debug this sort of issues?