Frequent SIGSEGV running tune

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hello, I have been refining a custom RL environment that has run fine for months under various versions of Ray and PyTorch. Recently (with Ray 2.5.0 and 2.5.1), Tune trials are being killed by an unknown problem, possibly SIGSEGV, more often than not. In fact, it has been common for all 10 of 10 trials to crash like this after running for a few hundred thousand to a few million steps. I am running on a single laptop with 16 CPUs and one RTX 3080 GPU under Ubuntu 20.04. Based on other posts with similar problems, I suspected I might be overloading hardware resources, since I have been trying to improve training speed through better resource usage.

My latest jobs run only 4 trials (all running simultaneously), with resource usage of 0.8 GPU and 8.0/16.0 CPUs. Total memory usage (according to htop) is only 12 GB of the 64 GB available. There is also oodles of free space on the hard drive. So it seems I’m not close to any limits. Here is a typical error.txt file from a trial output directory.

Failure # 1 (occurred at 2023-08-03_16-39-56)
The actor died unexpectedly before finishing this task.
	class_name: SAC
	actor_id: 3175f85301f3f6809de2cc8601000000
	pid: 90611
	namespace: ebc50681-de2e-4cf1-a46e-3c3cd2ff20f8
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
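
One possible way to narrow down root cause (3) in the message above is Python's standard `faulthandler` module, which dumps the Python stack of a process when it receives SIGSEGV. The following is only a minimal sketch: the helper name `enable_crash_traceback` and the log directory are hypothetical, and it assumes the helper is called early in the crashing worker process (for example, from the environment's constructor). Ray's workers may already enable something similar, which this sketch does not check.

```python
import faulthandler
import os
import tempfile

# Keep the file object alive for the life of the process:
# faulthandler writes to its file descriptor when a fatal signal arrives.
_crash_log_file = None


def enable_crash_traceback(log_dir: str) -> str:
    """Hypothetical helper: dump Python tracebacks on SIGSEGV to a per-process file."""
    global _crash_log_file
    os.makedirs(log_dir, exist_ok=True)
    # One file per process ID so concurrent trials do not overwrite each other.
    path = os.path.join(log_dir, f"faulthandler_{os.getpid()}.log")
    _crash_log_file = open(path, "w")
    faulthandler.enable(file=_crash_log_file, all_threads=True)
    return path


# Assumed location for the logs; any writable directory would do.
crash_log_path = enable_crash_traceback(
    os.path.join(tempfile.gettempdir(), "cda0_crash_logs")
)
```

If a trial then dies from a segmentation fault, the matching `faulthandler_<pid>.log` file (using the pid reported in error.txt) should contain the Python stack at the time of the crash. An empty file would point more toward a SIGKILL, which cannot be caught.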

The code can be found on GitHub in the TonysCousin/cda0 repository, on the lc-cmd branch.

Thank you for any ideas you might have.