Ray Tune job fails with no explicit reason

  • High: It blocks me from completing my task.

I have a Ray Tune job that executes three parallel runs (Horovod Trainer), each using 4 GPUs. The run is marked as failed, but the driver log contains no error message at all. It looks like the run just terminated unexpectedly.

I tried to figure out what happened. In raylet.err for this job, I found:

[2023-04-10 15:52:17,701 E 369 369] (raylet) node_manager.cc:3097: 3 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 01b57a8cc2700707126d747fcc781bcef6c13a31b632bb8b64c376a8, IP: over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip

My reading is that the head node ran out of memory and worker processes were killed.
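To check how often this happens over a long run, the raylet log can be scanned for the same message. The following is a minimal sketch that assumes the message wording matches the line quoted above; the names `OOM_PATTERN` and `find_oom_kills` are made up for illustration and are not part of Ray.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: the raylet message has the same wording as the line quoted above.
OOM_PATTERN = re.compile(
    r"^\[(?P<ts>[^\]]+?) [A-Z] \d+ \d+\].*?"
    r"(?P<killed>\d+) Workers \(tasks / actors\) killed due to memory pressure \(OOM\)"
)


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (timestamp, workers_killed) for every OOM report line."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.match(line)
        if match:
            hits.append((match.group("ts"), int(match.group("killed"))))
    return hits


if __name__ == "__main__":
    sample = (
        "[2023-04-10 15:52:17,701 E 369 369] (raylet) node_manager.cc:3097: "
        "3 Workers (tasks / actors) killed due to memory pressure (OOM), "
        "0 Workers crashed due to other reasons"
    )
    print(find_oom_kills([sample]))
```

In practice the lines would come from the log file itself, for example `find_oom_kills(open("raylet.err"))`, run from the directory that holds the node's logs.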

  1. How can I prevent this? Reading the Out-Of-Memory Prevention page in the Ray 2.3.1 docs, it seems my only options are to increase head node memory and reduce the number of parallel trials. Should I also limit the memory usage of each trial?
  2. I have increased head node memory and now run only one trial in Tune. Surprisingly, the job may still fail due to head node OOM, which doesn’t happen with Train. I am surprised, as Tune with one trial is basically a Train job.

Any help is appreciated!

Could you share the entire scripts (both the Tune single-trial case and the Train case)?

The complete training script is too large to share :stuck_out_tongue:

Here is the code structure for training:

trainer = HorovodTrainer(

Here is the code structure for tuning:

trainer = HorovodTrainer(
tuner = tune.Tuner(
result = tuner.fit()

Thanks, this is definitely not expected. Tuning a single trial should be the same as training.

Can you confirm that the IP in that log message is the head node IP?

For OOM issue, are we talking about heap memory or object store memory? Do you see any disk spilling?

Yes, it is the head node IP.
It is heap memory. The Disk(root) panel on the Ray dashboard doesn’t show any meaningful numbers.

Also, I am wondering whether disabling log forwarding to the driver with ray.init(log_to_driver=False) would help. I am testing it now, but it may take several hours before I can see whether it is effective.

@ClarenceNg do you have any suggestions here?

Just my observation: I noticed that head node memory builds up over time. As the number of training iterations increases, head node memory increases.

Also, regarding the previous question: it is node memory. Object store memory on the head node is very low according to the Ray dashboard.
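One way to confirm this kind of growth per process, rather than only on the dashboard, is to log resident memory once per epoch from the driver or from inside the training function. This is a minimal sketch using only the standard library. It assumes Linux (it reads /proc/self/statm and falls back to peak RSS elsewhere), and the helper names `current_rss_mib` and `log_rss` are made up for illustration.

```python
import os
import resource
import sys
from typing import List


def current_rss_mib() -> float:
    """Resident set size of this process in MiB."""
    try:
        # Linux: the second field of /proc/self/statm is resident pages.
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20
    except OSError:
        # Fallback: peak (not current) RSS; bytes on macOS, KiB on Linux.
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return peak / 2**20 if sys.platform == "darwin" else peak / 2**10


def log_rss(tag: str, history: List[float]) -> float:
    """Record the current RSS under a label and keep it for later comparison."""
    rss = current_rss_mib()
    history.append(rss)
    print(f"[mem] {tag}: rss={rss:.1f} MiB (pid={os.getpid()})")
    return rss


if __name__ == "__main__":
    history: List[float] = []
    for epoch in range(3):
        # ... one epoch of training would run here ...
        log_rss(f"epoch {epoch}", history)
```

If the logged value keeps climbing across epochs in a process that only handles syncing or logging, that points at the process to investigate first.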

actually how is data fed into train_loop_per_worker?

@xwjiang2010 Thanks for following up on this! I have a sense of what might be going on. The model I am training is a 400M large language model. Each worker saves its checkpoints locally (in the pod), and the checkpoints get synced to the head node by default. I also have a customized Syncer that syncs the head node’s ray_results/ folder to HDFS. I am not sure how Syncer manages memory, but my guess is that as the number of checkpoints increases, more and more memory gets allocated to the Syncer thread.

After I disabled the head-node-to-HDFS syncer yesterday, head memory seems more or less under control; however, the run still fails because a worker runs into OOM. Similar to the head memory pattern, I notice that worker memory usage also increases continuously with the number of epochs. So I would like to disable the worker-to-head sync.

Now I am starting a new run disabling worker to head sync via

"run_config": RunConfig(
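The config above is cut off in the post, so the exact arguments are unknown. For reference only, here is a minimal sketch of one way to turn off Tune's syncing, assuming the Ray 2.3-era API in which `SyncConfig(syncer=None)` means "no syncing". The helper name `run_config_without_sync` is made up for illustration, and later Ray releases changed these options.

```python
def run_config_without_sync():
    """Build a RunConfig with Tune syncing turned off (Ray 2.3-era API)."""
    # Imported lazily so this file can be loaded on a machine without Ray.
    from ray import tune
    from ray.air import RunConfig

    # Assumption: in Ray 2.3, syncer=None disables Tune's syncing.
    return RunConfig(sync_config=tune.SyncConfig(syncer=None))


# Usage (requires Ray 2.3 and an already constructed trainer):
# tuner = tune.Tuner(trainer, run_config=run_config_without_sync())
```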

I will keep you posted whether it works. Three questions for you:

  1. Do you think the hypothesis makes sense?
  2. After I disable the worker-to-head syncer, I am still able to see worker logs in jobSubmitID.log, as below. Is this expected?
(RayTrainWorker pid=486, ip=)
 895/1017 [=========================>....] - ETA: 4:22 - loss: 2.2471 - sparse_categorical_accuracy: 0.237834404
(RayTrainWorker pid=485, ip=)
 936/1017 [==========================>...] - ETA: 2:52 - loss: 1.2043 - sparse_categorical_accuracy: 0.5960
  3. I also set log_to_driver=False in ray.init(). What’s the difference between disabling the syncer and setting this option?

Thanks in advance!

sorry for replying late.

Now that I understand more about your setup, I would suggest that you sync checkpoints (model stuff) from the trainable to HDFS directly, not via the head node. You only need to sync some trial metadata from the head node to HDFS, which should be pretty small compared to the model checkpoints. This should be easy to configure with SyncConfig(upload_dir).

Piping worker logs through to the driver (a Ray Core feature) is not done through Syncer (a Ray Tune construct). It is a totally separate path.
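A minimal sketch of that suggestion, assuming the Ray 2.3-era API where `SyncConfig` accepts an `upload_dir` URI. The helper name `run_config_with_upload_dir` and the HDFS URI in the usage comment are hypothetical placeholders, not values from this thread, and newer Ray releases configure remote storage differently.

```python
def run_config_with_upload_dir(upload_dir: str):
    """Build a RunConfig that uploads results to remote storage (Ray 2.3-era API)."""
    # Imported lazily so this file can be loaded on a machine without Ray.
    from ray import tune
    from ray.air import RunConfig

    # Assumption: upload_dir takes a storage URI and, as described above,
    # lets trainables and the head node each upload to it directly.
    return RunConfig(sync_config=tune.SyncConfig(upload_dir=upload_dir))


# Usage (requires Ray 2.3; the URI below is a made-up example):
# run_config = run_config_with_upload_dir("hdfs://namenode:8020/user/me/ray_results")
# tuner = tune.Tuner(trainer, run_config=run_config)
```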

Same for log_to_driver (a Ray Core thing). It is not related to Syncer stuff.

Thanks @xwjiang2010 for the explanation! Very helpful. After my experiments, I can confirm it is the syncing process causing memory issues.

Previously I didn’t set up SyncConfig due to this issue (now fixed). If I set up SyncConfig, the sync process will be worker to HDFS directly and head to HDFS directly, without worker to head sync. Correct?

Yes, it will not go through head node.

One word of caution: right now, model checkpoints are synced along the following path:
distributed worker rank 0 (most likely on a worker node) → trainable head (most likely on a worker node) → HDFS. We are working on removing the intermediate trainable head step as well. But you are right, it will not go through the head node, as long as the trainable is not running on the head node (which is most likely the case).