Ray Tune job fails with no explicit reason

mizhazha · April 10, 2023, 6:33pm

High: It blocks me from completing my task.

I have a Ray Tune job that executes three parallel runs (Horovod Trainer), each using 4 GPUs. The run is marked as failed, but the driver log contains no error message at all. It looks like the run just terminated unexpectedly.

I tried to figure out what happened. The raylet.err for this job contains:

[2023-04-10 15:52:17,701 E 369 369] (raylet) node_manager.cc:3097: 3 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 01b57a8cc2700707126d747fcc781bcef6c13a31b632bb8b64c376a8, IP: 100.98.39.46) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 100.98.39.46

My reading is that the head node ran out of memory and the worker processes were killed.
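
When triaging many raylet.err files, the OOM summary can be pulled out programmatically. The following is a minimal sketch, not a Ray API: the regular expression is derived only from the single log line quoted above, so the wording may differ in other Ray versions, and `parse_oom_summary` is a hypothetical helper name.

```python
import re

# Pattern derived from the one raylet.err line quoted in this thread.
# Assumption: other Ray versions may word this message differently.
OOM_SUMMARY = re.compile(
    r"(?P<oom>\d+) Workers \(tasks / actors\) killed due to memory pressure \(OOM\), "
    r"(?P<other>\d+) Workers crashed due to other reasons at node "
    r"\(ID: (?P<node_id>[0-9a-f]+), IP: (?P<ip>[0-9.]+)\)"
)


def parse_oom_summary(line):
    """Return OOM kill counts and node info from one raylet.err line, or None."""
    match = OOM_SUMMARY.search(line)
    if match is None:
        return None
    return {
        "oom_killed": int(match["oom"]),
        "other_crashes": int(match["other"]),
        "node_id": match["node_id"],
        "ip": match["ip"],
    }


sample = (
    "[2023-04-10 15:52:17,701 E 369 369] (raylet) node_manager.cc:3097: "
    "3 Workers (tasks / actors) killed due to memory pressure (OOM), "
    "0 Workers crashed due to other reasons at node "
    "(ID: 01b57a8cc2700707126d747fcc781bcef6c13a31b632bb8b64c376a8, "
    "IP: 100.98.39.46) over the last time period."
)
info = parse_oom_summary(sample)
print(info["oom_killed"], info["ip"])
```

Comparing the extracted IP against the known head node IP shows quickly whether the kills happened on the head node or on a worker node.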

How can I prevent this? Reading the "Out-Of-Memory Prevention" page of the Ray 2.3.1 docs, it seems the only options for me are to increase head node memory and reduce the number of parallel trials. Should I also limit the memory usage of each trial?
I have increased head node memory and now run only one trial in Tune. Surprisingly, the job may still fail due to head node OOM, which doesn’t happen with Train. I am surprised, as Tune with one trial is basically a Train job.

Any help is appreciated!

xwjiang2010 · April 10, 2023, 6:39pm

Could you share the entire scripts (both the Tune single-trial case and the Train case)?

mizhazha · April 10, 2023, 6:50pm

The complete training script is too large to share.

Here is the code structure for training:

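from ray.train.horovod import HorovodConfig, HorovodTrainer

# ray_job_args and tuning_config are defined elsewhere in the full script (not shown).
trainer = HorovodTrainer(
    train_loop_per_worker=ray_job_args["train_loop_per_worker"],
    train_loop_config=ray_job_args["train_loop_config"],
    scaling_config=ray_job_args["scaling_config"],
    horovod_config=HorovodConfig(
        timeout_s=tuning_config["distribution_strategy"]["timeout_s"]
    ),
)
trainer.fit()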

Here is the code structure for tuning:

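from ray import tune
from ray.train.horovod import HorovodConfig, HorovodTrainer

# Same trainer as in the training case; ray_job_args and tuning_config are defined elsewhere.
trainer = HorovodTrainer(
    train_loop_per_worker=ray_job_args["train_loop_per_worker"],
    train_loop_config=ray_job_args["train_loop_config"],
    scaling_config=ray_job_args["scaling_config"],
    horovod_config=HorovodConfig(
        timeout_s=tuning_config["distribution_strategy"]["timeout_s"]
    ),
)
# Wrap the trainer in a Tuner so that Tune schedules it as one or more trials.
tuner = tune.Tuner(
    trainable=trainer,
    tune_config=ray_job_args["tune_config"],
    run_config=ray_job_args["run_config"],
    param_space=ray_job_args["param_space"],
)
result = tuner.fit()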

xwjiang2010 · April 10, 2023, 7:54pm

Thanks, this is definitely not expected. Tuning a single trial should be the same as training.

Can you confirm that 100.98.39.46 is the head node IP?

For the OOM issue, are we talking about heap memory or object store memory? Do you see any disk spilling?

mizhazha · April 10, 2023, 8:03pm

Yes, 100.98.39.46 is the head node IP.
It is heap memory. The Disk(root) field on the Ray dashboard doesn’t show any meaningful numbers.

mizhazha · April 10, 2023, 8:58pm

Also, I am wondering whether disabling log forwarding to the driver with ray.init(log_to_driver=False) would help. I am testing it now, but it may take several hours before I can tell whether it is effective.

xwjiang2010 · April 10, 2023, 9:49pm

@ClarenceNg do you have any suggestions here?

mizhazha · April 11, 2023, 12:03am

Just an observation: the head node memory builds up over time. As the number of training iterations increases, the head node memory increases.

Also, to answer the previous question, it is node memory. Object store memory on the head node is very low according to the Ray dashboard.
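
One way to put a number on this kind of steady growth is to fit a straight line to periodic memory readings. The sketch below is only an illustration: `growth_per_step` is a hypothetical helper, and the sample values are made up rather than taken from this job. In practice the readings would come from the Ray dashboard or a process monitor.

```python
def growth_per_step(samples_mb):
    """Least-squares slope of memory readings (in MB) against their index."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    # Covariance of (index, reading) divided by variance of the index.
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical readings, one per training iteration.
readings = [1000, 1100, 1200, 1300]
slope = growth_per_step(readings)
print(f"{slope:.1f} MB per iteration")
```

A slope that stays clearly positive across many iterations points to accumulation (for example, buffered checkpoints) rather than a one-time allocation.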

xwjiang2010 · April 11, 2023, 2:54pm

Actually, how is data fed into train_loop_per_worker?

mizhazha · April 11, 2023, 4:00pm

@xwjiang2010 Thanks for following up on this! I have a sense of what might be going on. The model I am training is a 400M large language model. Each worker saves its checkpoints locally (in the pod), and the checkpoints get synced to the head node by default. I also have a customized Syncer that syncs the head node’s ray_results/ folder to HDFS. I am not sure how the Syncer manages memory, but my guess is that as the number of checkpoints increases, more and more memory gets allocated to the Syncer thread.

After I disabled the head-to-HDFS syncer yesterday, the head node memory seems more or less under control; however, the run still fails because a worker now runs into OOM. Similar to the head node memory pattern, I notice that worker memory usage also increases continuously with the number of epochs. So I would like to disable the worker-to-head sync.

Now I am starting a new run with the worker-to-head sync disabled via

"run_config": RunConfig(
    sync_config=SyncConfig(syncer=None)
),

I will keep you posted on whether it works. Three questions for you:

1. Do you think the hypothesis makes sense?
2. After disabling the worker-to-head syncer, I am still able to see worker logs in jobSubmitID.log, as below. Is this expected?

(RayTrainWorker pid=486, ip=100.96.156.69)
 895/1017 [=========================>....] - ETA: 4:22 - loss: 2.2471 - sparse_categorical_accuracy: 0.237834404
(RayTrainWorker pid=485, ip=100.97.242.40)
 936/1017 [==========================>...] - ETA: 2:52 - loss: 1.2043 - sparse_categorical_accuracy: 0.5960

3. I also set log_to_driver=False in ray.init(). What’s the difference between disabling the syncer and setting this option?

Thanks in advance!

xwjiang2010 · April 12, 2023, 10:17pm

Sorry for the late reply.

Now that I understand more about your setup, I would suggest that you sync checkpoints (the model artifacts) from the trainable to HDFS directly, not via the head node. You only need to sync some trial metadata from the head node to HDFS, which should be pretty small compared to the model checkpoints. This should be easy to configure with SyncConfig(upload_dir).

The worker log pipe-through (a Ray Core functionality) is not done through the Syncer (a Ray Tune construct). It is a totally separate path.

The same goes for log_to_driver (a Ray Core option). It is not related to the Syncer.
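
A minimal configuration sketch of that suggestion, assuming the Ray 2.3-era API discussed in this thread. The HDFS URI is a placeholder, not a path from this job, and later Ray releases changed how storage is configured, so check the docs for your version.

```python
# Configuration fragment, assuming the Ray 2.3-era API.
from ray import tune
from ray.air import RunConfig

run_config = RunConfig(
    sync_config=tune.SyncConfig(
        # Hypothetical HDFS URI; replace it with a location your cluster can write to.
        upload_dir="hdfs://namenode:8020/path/to/ray_results",
    ),
)
```

The resulting run_config would then be passed to tune.Tuner(run_config=...) in place of the one that sets syncer=None.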

mizhazha · April 12, 2023, 10:39pm

Thanks @xwjiang2010 for the explanation! Very helpful. After my experiments, I can confirm that the syncing process is what causes the memory issues.

Previously I didn’t set up SyncConfig due to this issue (now fixed). If I set up SyncConfig, the sync will go from worker to HDFS directly and from head to HDFS directly, without any worker-to-head sync. Correct?

xwjiang2010 · April 12, 2023, 11:00pm

Yes, it will not go through the head node.

One word of caution: right now, model checkpoints sync as follows:
distributed worker rank 0 (most likely on a worker node) → trainable head (most likely on a worker node) → HDFS. We are working to remove the intermediate trainable head step as well. But you are right, it will not go through the head node (as long as the trainable is not running on the head node, which is most likely the case).
