Hi, I’m attempting to run multiple Ray Tune scripts concurrently on a Ray cluster that uses KubeRay on GKE. I have ~10 nodes, each with 16 CPUs and 16 GB of RAM.
Each Ray Tune script connects to my Ray cluster using the Ray Client and runs 50 trials with up to 5 concurrent trials. The objective function for each trial kicks off ~50 embarrassingly parallel remote tasks, each of which trains an XGBoost model on subsets of data that were put in the Plasma object store. I have TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED=1 set on my cluster, since the docs suggest doing this when running concurrent Ray Tune scripts.
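For context, each script is structured roughly like the sketch below. The data loading, the XGBoost training inside run_backtest_fold, and the head node address are placeholders rather than my real code, but the overall shape matches what I’m running: connect over the Ray Client, ray.put a shared dataset, and have each of 50 trials (at most 5 at a time) fan out ~50 nested tasks.

import numpy as np
import ray
from ray import tune

ray.init("ray://<head-node-address>:10001")  # Ray Client connection into the KubeRay cluster

# Roughly 12 MiB of training data, put into the object store once per script
# and shared by every nested task that the trials spawn.
data_ref = ray.put(np.random.rand(15_000, 100))  # placeholder for the real dataset


@ray.remote
def run_backtest_fold(data, config):
    # Placeholder for training one XGBoost model on a subset of the shared data.
    import xgboost as xgb

    dtrain = xgb.DMatrix(data[:, :-1], label=data[:, -1])
    booster = xgb.train({"eta": config["eta"], "max_depth": 3}, dtrain, num_boost_round=10)
    return float(booster.eval(dtrain).split(":")[-1])


def _objective_function(config, data_ref=None):
    # Each trial fans out ~50 embarrassingly parallel folds over the shared data.
    futures = [run_backtest_fold.remote(data_ref, config) for _ in range(50)]
    tune.report(score=float(np.mean(ray.get(futures))))


tune.run(
    tune.with_parameters(_objective_function, data_ref=data_ref),
    config={"eta": tune.loguniform(1e-3, 1e-1)},
    num_samples=50,           # 50 trials per script
    max_concurrent_trials=5,  # at most 5 trials running at once
)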
When I kick off 5 Ray Tune scripts concurrently, some of them succeed, but others fail (usually all at around the same time) after completing most of their trials, with the following error:
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(func pid=6138, ip=10.4.30.5) return func(*args, **kwargs)
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
(func pid=6138, ip=10.4.30.5) raise value.as_instanceof_cause()
(func pid=6138, ip=10.4.30.5) ray.exceptions.RayTaskError: ray::run_backtest_fold() (pid=5161, ip=10.4.21.4)
(func pid=6138, ip=10.4.30.5) At least one of the input arguments for this task could not be computed:
(func pid=6138, ip=10.4.30.5) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0500000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(func pid=6138, ip=10.4.30.5)
(func pid=6138, ip=10.4.30.5) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*05000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.4.9.169) for more information about the Python worker failure.
(func pid=6138, ip=10.4.30.5) 2022-06-17 15:21:32,401 ERROR function_runner.py:286 -- Runner Thread raised error.
(func pid=6138, ip=10.4.30.5) Traceback (most recent call last):
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
(func pid=6138, ip=10.4.30.5) self._entrypoint()
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
(func pid=6138, ip=10.4.30.5) return self._trainable_func(
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(func pid=6138, ip=10.4.30.5) return method(self, *_args, **_kwargs)
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
(func pid=6138, ip=10.4.30.5) output = fn()
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/crystal/ray/hpo.py", line 211, in _objective_function
(func pid=6138, ip=10.4.30.5) raise error
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/crystal/ray/hpo.py", line 168, in _objective_function
(func pid=6138, ip=10.4.30.5) complete_folds_metrics = run_backtest(
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/crystal/ray/hpo.py", line 82, in run_backtest
(func pid=6138, ip=10.4.30.5) result = ray.get(finished)
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(func pid=6138, ip=10.4.30.5) return func(*args, **kwargs)
(func pid=6138, ip=10.4.30.5) File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
(func pid=6138, ip=10.4.30.5) raise value.as_instanceof_cause()
(func pid=6138, ip=10.4.30.5) ray.exceptions.RayTaskError: ray::run_backtest_fold() (pid=5161, ip=10.4.21.4)
(func pid=6138, ip=10.4.30.5) At least one of the input arguments for this task could not be computed:
(func pid=6138, ip=10.4.30.5) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0500000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(func pid=6138, ip=10.4.30.5)
(func pid=6138, ip=10.4.30.5) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*05000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.4.9.169) for more information about the Python worker failure.
(run pid=3110) 2022-06-17 15:21:32,517 ERROR trial_runner.py:886 -- Trial _objective_function_d59b31cc: Error processing event.
The logs referenced in the error also look like normal “driver” logs:
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:591: Disconnecting to the raylet.
[2022-06-17 15:19:41,102 I 1392 1471] raylet_client.cc:162: RayletClient::Disconnect, exit_type=INTENDED_EXIT, has creation_task_exception_pb_bytes=0
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:539: Shutting down a core worker.
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:563: Disconnecting a GCS client.
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:567: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-06-17 15:19:41,102 I 1392 1494] core_worker.cc:679: Core worker main io service stopped.
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:576: Core worker ready to be deallocated.
[2022-06-17 15:19:41,102 I 1392 1471] core_worker_process.cc:298: Removed worker 05000000ffffffffffffffffffffffffffffffffffffffffffffffff
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:530: Core worker is destructed
[2022-06-17 15:19:41,171 I 1392 1471] core_worker_process.cc:154: Destructing CoreWorkerProcessImpl. pid: 1392
[2022-06-17 15:19:41,171 I 1392 1471] io_service_pool.cc:47: IOServicePool is stopped.
The driver logs of the other successful Ray Tune scripts look the same.
I also enabled RAY_record_ref_creation_sites=1 on my cluster to see if I could track down the ObjectRef that the error message complains about. Using ray memory after one of my script runs failed, I can see that the ObjectRef the error was referring to is still in the Plasma store:
10.4.9.169  229  Worker  /home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/registry.py:get:196 | /home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/registry.py:get_trainable_cls:44 | /home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial.py:get_trainable_cls:731 | /home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial.py:__init__:290  -  12672606.0 B  LOCAL_REFERENCE  00ffffffffffffffffffffffffffffffffffffff0500000002000000
If I understand correctly, this corresponds to my trainable function (I’m using the function API).
Some other observations that may be helpful:
- No object spilling is happening on my cluster. I’m only putting about 12 MiB in the Plasma object store per Tune script. This data is shared across all nested tasks that one script’s objective function creates.
- I’m seeing a lot of logs in raylet.out along the lines of: [2022-06-19 13:05:16,694 I 58 58] (raylet) object_buffer_pool.cc:153: Not enough memory to create requested object 00ffffffffffffffffffffffffffffffffffffff0500000002000000, aborting. Are these anything to be worried about, especially considering that the object store on my cluster doesn’t seem to be heavily utilized?
- I’m also seeing logs in raylet.out like: [2022-06-19 13:05:06,077 W 58 58] (raylet) task_spec.cc:50: More than 120 types of tasks seen, this may reduce performance. Is there something I should be doing to fix these?
Are there quirks to running multiple concurrent Ray Client connections and/or Ray Tune scripts that I should be aware of? Is it possible that the Ray Client server or core driver worker is being killed prematurely on the head node?
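For what it’s worth, my rough mental model of the failure (based on the docs’ description of object ownership) is the toy sketch below. The ray.kill is just a stand-in for whatever might be taking down the owning worker; my scripts never kill anything themselves, and none of these names come from my real code.

import time

import ray

ray.init()


@ray.remote
class DataOwner:
    def create(self):
        # The object put here is owned by this actor's worker process.
        return [ray.put("shared data")]


@ray.remote
def borrower(data):
    # `data` is resolved from the ObjectRef before the task body runs.
    return len(data)


owner = DataOwner.remote()
[ref] = ray.get(owner.create.remote())

ray.kill(owner)  # stand-in for the owning worker/driver exiting
time.sleep(2)    # give the failure a moment to propagate

# Raises RayTaskError wrapping OwnerDiedError ("At least one of the input
# arguments for this task could not be computed"), which is what my failing
# trials appear to hit.
ray.get(borrower.remote(ref))

If that is the right picture, then the question becomes what is causing the worker at 10.4.9.169 (presumably the Ray Client server-side driver for one of the scripts) to exit while other trials still reference its objects.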
Thank you in advance for your time. Please let me know if any additional information would be helpful. I’m really enjoying Ray so far, and I’m hoping to be able to run many concurrent Ray Tune scripts on a single long-lived KubeRay cluster, so any help with these issues would be much appreciated!