Observing `OwnerDiedError` intermittently when running concurrent Ray Tune scripts

Hi, I’m attempting to run multiple Ray Tune scripts concurrently on a Ray Cluster deployed with KubeRay on GKE. I have ~10 nodes, each with 16 CPUs and 16 GB of RAM.

Each Ray Tune script connects to my Ray Cluster using the Ray Client and runs 50 trials, with at most 5 trials running concurrently. The objective function for each trial kicks off ~50 more embarrassingly parallel remote tasks that each train an XGBoost model on subsets of data that I put in the Plasma object store. I have TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED=1 set on my cluster since the docs suggest doing this when running concurrent Ray Tune scripts.
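For context, each script follows roughly this shape (a simplified sketch, not my actual code; the data, helper bodies, config values, and exact tune.run arguments are stand-ins):

```python
import numpy as np
import ray
from ray import tune

# Connect to the cluster over Ray Client (the service address is illustrative).
ray.init("ray://example-raycluster-head-svc:10001")

# Roughly 12 MiB of shared data, put into the object store once per script
# (a stand-in for our real dataset).
data_ref = ray.put(np.random.rand(1_500_000))

@ray.remote
def run_backtest_fold(data, config):
    # Stand-in for training one XGBoost model on a subset of `data`.
    return float(np.mean(data)) * config["learning_rate"]

def objective(config):
    # Each trial fans out ~50 embarrassingly parallel remote tasks that
    # all read the same shared object from the Plasma store.
    futures = [run_backtest_fold.remote(data_ref, config) for _ in range(50)]
    scores = ray.get(futures)
    tune.report(score=sum(scores) / len(scores))

tune.run(
    objective,
    config={"learning_rate": tune.loguniform(1e-3, 1e-1)},
    num_samples=50,           # 50 trials per script
    max_concurrent_trials=5,  # at most 5 trials running at once
)
```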

When I kick off 5 concurrent Ray Tune scripts, some of them succeed, but others fail (usually all at around the same time) after completing most of their trials, with the following error:

(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(func pid=6138, ip=10.4.30.5)     return func(*args, **kwargs)
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
(func pid=6138, ip=10.4.30.5)     raise value.as_instanceof_cause()
(func pid=6138, ip=10.4.30.5) ray.exceptions.RayTaskError: ray::run_backtest_fold() (pid=5161, ip=10.4.21.4)
(func pid=6138, ip=10.4.30.5)   At least one of the input arguments for this task could not be computed:
(func pid=6138, ip=10.4.30.5) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0500000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(func pid=6138, ip=10.4.30.5)
(func pid=6138, ip=10.4.30.5) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*05000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.4.9.169) for more information about the Python worker failure.
(func pid=6138, ip=10.4.30.5) 2022-06-17 15:21:32,401   ERROR function_runner.py:286 -- Runner Thread raised error.
(func pid=6138, ip=10.4.30.5) Traceback (most recent call last):
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
(func pid=6138, ip=10.4.30.5)     self._entrypoint()
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
(func pid=6138, ip=10.4.30.5)     return self._trainable_func(
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(func pid=6138, ip=10.4.30.5)     return method(self, *_args, **_kwargs)
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
(func pid=6138, ip=10.4.30.5)     output = fn()
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/crystal/ray/hpo.py", line 211, in _objective_function
(func pid=6138, ip=10.4.30.5)     raise error
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/crystal/ray/hpo.py", line 168, in _objective_function
(func pid=6138, ip=10.4.30.5)     complete_folds_metrics = run_backtest(
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/crystal/ray/hpo.py", line 82, in run_backtest
(func pid=6138, ip=10.4.30.5)     result = ray.get(finished)
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(func pid=6138, ip=10.4.30.5)     return func(*args, **kwargs)
(func pid=6138, ip=10.4.30.5)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 1831, in get
(func pid=6138, ip=10.4.30.5)     raise value.as_instanceof_cause()
(func pid=6138, ip=10.4.30.5) ray.exceptions.RayTaskError: ray::run_backtest_fold() (pid=5161, ip=10.4.21.4)
(func pid=6138, ip=10.4.30.5)   At least one of the input arguments for this task could not be computed:
(func pid=6138, ip=10.4.30.5) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0500000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(func pid=6138, ip=10.4.30.5)
(func pid=6138, ip=10.4.30.5) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*05000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.4.9.169) for more information about the Python worker failure.
(run pid=3110) 2022-06-17 15:21:32,517  ERROR trial_runner.py:886 -- Trial _objective_function_d59b31cc: Error processing event.

The logs referenced in the error also look like normal “driver” logs:

[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:591: Disconnecting to the raylet.
[2022-06-17 15:19:41,102 I 1392 1471] raylet_client.cc:162: RayletClient::Disconnect, exit_type=INTENDED_EXIT, has creation_task_exception_pb_bytes=0
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:539: Shutting down a core worker.
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:563: Disconnecting a GCS client.
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:567: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-06-17 15:19:41,102 I 1392 1494] core_worker.cc:679: Core worker main io service stopped.
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:576: Core worker ready to be deallocated.
[2022-06-17 15:19:41,102 I 1392 1471] core_worker_process.cc:298: Removed worker 05000000ffffffffffffffffffffffffffffffffffffffffffffffff
[2022-06-17 15:19:41,102 I 1392 1471] core_worker.cc:530: Core worker is destructed
[2022-06-17 15:19:41,171 I 1392 1471] core_worker_process.cc:154: Destructing CoreWorkerProcessImpl. pid: 1392
[2022-06-17 15:19:41,171 I 1392 1471] io_service_pool.cc:47: IOServicePool is stopped.

The driver logs of the other successful Ray Tune scripts look the same.

I also enabled RAY_record_ref_creation_sites=1 on my cluster to see if I could track down the ObjectRef that the error message complains about. Using `ray memory` after one of my scripts failed, I can see that the ObjectRef the error refers to is still in the Plasma store:

10.4.9.169  229  Worker  /home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/registry.py:get:196 | /home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/registry.py:get_trainable_cls:44 | /home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial.py:get_trainable_cls:731 | /home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial.py:__init__:290  -  12672606.0 B  LOCAL_REFERENCE  00ffffffffffffffffffffffffffffffffffffff0500000002000000

If I understand correctly, this corresponds to my trainable function (I’m using the function API).

Some other observations that may be helpful:

  • No object spilling is happening on my cluster. I’m only putting about 12MiB in the Plasma object store per Tune script. This data is shared across all nested tasks that one script’s objective function creates.
  • I’m seeing a lot of logs in raylet.out along the lines of: [2022-06-19 13:05:16,694 I 58 58] (raylet) object_buffer_pool.cc:153: Not enough memory to create requested object 00ffffffffffffffffffffffffffffffffffffff0500000002000000, aborting – are these anything to be worried about, especially considering that the object store on my cluster doesn’t seem to be heavily utilized?
  • I’m also seeing logs in raylet.out like: [2022-06-19 13:05:06,077 W 58 58] (raylet) task_spec.cc:50: More than 120 types of tasks seen, this may reduce performance. Is there something I should be doing to fix these?

Are there quirks to running multiple concurrent Ray Client connections and / or Ray Tune scripts that I should be aware of? Is it possible that the Ray Client server or core driver worker is being killed prematurely on the head node?
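For what it’s worth, my rough mental model of the failure is the sketch below (a minimal, hypothetical example, not a reproduction of my actual workload): once the worker that owns an ObjectRef exits, a task taking that ref as an argument should fail with the same “input arguments could not be computed” / OwnerDiedError combination I’m seeing.

```python
# Minimal, hypothetical illustration of the ownership failure mode.
import time
import ray

ray.init()

@ray.remote
class Owner:
    def put_data(self):
        # The object created here is owned by this actor's worker process.
        return [ray.put(b"x" * 1_000_000)]

@ray.remote
def consume(data):
    # Ray resolves the ObjectRef argument before the task body runs.
    return len(data)

owner = Owner.remote()
[shared_ref] = ray.get(owner.put_data.remote())

ray.kill(owner)  # the owning worker goes away
time.sleep(1)

# Resolving the argument should now fail with "At least one of the input
# arguments for this task could not be computed:
# ray.exceptions.OwnerDiedError", just like in my traceback.
print(ray.get(consume.remote(shared_ref)))
```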

Thank you in advance for your time :pray:. Please let me know if any additional information would be helpful. I’m loving using Ray so far. I’m hoping to be able to run many concurrent Ray Tune scripts on a single long-lived KubeRay cluster so any help with these issues would be much appreciated!

Hey @smacpher-myst, thanks so much for including a bunch of details in the description!

Are there quirks to running multiple concurrent Ray Client connections and / or Ray Tune scripts that I should be aware of?

Yeah, this is my hunch as to what’s causing the issues: having multiple Tune scripts running in parallel on the same cluster may be leading to some sort of undesired resource contention/removal.

Could you share a little more about your use-case, specifically why you’re running multiple Tune scripts in parallel on the same cluster? I’m also curious to know if it’s possible to run this as a single Tune experiment or to run each experiment on a separate cluster.


Hi @matthewdeng, thank you for the very quick response! Of course – I’ve been digging into this issue for a few days, so I figured I would include as much as possible to help us get to the bottom of this!

Yeah this is my hunch as to what’s causing the issues, specifically having multiple Tune scripts running in parallel on the same cluster may be leading to some sort of undesired resource contention/removal.

Got it – that’s good to know. Do you know any details about why running multiple Tune experiments could lead to issues? I’d love to understand! Also, sometimes all concurrent Tune scripts succeed; anecdotally, this usually happens when they are smaller (e.g. fewer trials, fewer nested tasks).

Yes, of course! I’m hoping to operate a KubeRay cluster that data scientists at my company can use to run tuning jobs on. I’m kicking off Kubernetes Jobs on their behalf that connect to the KubeRay cluster using the Ray Client. It’s possible that multiple jobs could be kicked off at around the same time, and I’d love to be able to run them concurrently to minimize waiting time. So it’s a requirement that multiple tuning jobs can run at the same time (e.g. if a data scientist wants to run multiple tuning jobs at once, or if multiple data scientists want to run tuning jobs at the same time). Does that make sense? I can start thinking of workarounds if there are hard blockers that prevent running multiple Tune scripts at the same time reliably. It should be possible to run each on its own cluster; however, that would probably require more maintenance overhead than a single large cluster. Let me know if I can provide any more details!

Also just wanted to say Ray has been a pleasure to work with so far! Very powerful tool that you all are building!!

In general, multitenancy at the experiment level isn’t well supported or tested. Even TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED is an experimental flag that, unfortunately, I’d say is not production-ready.

Your use case makes sense! I actually do think launching a single cluster per experiment would be more suitable here.

  1. When no jobs are being run you won’t have to maintain an idle cluster.
  2. When users do launch jobs they have better isolation of resources, logs, etc. and can more easily understand the state of their experiment.

cc @rliaw would love to get your thoughts on this in case I missed anything!

@matthewdeng Got it – I’ll investigate setting up an individual Ray Cluster per job then. I was hoping not to have to worry about spinning clusters up and down that often, but it seems like KubeRay is working on support for this, so I can look into that.

Do you think support for multitenancy will be added to Ray Tune in the future? In general, I have found it nice to maintain a single long-lived cluster when it works! Is this also the case for Ray Core (i.e. that having multiple Ray Clients / scripts running at the same time isn’t well supported)?

Oops sorry for the late reply here, totally missed this message…

Do you think support for multitenancy will be added to Ray Tune in the future?

I’d say we don’t explicitly not support multitenancy, but aren’t actively investing resources into building out this story. If you run into issues that can be easily reproduced, I’d be happy to look into them and get them triaged!

Is this also the case for Ray Core

Yep, in general it should work, but it’s not clear if each individual client connection is able to get the desirable isolation that they’re expecting. One worker might hog up all the CPUs, another might expect all of the memory/disk to themselves, etc.
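To illustrate (just a sketch of the general behavior, not a complete isolation story): tasks can declare logical resources so the scheduler accounts for them, but Ray doesn’t physically enforce what they actually consume.

```python
import time
import ray

@ray.remote(num_cpus=2, memory=2 * 1024**3)  # logical reservations only
def heavy_step(x):
    # Ray schedules this task assuming it uses 2 CPUs and 2 GiB, but the
    # task itself could still spawn more threads or allocate more memory
    # than it declared; nothing physically enforces the reservation.
    time.sleep(1)
    return x * 2
```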

@matthewdeng Got it, good to know. We’re pursuing the one-cluster-per-job approach right now but may try to pick up the multitenancy approach again in the future. If and when we do that, I can try to create a reproduction setup.

Yep, in general it should work, but it’s not clear if each individual client connection is able to get the desirable isolation that they’re expecting. One worker might hog up all the CPUs, another might expect all of the memory/disk to themselves, etc.

Is this still the case if we specify placement groups carefully so that we carve out a piece of the cluster for each running job? For instance, if we have 100 cores available on our cluster, we make sure not to schedule Ray Tune experiments that would use more cores than that. While the concurrent Tune jobs were running, everything looked as expected: only the expected jobs were scheduled on each of our nodes, and there wasn’t any contention since we sized our placement groups appropriately. It almost seemed like, when one experiment finished successfully, it would clean up resources (perhaps Ray core workers, client servers, or some other bookkeeping mechanism Ray was using) that the other experiments were still using. I’d be curious to know where in the Ray Tune / Ray Core code this behavior could be traced, if any of this rings a bell. Otherwise, no worries :slight_smile: this was already super helpful! Thanks!
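For reference, this is roughly how we request per-trial resources (a simplified sketch; the objective is a stand-in and the bundle counts are illustrative, sized in practice to match the nested task fan-out):

```python
from ray import tune

def objective(config):
    # Stand-in trainable; the real one fans out nested backtest tasks
    # that run inside the trial's placement group bundles.
    tune.report(score=config["x"] ** 2)

# One bundle for the trainable itself plus one CPU bundle per nested task
# we expect to run concurrently, so each trial's footprint is reserved up
# front and concurrent experiments can't oversubscribe the cluster.
per_trial = tune.PlacementGroupFactory([{"CPU": 1}] + [{"CPU": 1}] * 5)

tune.run(
    objective,
    config={"x": tune.uniform(0, 1)},
    resources_per_trial=per_trial,
    num_samples=50,
)
```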

Yeah, if you have tighter guarantees around setting up placement groups in your scheduling layer, that is certainly possible! One thing to be aware of is that reserving CPUs with Ray doesn’t actually reserve the physical cores themselves; it’s up to your application logic to respect the amount of resources it is being “allocated”.

The behavior you’re mentioning does sound like what TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED is supposed to fix, which is defined here, but beyond that I’m not sure whether it’s a bug in this particular logic or somewhere else in the stack.

Got it, that makes sense. We’re doing both of those things, which makes this behavior extra mysterious. I’m not that familiar with the lower-level Ray code, but if I had to guess, based on multiple jobs failing around the same time after some succeed, I wonder if multiple experiments are unintentionally reserving overlapping resources, either at the placement group level or at the Ray Client level.

Luckily we aren’t blocked by this since we’ve switched to running each experiment on single-node Ray clusters. I’ll be on the lookout for any issues that look related to this. Thanks for all of the help!