Tune.run works but TuneGridSearchCV.fit does not work for me

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Most of the Ray Tune examples work for me (ray/python/ray/tune/examples at master · ray-project/ray · GitHub); however, when I try to run a tune-sklearn example that uses TuneGridSearchCV, I get errors (for example, sgd.py in the tune-sklearn GitHub repo).

I am running in an LXC container (Ubuntu 20.04) with Ray 1.13.0. Below is the error, and I've also attached the log file:

2022-06-15 21:47:10,529 ERROR trial_runner.py:883 -- Trial _Trainable_b3fe3_00002: Error processing event.
Traceback (most recent call last):
  File "sgd.py", line 33, in <module>
    tune_search.fit(x_train, y_train)
  File "/usr/local/lib/python3.8/dist-packages/tune_sklearn/tune_basesearch.py", line 622, in fit
    return self._fit(X, y, groups, tune_params, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/tune_sklearn/tune_basesearch.py", line 533, in _fit
    self.analysis_ = self._tune_run(X, y, config, resources_per_trial,
  File "/usr/local/lib/python3.8/dist-packages/tune_sklearn/tune_gridsearch.py", line 302, in _tune_run
    analysis = tune.run(trainable, **run_args)
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/tune.py", line 718, in run
    runner.step()
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 778, in step
    self._wait_and_handle_event(next_trial)
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 755, in _wait_and_handle_event
    raise e
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 736, in _wait_and_handle_event
    self._on_executor_error(trial, result[ExecutorEvent.KEY_EXCEPTION])
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 884, in _on_executor_error
    raise e
ray.tune.error.TuneGetNextExecutorEventError: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 1833, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=5392, ip=ec2-44-201-222-82.compute-1.amazonaws.com, repr=<ray.tune.utils.trainable._Trainable object at 0x7f00278cea60>)
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable.py", line 156, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/utils/trainable.py", line 389, in setup
    setup_kwargs[k] = parameter_registry.get(prefix + k)
  File "/usr/local/lib/python3.8/dist-packages/ray/tune/registry.py", line 225, in get
    return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff2600000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*26000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.3.103) for more information about the Python worker failure.

(_Trainable pid=5392) 2022-06-15 21:47:10,520   WARNING worker.py:1829 -- Local object store memory usage:
(_Trainable pid=5392) 
(_Trainable pid=5392) (global lru) capacity: 4633767936
(_Trainable pid=5392) (global lru) used: 0%
(_Trainable pid=5392) (global lru) num objects: 0
(_Trainable pid=5392) (global lru) num evictions: 0
(_Trainable pid=5392) (global lru) bytes evicted: 0
(_Trainable pid=5392) 
(_Trainable pid=5392) 2022-06-15 21:47:10,522   ERROR worker.py:451 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=5392, ip=ec2-44-201-222-82.compute-1.amazonaws.com, repr=<ray.tune.utils.trainable._Trainable object at 0x7f00278cea60>)
(_Trainable pid=5392)   File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable.py", line 156, in __init__
(_Trainable pid=5392)     self.setup(copy.deepcopy(self.config))
(_Trainable pid=5392)   File "/usr/local/lib/python3.8/dist-packages/ray/tune/utils/trainable.py", line 389, in setup
(_Trainable pid=5392)     setup_kwargs[k] = parameter_registry.get(prefix + k)
(_Trainable pid=5392)   File "/usr/local/lib/python3.8/dist-packages/ray/tune/registry.py", line 225, in get
(_Trainable pid=5392)     return ray.get(self.references[k])
(_Trainable pid=5392) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff2600000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(_Trainable pid=5392) 
(_Trainable pid=5392) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*26000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.3.103) for more information about the Python worker failure.
(_Trainable pid=6191) 2022-06-15 21:47:11,004   WARNING worker.py:1829 -- Local object store memory usage:
(_Trainable pid=6191) 
(_Trainable pid=6191) (global lru) capacity: 4631811686
(_Trainable pid=6191) (global lru) used: 0%
(_Trainable pid=6191) (global lru) num objects: 0
(_Trainable pid=6191) (global lru) num evictions: 0
(_Trainable pid=6191) (global lru) bytes evicted: 0
(_Trainable pid=6191) 
(_Trainable pid=6191) 2022-06-15 21:47:11,006   ERROR worker.py:451 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=6191, ip=ec2-44-204-187-139.compute-1.amazonaws.com, repr=<ray.tune.utils.trainable._Trainable object at 0x7f53b96e1a60>)
(_Trainable pid=6191)   File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable.py", line 156, in __init__
(_Trainable pid=6191)     self.setup(copy.deepcopy(self.config))
(_Trainable pid=6191)   File "/usr/local/lib/python3.8/dist-packages/ray/tune/utils/trainable.py", line 389, in setup
(_Trainable pid=6191)     setup_kwargs[k] = parameter_registry.get(prefix + k)
(_Trainable pid=6191)   File "/usr/local/lib/python3.8/dist-packages/ray/tune/registry.py", line 225, in get
(_Trainable pid=6191)     return ray.get(self.references[k])
(_Trainable pid=6191) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff2600000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(_Trainable pid=6191) 
(_Trainable pid=6191) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*26000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.3.103) for more information about the Python worker failure.

Hi @daquang, are you running this on a clean environment?

I just ran this locally in a fresh conda environment and it ran successfully:

conda create -n tune-sklearn-2  python=3.8 -y
conda activate tune-sklearn-2
pip install "ray[tune]" tune-sklearn packaging
python sgd.py
0.9388888888888889

Hi @matthewdeng

Thanks for the quick reply. I forgot to mention: on a single node (i.e., just running sgd.py as is), the script runs fine. However, I want to run this in parallel on a cluster of 5 nodes, so I added the following lines to the top of sgd.py to make the script run in parallel on the cluster:

import ray
ray.shutdown()
ray.init(address="auto")

To be clear, I am running the modified script on the head node.

What happens if you don’t include ray.shutdown()?

I’m still getting the same result even after removing ray.shutdown()

Could you share your cluster config? I wasn’t able to reproduce this on a 5-node cluster.

Hi Matthew,

The host operating system is Linux (Ubuntu 20.04).
The containers are Linux containers (LXC).

We’re developing this on the DNAnexus platform (https://documentation.dnanexus.com/). I’m spinning up a cluster of 5 identical EC2 instances. We opened up all the ports from 6500 to 65535 to see if that would resolve the issue, but it didn’t help.

Reproduction script

Run these commands in the nodes’ containers. On the head node:

ray start --head --node-ip-address="x.x.x.x" --port=6379 --dashboard-host=x.x.x.x --dashboard-port=443

and on the worker nodes:

ray start --address="$head_node_ip:6379" --node-ip-address="y.y.y.y"

Usually we set the head IP address x.x.x.x to 0.0.0.0, but I found that setting it to the container IP address 10.0.3.103 yields fewer issues. However, TuneGridSearchCV.fit still fails while tune.run is fine.

I should also add: I’m running Python 3.8 with Ray 1.13.0.

As a test, can you try opening up all the ports? It seems like the port used for object store transfer isn’t working properly.

Also, can you share what you’re doing in your tune.run script? Another thing you can try is tune.with_parameters, which goes through the same code path as the error you’re seeing.
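
A minimal sketch of what I mean by the tune.with_parameters suggestion; train_fn, the data array, and the config below are just placeholders, not from sgd.py:

import numpy as np
import ray
from ray import tune

def train_fn(config, data=None):
    # `data` is shipped through the object store by tune.with_parameters,
    # which is the same parameter-registry path that's failing for you.
    tune.report(mean=float(np.mean(data)) * config["scale"])

ray.init(address="auto")  # connect to the existing cluster
data = np.arange(100_000)
tune.run(
    tune.with_parameters(train_fn, data=data),
    config={"scale": tune.grid_search([1, 2, 3])},
)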

For the tune.run example, I’m running the basic example (tune_basic_example — Ray 1.13.0), but changing it so that it calls ray.init(address="auto") before it calls tune.run. So far, the basic example works pretty well. I’ll try using tune.with_parameters next. Concretely, my modified version looks roughly like the sketch below.
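
Here the objective and search space just stand in for the ones in tune_basic_example; the only real change is the ray.init call at the top:

import ray
from ray import tune

ray.init(address="auto")  # connect to the running cluster instead of starting a local Ray

# placeholder objective standing in for easy_objective from tune_basic_example
def objective(config):
    tune.report(mean_loss=(config["height"] - 14) ** 2 - abs(config["width"]))

tune.run(
    objective,
    config={"width": tune.uniform(0, 20), "height": tune.uniform(-100, 100)},
    num_samples=5,
)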

Unfortunately, opening ALL ports only leads to the same errors.

I believe you are right about the object store transfer not working properly. I noticed that the plasma percentages never change in the dashboard. I also ran into similar errors while running this example over a cluster: ray/train_fashion_mnist_example.py at master · ray-project/ray · GitHub

Is there a way to test the object store transfer?

Hey, looks like you were able to create a new topic here; that makes sense, as this is more of a Core issue.

Depending on the nature of the problem, you may be able to simplify the repro even further. In the example below, we force f to be executed on a worker node and try to access the result (from the object store) on the head node.

import ray
import numpy as np

ray.init(address="auto")  # connect to the running cluster from the head node

# Force f to run on a specific worker node by requesting that node's IP resource.
@ray.remote
def f():
    return np.arange(100_000)

ip_resource = "node:<WORKER_NODE_IP>"  # substitute the worker node's IP
# Fetching the result on the head node exercises object transfer between nodes.
result = ray.get(f.options(resources={ip_resource: 0.01}).remote())

I agree with you that this is more of a Core issue. Thank you for your help in diagnosing the true problem!

Problem solved. It turns out I needed to set the --node-ip-address argument to the head node's IP address in the ray start --head command, and to change the ray.init(address="auto") line to ray.init(address="auto", _node_ip_address="<head node's IP address>").
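
For anyone who hits the same thing, the working setup looks roughly like this (10.0.3.103 is the head container's IP in my setup; substitute your own). On the head node:

ray start --head --node-ip-address=10.0.3.103 --port=6379 --dashboard-host=10.0.3.103 --dashboard-port=443

and in the script:

import ray
ray.init(address="auto", _node_ip_address="10.0.3.103")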
