How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
Most of the Ray Tune examples work for me (ray/python/ray/tune/examples at master · ray-project/ray · GitHub); however, when I try to run a tune-sklearn example that uses TuneGridSearchCV, I get errors (for example, sgd.py in the tune-sklearn GitHub repository).
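For reference, here is roughly what that example does (a minimal sketch written from memory of tune-sklearn's sgd.py, so the exact dataset sizes and parameter grid may differ from the real script):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from tune_sklearn import TuneGridSearchCV

# Synthetic classification data, split into train and test sets.
X, y = make_classification(n_samples=11000, n_features=1000, n_classes=10,
                           n_informative=50, n_redundant=0, class_sep=2.5)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Small grid over SGDClassifier hyperparameters.
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}

tune_search = TuneGridSearchCV(
    SGDClassifier(),
    parameter_grid,
    early_stopping=True,  # let Tune stop unpromising trials early
    max_iters=10,
)

# This is the call that fails for me (sgd.py line 33 in the traceback below).
tune_search.fit(x_train, y_train)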
I am running in an LXC container (Ubuntu 20.04) with Ray 1.13.0. The error is shown below, and I've also attached the log file:
2022-06-15 21:47:10,529 ERROR trial_runner.py:883 -- Trial _Trainable_b3fe3_00002: Error processing event.
Traceback (most recent call last):
File "sgd.py", line 33, in <module>
tune_search.fit(x_train, y_train)
File "/usr/local/lib/python3.8/dist-packages/tune_sklearn/tune_basesearch.py", line 622, in fit
return self._fit(X, y, groups, tune_params, **fit_params)
File "/usr/local/lib/python3.8/dist-packages/tune_sklearn/tune_basesearch.py", line 533, in _fit
self.analysis_ = self._tune_run(X, y, config, resources_per_trial,
File "/usr/local/lib/python3.8/dist-packages/tune_sklearn/tune_gridsearch.py", line 302, in _tune_run
analysis = tune.run(trainable, **run_args)
File "/usr/local/lib/python3.8/dist-packages/ray/tune/tune.py", line 718, in run
runner.step()
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 778, in step
self._wait_and_handle_event(next_trial)
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 755, in _wait_and_handle_event
raise e
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 736, in _wait_and_handle_event
self._on_executor_error(trial, result[ExecutorEvent.KEY_EXCEPTION])
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trial_runner.py", line 884, in _on_executor_error
raise e
ray.tune.error.TuneGetNextExecutorEventError: Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/ray/tune/ray_trial_executor.py", line 934, in get_next_executor_event
future_result = ray.get(ready_future)
File "/usr/local/lib/python3.8/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 1833, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=5392, ip=ec2-44-201-222-82.compute-1.amazonaws.com, repr=<ray.tune.utils.trainable._Trainable object at 0x7f00278cea60>)
File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable.py", line 156, in __init__
self.setup(copy.deepcopy(self.config))
File "/usr/local/lib/python3.8/dist-packages/ray/tune/utils/trainable.py", line 389, in setup
setup_kwargs[k] = parameter_registry.get(prefix + k)
File "/usr/local/lib/python3.8/dist-packages/ray/tune/registry.py", line 225, in get
return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff2600000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*26000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.3.103) for more information about the Python worker failure.
(_Trainable pid=5392) 2022-06-15 21:47:10,520 WARNING worker.py:1829 -- Local object store memory usage:
(_Trainable pid=5392)
(_Trainable pid=5392) (global lru) capacity: 4633767936
(_Trainable pid=5392) (global lru) used: 0%
(_Trainable pid=5392) (global lru) num objects: 0
(_Trainable pid=5392) (global lru) num evictions: 0
(_Trainable pid=5392) (global lru) bytes evicted: 0
(_Trainable pid=5392)
(_Trainable pid=5392) 2022-06-15 21:47:10,522 ERROR worker.py:451 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=5392, ip=ec2-44-201-222-82.compute-1.amazonaws.com, repr=<ray.tune.utils.trainable._Trainable object at 0x7f00278cea60>)
(_Trainable pid=5392) File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable.py", line 156, in __init__
(_Trainable pid=5392) self.setup(copy.deepcopy(self.config))
(_Trainable pid=5392) File "/usr/local/lib/python3.8/dist-packages/ray/tune/utils/trainable.py", line 389, in setup
(_Trainable pid=5392) setup_kwargs[k] = parameter_registry.get(prefix + k)
(_Trainable pid=5392) File "/usr/local/lib/python3.8/dist-packages/ray/tune/registry.py", line 225, in get
(_Trainable pid=5392) return ray.get(self.references[k])
(_Trainable pid=5392) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff2600000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(_Trainable pid=5392)
(_Trainable pid=5392) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*26000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.3.103) for more information about the Python worker failure.
(_Trainable pid=6191) 2022-06-15 21:47:11,004 WARNING worker.py:1829 -- Local object store memory usage:
(_Trainable pid=6191)
(_Trainable pid=6191) (global lru) capacity: 4631811686
(_Trainable pid=6191) (global lru) used: 0%
(_Trainable pid=6191) (global lru) num objects: 0
(_Trainable pid=6191) (global lru) num evictions: 0
(_Trainable pid=6191) (global lru) bytes evicted: 0
(_Trainable pid=6191)
(_Trainable pid=6191) 2022-06-15 21:47:11,006 ERROR worker.py:451 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=6191, ip=ec2-44-204-187-139.compute-1.amazonaws.com, repr=<ray.tune.utils.trainable._Trainable object at 0x7f53b96e1a60>)
(_Trainable pid=6191) File "/usr/local/lib/python3.8/dist-packages/ray/tune/trainable.py", line 156, in __init__
(_Trainable pid=6191) self.setup(copy.deepcopy(self.config))
(_Trainable pid=6191) File "/usr/local/lib/python3.8/dist-packages/ray/tune/utils/trainable.py", line 389, in setup
(_Trainable pid=6191) setup_kwargs[k] = parameter_registry.get(prefix + k)
(_Trainable pid=6191) File "/usr/local/lib/python3.8/dist-packages/ray/tune/registry.py", line 225, in get
(_Trainable pid=6191) return ray.get(self.references[k])
(_Trainable pid=6191) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff2600000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(_Trainable pid=6191)
(_Trainable pid=6191) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*26000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.3.103) for more information about the Python worker failure.
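If it helps, I can re-run with ObjectRef creation-site recording enabled, as the error message suggests. My understanding (an assumption based only on that message) is that RAY_record_ref_creation_sites=1 needs to be set before `ray start` on each node and before ray.init() on the driver, along these lines:

import os

# Assumption: the variable must be in the environment before Ray is initialized;
# on a cluster it also has to be exported before running `ray start` on every node.
os.environ["RAY_record_ref_creation_sites"] = "1"

import ray
ray.init()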