Ray Tune trials fail due to unexpected worker exit

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, I am new to using Ray Tune for hyperparameter optimization. I followed this example - How to use Tune with PyTorch — Ray 2.10.0 - to write my own code for a hyperparameter sweep.
I also ran that tutorial unchanged in my environment and it worked fine, so the problem only appears when I replicate it with my own training code.

Here is the error log:
2024-03-28 13:08:49,895 ERROR tune_controller.py:1332 -- Trial task failed for trial train_raytune_36d7f_00000
Traceback (most recent call last):
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/_private/worker.py", line 2667, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/_private/worker.py", line 866, in get_objects
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 4d5b5d4750a02b9e81d49ecc01000000
pid: 468929
namespace: 5fec80a5-be18-4e76-b070-24d50c52c44a
ip: 10.141.26.182
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None. Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1883, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1984, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1889, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1830, in ray._raylet.execute_task.function_executor
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/_private/function_manager.py", line 724, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 334, in train
raise skipped from exception_cause(skipped)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/air/_internal/util.py", line 88, in run
self._ret = self._target(*self._args, **self._kwargs)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 53, in <lambda>
training_func=lambda: self._trainable_func(self.config),
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 261, in _trainable_func
output = fn()
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 130, in inner
return trainable(config, **fn_kwargs)
File "/tmp/ipykernel_459985/2726713512.py", line 91, in train_raytune
AttributeError: 'Tensor' object has no attribute 'numpu'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle.py", line 1245, in dump
return super().dump(obj)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/tblib/pickling_support.py", line 46, in pickle_exception
rv = obj.__reduce_ex__(3)
RecursionError: maximum recursion depth exceeded while calling a Python object

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 2281, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 2177, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1832, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1833, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 2071, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1089, in ray._raylet.store_task_errors
File "python/ray/_raylet.pyx", line 4575, in ray._raylet.CoreWorker.store_task_outputs
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/_private/serialization.py", line 494, in serialize
return self._serialize_to_msgpack(value)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/_private/serialization.py", line 449, in _serialize_to_msgpack
value = value.to_bytes()
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/exceptions.py", line 32, in to_bytes
serialized_exception=pickle.dumps(self),
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
cp.dump(obj)
File "/home/simran/anaconda3/envs/deepcrc/lib/python3.9/site-packages/ray/cloudpickle/cloudpickle.py", line 1249, in dump
raise pickle.PicklingError(msg) from e
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
An unexpected internal error occurred while the worker was executing a task.
2024-03-28 13:08:50,108 ERROR tune_controller.py:1332 -- Trial task failed for trial train_raytune_36d7f_00001
(The traceback for this trial is identical to the one above, ending in the same AttributeError: 'Tensor' object has no attribute 'numpu'.)
2024-03-28 13:08:50,139 INFO tune.py:1016 -- Wrote the latest version of all result files and experiment state to '/home/simran/ray_results/train_raytune_2024-03-28_13-08-43' in 0.0254s.
2024-03-28 13:08:50,148 ERROR tune.py:1044 -- Trials did not complete: [train_raytune_36d7f_00000, train_raytune_36d7f_00001]
2024-03-28 13:08:50,149 INFO tune.py:1048 -- Total run time: 6.22 seconds (6.16 seconds for the tuning loop).

Is there a typo somewhere in your trainable definition? The root cause buried in that log is the AttributeError raised at line 91 of your train_raytune function: 'Tensor' object has no attribute 'numpu'. I'm guessing that is expected to be numpy. The RecursionError/PicklingError afterwards is just Ray failing to serialize that exception back to the driver, which is why the trials show up as unexpected worker exits instead of a plain Python error.
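
In case it helps, here is a minimal, hypothetical sketch of the reporting step in a Tune function trainable. The function and metric names are placeholders (not your actual code); the point is to convert tensors to plain Python numbers with .item(), or .detach().cpu().numpy() for tensors that may be on GPU or still require grad, before handing them to train.report:

import torch
from ray import train, tune

def train_fn(config):
    # ... build the model/data from config and run the training loop (elided) ...
    val_loss = torch.tensor(0.42)   # placeholder for the real validation loss tensor
    accuracy = torch.tensor(0.87)   # placeholder for the real accuracy tensor

    # Your traceback points at a `.numpu` call around line 91; the usual pattern
    # is .item() for 0-d tensors (or .detach().cpu().numpy() for arrays) before
    # reporting metrics back to Tune:
    train.report({
        "loss": val_loss.detach().cpu().item(),
        "accuracy": accuracy.detach().cpu().item(),
    })

tuner = tune.Tuner(train_fn, param_space={"lr": tune.loguniform(1e-4, 1e-1)})
# results = tuner.fit()

If the fix doesn't resolve it, posting the body of your train_raytune function would make it easier to spot anything else.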