How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Description
Hello,
I am facing an issue when using a Ray strategy (from Ray Lightning) with my PyTorch Lightning Trainer. When I train my model on a sample of my dataset, it works perfectly, but when I use the full dataset, I hit a Socket Timeout.
Here is the full stack trace:
Traceback (most recent call last):
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\train_model_topology.py", line 86, in <module>
    train_model(config)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\train_model_topology.py", line 73, in train_model
    trainer.fit(model, train_data_loader, valid_data_loader)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\trainers\base_trainer.py", line 63, in fit
    super().fit(*args, **kwargs)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\launchers\ray_launcher.py", line 58, in launch
    ray_output = self.run_function_on_workers(
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\launchers\ray_launcher.py", line 249, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\util.py", line 64, in process_results
    ray.get(ready)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray\_private\worker.py", line 2280, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=19740, ip=127.0.0.1, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x00000285110DA9D0>)
  File "python\ray\_raylet.pyx", line 662, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 666, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 613, in ray._raylet.execute_task.function_executor
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray\_private\function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\launchers\utils.py", line 52, in execute
    return fn(*args, **kwargs)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\launchers\ray_launcher.py", line 295, in _wrapping_function
    self._strategy._worker_setup(process_idx=global_rank)
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\ray_ddp.py", line 192, in _worker_setup
    torch.distributed.init_process_group(
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 602, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 703, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
RuntimeError: Socket Timeout