RuntimeError: Socket Timeout (ProcessGroupGloo)

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Description

Hello,

I am facing an issue when using a Ray Strategy (from Ray Lightning) with my PyTorch Lightning Trainer. When I train my model on a sample of my dataset, it works perfectly, but when I use the full dataset, I hit a socket timeout.
Here is the full stack trace:

Traceback (most recent call last):
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\train_model_topology.py", line 86, in <module>
train_model(config)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\train_model_topology.py", line 73, in train_model
trainer.fit(model, train_data_loader, valid_data_loader)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\trainers\base_trainer.py", line 63, in fit
super().fit(*args, **kwargs)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\launchers\ray_launcher.py", line 58, in launch
ray_output = self.run_function_on_workers(
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\launchers\ray_launcher.py", line 249, in run_function_on_workers
results = process_results(self._futures, self.tune_queue)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\util.py", line 64, in process_results
ray.get(ready)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray\_private\worker.py", line 2280, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=19740, ip=127.0.0.1, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x00000285110DA9D0>)
File "python\ray\_raylet.pyx", line 662, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 666, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 613, in ray._raylet.execute_task.function_executor
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray\_private\function_manager.py", line 674, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\launchers\utils.py", line 52, in execute
return fn(*args, **kwargs)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\launchers\ray_launcher.py", line 295, in _wrapping_function
self._strategy._worker_setup(process_idx=global_rank)
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\ray_lightning\ray_ddp.py", line 192, in _worker_setup
torch.distributed.init_process_group(
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 602, in init_process_group
default_pg = _new_process_group_helper(
File "C:\Users\Leonard.Caquot\PycharmProjects\ai-developments\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 703, in _new_process_group_helper
pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
RuntimeError: Socket Timeout

Hey @lcaquot, few questions:

  1. Could you share a reproduction script?
  2. Does this work with pure PyTorch Lightning (i.e., without Ray)?
  3. How large is your dataset?

If your dataset is large, the timeout parameter of torch.distributed.init_process_group might be too low.
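For reference, in plain PyTorch the timeout is a `datetime.timedelta` passed to `init_process_group`. A minimal sketch (the commented call assumes torch is installed; the 60-minute value is purely illustrative, the Gloo default is 30 minutes):

```python
import datetime

# torch.distributed.init_process_group accepts a `timeout` keyword
# as a datetime.timedelta; pick something larger than the default
# if process-group setup is slow on a big run.
timeout = datetime.timedelta(minutes=60)

# Sketch only (not executed here):
# import torch.distributed as dist
# dist.init_process_group(backend="gloo", timeout=timeout, ...)
```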

Hello bveeramani,

I am sorry, but the bug is quite difficult to reproduce, as it depends on the dataset that is used.
To answer the two other questions: it works perfectly with pure PyTorch Lightning, but in that case I don't use distributed training at all. It could be interesting to try PL with distribution on SLURM, for example, and see if the runtime error is still there.

The dataset consists of graphs (I am working on Graph Neural Networks): approximately 20k graphs, with around 10 to 20 nodes per graph. Each node's initial embedding is sized roughly 1x150.

So I have around 300,000 input tensors sized 1x150 in my training dataset. When I reduce the dataset size, I manage to make it work with Ray…
Do you know if it is possible to change the timeout parameter of torch.distributed.init_process_group?

Thank you

Thanks for the info!

Unfortunately, I don't think there's any way to change the timeout parameter with Ray-Lightning. Seems like the parameter isn't exposed: ray_lightning/ray_ddp.py at 34e6443e939f4d4895007db444ccd4c886e20d30 · ray-project/ray_lightning · GitHub.
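Since the parameter isn't exposed, one possible (untested) workaround is to monkey-patch `init_process_group` before training starts, so that calls which omit `timeout` get a larger default. The helper below is hypothetical, not part of ray_lightning or torch:

```python
import datetime
import functools


def with_default_timeout(init_fn, minutes=60):
    """Wrap an init_process_group-like callable so that calls
    omitting `timeout` receive a longer default (hypothetical helper)."""
    @functools.wraps(init_fn)
    def wrapper(*args, **kwargs):
        # Only fill in `timeout` when the caller didn't pass one.
        kwargs.setdefault("timeout", datetime.timedelta(minutes=minutes))
        return init_fn(*args, **kwargs)
    return wrapper


# Usage sketch (assumes torch is installed; apply before trainer.fit):
# import torch.distributed
# torch.distributed.init_process_group = with_default_timeout(
#     torch.distributed.init_process_group, minutes=60)
```

Because ray_lightning calls `torch.distributed.init_process_group` on each worker, the patch would need to run inside the worker processes too, which makes this fragile; exposing the parameter upstream would be the cleaner fix.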