Ray Train RuntimeError: unable to write to file </torch_1602_2842463136>

Hi,
I’m using Ray Train with PyTorch and I’m hitting the error below. What is going on? The same code works fine without Ray Train.

f9g8vvlswy-algo-1-r36tx | Invoking script with the following command:
f9g8vvlswy-algo-1-r36tx | 
f9g8vvlswy-algo-1-r36tx | /opt/conda/bin/python3.6 train_ray.py --batch 5 --bucket pdx-sagemaker-a2d2-test --cache /opt/ml/input/data/dataset --epochs 10 --height 604 --log-freq 500 --lr 0.183 --lr_decay_per_epoch 0.3 --lr_warmup_ratio 0.1 --momentum 0.928 --network deeplabv3_resnet101 --prefetch 2 --width 960 --workers 10
f9g8vvlswy-algo-1-r36tx | 
f9g8vvlswy-algo-1-r36tx | 

f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) In epoch 0 learning rate: 0.0183000000
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) [2022-01-07 15:56:18.517 algo-1-r36tx:242 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) [2022-01-07 15:56:18.601 algo-1-r36tx:242 INFO profiler_config_parser.py:102] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
f9g8vvlswy-algo-1-r36tx | 2022-01-07 15:56:06,876	WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) 2022-01-07 15:56:09,658	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=1]
f9g8vvlswy-algo-1-r36tx | 2022-01-07 15:56:10,069	INFO trainer.py:178 -- Run results will be logged in: /root/ray_results/train_2022-01-07_15-56-05/run_001
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /root/.cache/torch/hub/v0.9.1.zip
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) Downloading: "https://download.pytorch.org/models/resnet101-5d3b4d8f.pth" to /root/.cache/torch/hub/checkpoints/resnet101-5d3b4d8f.pth
  0%|          | 0.00/170M [00:00<?, ?B/s]
 10%|▉         | 16.6M/170M [00:00<00:00, 174MB/s]
 25%|██▍       | 41.8M/170M [00:00<00:00, 227MB/s]
 40%|███▉      | 67.6M/170M [00:00<00:00, 247MB/s]
 55%|█████▍    | 93.6M/170M [00:00<00:00, 257MB/s]
 70%|██████▉   | 119M/170M [00:00<00:00, 260MB/s]
 85%|████████▍ | 145M/170M [00:00<00:00, 264MB/s]
100%|██████████| 170M/170M [00:00<00:00, 256MB/s]
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) 2022-01-07 15:56:13,300	INFO torch.py:239 -- Moving model to device: cuda:0
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) 
f9g8vvlswy-algo-1-r36tx | Traceback (most recent call last):
f9g8vvlswy-algo-1-r36tx |   File "train_ray.py", line 221, in <module>
f9g8vvlswy-algo-1-r36tx |     trainer.run(train_func)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/trainer.py", line 281, in run
f9g8vvlswy-algo-1-r36tx |     for intermediate_result in iterator:
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/trainer.py", line 651, in __next__
f9g8vvlswy-algo-1-r36tx |     self._finish_training)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/trainer.py", line 620, in _run_with_error_handling
f9g8vvlswy-algo-1-r36tx |     return func()
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/trainer.py", line 721, in _finish_training
f9g8vvlswy-algo-1-r36tx |     return ray.get(self._backend_executor_actor.finish_training.remote())
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
f9g8vvlswy-algo-1-r36tx |     return func(*args, **kwargs)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/worker.py", line 1713, in get
f9g8vvlswy-algo-1-r36tx |     raise value.as_instanceof_cause()
f9g8vvlswy-algo-1-r36tx | ray.exceptions.RayTaskError(RuntimeError): ray::BackendExecutor.finish_training() (pid=171, ip=172.18.0.2, repr=<ray.train.backend.BackendExecutor object at 0x7fd1872fc048>)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/backend.py", line 507, in finish_training
f9g8vvlswy-algo-1-r36tx |     results = self.get_with_failure_handling(futures)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/backend.py", line 526, in get_with_failure_handling
f9g8vvlswy-algo-1-r36tx |     success, failed_worker_indexes = check_for_failure(remote_values)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/utils.py", line 42, in check_for_failure
f9g8vvlswy-algo-1-r36tx |     ray.get(object_ref)
f9g8vvlswy-algo-1-r36tx | ray.exceptions.RayTaskError(RuntimeError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=242, ip=172.18.0.2, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7fad2f577e80>)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/worker_group.py", line 26, in __execute
f9g8vvlswy-algo-1-r36tx |     return func(*args, **kwargs)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/backend.py", line 498, in end_training
f9g8vvlswy-algo-1-r36tx |     output = session.finish()
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/session.py", line 102, in finish
f9g8vvlswy-algo-1-r36tx |     func_output = self.training_thread.join()
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/utils.py", line 94, in join
f9g8vvlswy-algo-1-r36tx |     raise self.exc
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/utils.py", line 87, in run
f9g8vvlswy-algo-1-r36tx |     self.ret = self._target(*self._args, **self._kwargs)
f9g8vvlswy-algo-1-r36tx |   File "train_ray.py", line 162, in train_func
f9g8vvlswy-algo-1-r36tx |     for i, batch in enumerate(train_loader):
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/ray/train/torch.py", line 192, in __iter__
f9g8vvlswy-algo-1-r36tx |     for item in iterator:
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 525, in __next__
f9g8vvlswy-algo-1-r36tx |     (data, worker_id) = self._next_data()
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1273, in _next_data
f9g8vvlswy-algo-1-r36tx |     return (self._process_data(data), w_id)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1299, in _process_data
f9g8vvlswy-algo-1-r36tx |     data.reraise()
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 429, in reraise
f9g8vvlswy-algo-1-r36tx |     raise self.exc_type(msg)
f9g8vvlswy-algo-1-r36tx | RuntimeError: Caught RuntimeError in DataLoader worker process 0.
f9g8vvlswy-algo-1-r36tx | Original Traceback (most recent call last):
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 210, in _worker_loop
f9g8vvlswy-algo-1-r36tx |     data = fetcher.fetch(index)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
f9g8vvlswy-algo-1-r36tx |     return self.collate_fn(data)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in default_collate
f9g8vvlswy-algo-1-r36tx |     return [default_collate(samples) for samples in transposed]
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in <listcomp>
f9g8vvlswy-algo-1-r36tx |     return [default_collate(samples) for samples in transposed]
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 53, in default_collate
f9g8vvlswy-algo-1-r36tx |     storage = elem.storage()._new_shared(numel)
f9g8vvlswy-algo-1-r36tx |   File "/opt/conda/lib/python3.6/site-packages/torch/storage.py", line 157, in _new_shared
f9g8vvlswy-algo-1-r36tx |     return cls._new_using_fd(size)
f9g8vvlswy-algo-1-r36tx | RuntimeError: unable to write to file </torch_1602_2842463136>
f9g8vvlswy-algo-1-r36tx | 
f9g8vvlswy-algo-1-r36tx | 2022-01-07 15:56:26,407 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
f9g8vvlswy-algo-1-r36tx | Command "/opt/conda/bin/python3.6 train_ray.py --batch 5 --bucket pdx-sagemaker-a2d2-test --cache /opt/ml/input/data/dataset --epochs 10 --height 604 --log-freq 500 --lr 0.183 --lr_decay_per_epoch 0.3 --lr_warmup_ratio 0.1 --momentum 0.928 --network deeplabv3_resnet101 --prefetch 2 --width 960 --workers 10"

Hmm, when you say the same code works without Ray Train, are you able to execute it with just the following modification?

- trainer.run(train_func)
+ train_func()
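
To make the suggestion concrete, here is a rough sketch of the two paths. The Trainer arguments and the empty train_func body are placeholders, not taken from your script:

```python
from ray.train import Trainer

def train_func():
    # your existing PyTorch training loop goes here
    # (dataset, DataLoader, model, optimizer, ...)
    pass

# Current path, going through Ray Train:
trainer = Trainer(backend="torch", num_workers=1, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()

# Suggested sanity check: call the same function directly in the same
# container, with no Ray workers involved.
# train_func()
```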

From the logs there is some evidence that the overall shared memory is small, at 64 MiB. I’m not sure how much the DataLoader requires, but it is possible that the overhead of running Ray puts you over the limit. Are you able to increase the shared memory size?
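
For reference, 67108864 bytes is 64 MiB, which is Docker’s default /dev/shm size. A quick way to confirm what the container actually sees (a generic snippet, not specific to Ray or your training script):

```python
import shutil

# Report the size of the shared-memory filesystem that DataLoader worker
# processes use when they allocate batch tensors.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, "
      f"used={used / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
```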

I meant that it works before I apply the six steps needed to convert the code to Ray Train (screenshot from a Slack thread).

As @matthewdeng suggested, this might be because there is not enough shared memory to use multiple workers for data loading.

Does this work if you set num_workers=0 in your DataLoader?
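
For reference, a minimal sketch of that change; the dataset here is a stand-in, not your A2D2 input pipeline:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset, only to illustrate the DataLoader arguments.
dataset = TensorDataset(
    torch.randn(16, 3, 604, 960),
    torch.zeros(16, dtype=torch.long),
)

train_loader = DataLoader(
    dataset,
    batch_size=5,
    shuffle=True,
    num_workers=0,  # load batches in the main process; no worker
                    # processes, so no shared-memory tensors in /dev/shm
)

for images, labels in train_loader:
    break  # plug your training step in here
```

If the error goes away with num_workers=0, that points at the shared-memory limit rather than Ray Train itself, and increasing --shm-size should let you bring the worker processes back.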

There’s also this thread that has some suggestions: Training crashes due to - Insufficient shared memory (shm) - nn.DataParallel - #17 by ggjy - vision - PyTorch Forums.