Hi,
I’m using Ray Train with PyTorch and I’m hitting the error below. The same code runs fine without Ray Train, so what is going on?
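The full training script is long, so here is a stripped-down sketch of what it does (dummy tensors stand in for the real A2D2 dataset, and the real script reads batch size, worker count, learning rate, etc. from the command line shown in the log below):

import torch
from torch.utils.data import DataLoader, TensorDataset

import ray
from ray.train import Trainer
from ray.train.torch import prepare_data_loader, prepare_model


def train_func():
    # Dummy data standing in for the real dataset (same 604x960 image size).
    images = torch.randn(50, 3, 604, 960)
    masks = torch.randint(0, 10, (50, 604, 960))
    dataset = TensorDataset(images, masks)

    # Same DataLoader settings as the real script: --batch 5, --workers 10.
    loader = DataLoader(dataset, batch_size=5, num_workers=10)
    loader = prepare_data_loader(loader)

    model = torch.hub.load("pytorch/vision:v0.9.1",
                           "deeplabv3_resnet101", pretrained=True)
    model = prepare_model(model)

    for epoch in range(1):
        for i, batch in enumerate(loader):  # the real script fails on this iteration
            pass


if __name__ == "__main__":
    ray.init()
    trainer = Trainer(backend="torch", num_workers=1, use_gpu=True)
    trainer.start()
    trainer.run(train_func)
    trainer.shutdown()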
f9g8vvlswy-algo-1-r36tx | Invoking script with the following command:
f9g8vvlswy-algo-1-r36tx |
f9g8vvlswy-algo-1-r36tx | /opt/conda/bin/python3.6 train_ray.py --batch 5 --bucket pdx-sagemaker-a2d2-test --cache /opt/ml/input/data/dataset --epochs 10 --height 604 --log-freq 500 --lr 0.183 --lr_decay_per_epoch 0.3 --lr_warmup_ratio 0.1 --momentum 0.928 --network deeplabv3_resnet101 --prefetch 2 --width 960 --workers 10
f9g8vvlswy-algo-1-r36tx |
f9g8vvlswy-algo-1-r36tx |
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) In epoch 0 learning rate: 0.0183000000
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) [2022-01-07 15:56:18.517 algo-1-r36tx:242 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) [2022-01-07 15:56:18.601 algo-1-r36tx:242 INFO profiler_config_parser.py:102] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
f9g8vvlswy-algo-1-r36tx | 2022-01-07 15:56:06,876 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) 2022-01-07 15:56:09,658 INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=1]
f9g8vvlswy-algo-1-r36tx | 2022-01-07 15:56:10,069 INFO trainer.py:178 -- Run results will be logged in: /root/ray_results/train_2022-01-07_15-56-05/run_001
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /root/.cache/torch/hub/v0.9.1.zip
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) Downloading: "https://download.pytorch.org/models/resnet101-5d3b4d8f.pth" to /root/.cache/torch/hub/checkpoints/resnet101-5d3b4d8f.pth
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) 100%|██████████| 170M/170M [00:00<00:00, 256MB/s]
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) 2022-01-07 15:56:13,300 INFO torch.py:239 -- Moving model to device: cuda:0
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242) ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
f9g8vvlswy-algo-1-r36tx | (BaseWorkerMixin pid=242)
f9g8vvlswy-algo-1-r36tx | Traceback (most recent call last):
f9g8vvlswy-algo-1-r36tx | File "train_ray.py", line 221, in <module>
f9g8vvlswy-algo-1-r36tx | trainer.run(train_func)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/trainer.py", line 281, in run
f9g8vvlswy-algo-1-r36tx | for intermediate_result in iterator:
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/trainer.py", line 651, in __next__
f9g8vvlswy-algo-1-r36tx | self._finish_training)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/trainer.py", line 620, in _run_with_error_handling
f9g8vvlswy-algo-1-r36tx | return func()
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/trainer.py", line 721, in _finish_training
f9g8vvlswy-algo-1-r36tx | return ray.get(self._backend_executor_actor.finish_training.remote())
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
f9g8vvlswy-algo-1-r36tx | return func(*args, **kwargs)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/worker.py", line 1713, in get
f9g8vvlswy-algo-1-r36tx | raise value.as_instanceof_cause()
f9g8vvlswy-algo-1-r36tx | ray.exceptions.RayTaskError(RuntimeError): ray::BackendExecutor.finish_training() (pid=171, ip=172.18.0.2, repr=<ray.train.backend.BackendExecutor object at 0x7fd1872fc048>)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/backend.py", line 507, in finish_training
f9g8vvlswy-algo-1-r36tx | results = self.get_with_failure_handling(futures)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/backend.py", line 526, in get_with_failure_handling
f9g8vvlswy-algo-1-r36tx | success, failed_worker_indexes = check_for_failure(remote_values)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/utils.py", line 42, in check_for_failure
f9g8vvlswy-algo-1-r36tx | ray.get(object_ref)
f9g8vvlswy-algo-1-r36tx | ray.exceptions.RayTaskError(RuntimeError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=242, ip=172.18.0.2, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7fad2f577e80>)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/worker_group.py", line 26, in __execute
f9g8vvlswy-algo-1-r36tx | return func(*args, **kwargs)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/backend.py", line 498, in end_training
f9g8vvlswy-algo-1-r36tx | output = session.finish()
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/session.py", line 102, in finish
f9g8vvlswy-algo-1-r36tx | func_output = self.training_thread.join()
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/utils.py", line 94, in join
f9g8vvlswy-algo-1-r36tx | raise self.exc
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/utils.py", line 87, in run
f9g8vvlswy-algo-1-r36tx | self.ret = self._target(*self._args, **self._kwargs)
f9g8vvlswy-algo-1-r36tx | File "train_ray.py", line 162, in train_func
f9g8vvlswy-algo-1-r36tx | for i, batch in enumerate(train_loader):
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/ray/train/torch.py", line 192, in __iter__
f9g8vvlswy-algo-1-r36tx | for item in iterator:
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 525, in __next__
f9g8vvlswy-algo-1-r36tx | (data, worker_id) = self._next_data()
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1273, in _next_data
f9g8vvlswy-algo-1-r36tx | return (self._process_data(data), w_id)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1299, in _process_data
f9g8vvlswy-algo-1-r36tx | data.reraise()
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 429, in reraise
f9g8vvlswy-algo-1-r36tx | raise self.exc_type(msg)
f9g8vvlswy-algo-1-r36tx | RuntimeError: Caught RuntimeError in DataLoader worker process 0.
f9g8vvlswy-algo-1-r36tx | Original Traceback (most recent call last):
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 210, in _worker_loop
f9g8vvlswy-algo-1-r36tx | data = fetcher.fetch(index)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
f9g8vvlswy-algo-1-r36tx | return self.collate_fn(data)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in default_collate
f9g8vvlswy-algo-1-r36tx | return [default_collate(samples) for samples in transposed]
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 83, in <listcomp>
f9g8vvlswy-algo-1-r36tx | return [default_collate(samples) for samples in transposed]
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 53, in default_collate
f9g8vvlswy-algo-1-r36tx | storage = elem.storage()._new_shared(numel)
f9g8vvlswy-algo-1-r36tx | File "/opt/conda/lib/python3.6/site-packages/torch/storage.py", line 157, in _new_shared
f9g8vvlswy-algo-1-r36tx | return cls._new_using_fd(size)
f9g8vvlswy-algo-1-r36tx | RuntimeError: unable to write to file </torch_1602_2842463136>
f9g8vvlswy-algo-1-r36tx |
f9g8vvlswy-algo-1-r36tx | 2022-01-07 15:56:26,407 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
f9g8vvlswy-algo-1-r36tx | Command "/opt/conda/bin/python3.6 train_ray.py --batch 5 --bucket pdx-sagemaker-a2d2-test --cache /opt/ml/input/data/dataset --epochs 10 --height 604 --log-freq 500 --lr 0.183 --lr_decay_per_epoch 0.3 --lr_warmup_ratio 0.1 --momentum 0.928 --network deeplabv3_resnet101 --prefetch 2 --width 960 --workers 10"
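For context, I launch this with the SageMaker Python SDK in local mode, roughly like this (the role, framework/Python versions, and data path below are placeholders, and only a few of the hyperparameters are shown):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_ray.py",
    source_dir=".",
    role="<execution-role-arn>",   # placeholder
    framework_version="1.8.1",     # adjust to the actual container version
    py_version="py36",
    instance_count=1,
    instance_type="local_gpu",     # local mode, hence the docker-compose style output above
    hyperparameters={
        "batch": 5,
        "workers": 10,
        "epochs": 10,
        "network": "deeplabv3_resnet101",
        "height": 604,
        "width": 960,
        "bucket": "pdx-sagemaker-a2d2-test",
    },
)
estimator.fit({"dataset": "file:///path/to/a2d2"})  # placeholder local path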