Ray Train code works locally, not in SageMaker PyTorch job

Hi,

I have a Ray Train PyTorch script that runs fine on a g4dn.12xlarge SageMaker Notebook instance (4 T4 GPUs).

The same Ray Train script, launched in a SageMaker PyTorch Training container,
prints the logs below and then goes silent for 15 minutes, with CPU at 350% and GPUs at 0%…

Both the SageMaker Notebook instance and the Training job instance use PyTorch 1.8.1 and Ray 1.9.1.

The script is invoked with the following command:

/opt/conda/bin/python3.6 train-ray.py --batch 12 --bucket xxxxxxxxxxxxxxx --cache /opt/ml/input/data/dataset --epochs 10 --height 604 --log-freq 500 --lr 0.183 --lr_decay_per_epoch 0.3 --lr_warmup_ratio 0.1 --momentum 0.928 --prefetch 1 --width 960 --workers 12

(BaseWorkerMixin pid=165) algo-1:165:1617 [0] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
(BaseWorkerMixin pid=165) NCCL version 2.7.8+cuda11.1
(BaseWorkerMixin pid=206) algo-1:206:1620 [3] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported
(BaseWorkerMixin pid=210) algo-1:210:1619 [2] ofi_init:1136 NCCL WARN NET/OFI Only EFA provider is supported

Does this ring a bell for anyone? Why (and when) would code run fine on a SageMaker Notebook instance and then hang forever in a SageMaker Training job?

Is it also hanging if you do CPU-only training, or if you use the gloo backend instead of nccl?

I’m thinking it might be due to some NCCL differences between the two.

How can I ask it to run on CPU only? use_gpu=False?

I tried switching to PyTorch 1.9.0, which gives me the error below everywhere (both on the Notebook instance and in the Training job this time!). It’s quite a mess :frowning:

By any chance, do you have samples or repos showing successful use of Ray Train on EC2, or even better on the DLAMI?

(top of the stack trace only)

(BaseWorkerMixin pid=16881) 2022-01-11 20:50:23,782	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=16925) 2022-01-11 20:50:23,782	INFO torch.py:67 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=16886) 2022-01-11 20:50:23,782	INFO torch.py:67 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=16922) 2022-01-11 20:50:23,784	INFO torch.py:67 -- Setting up process group for: env:// [rank=1, world_size=4]
2022-01-11 20:50:24,959	INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-11_20-50-19/run_001
2022-01-11 20:50:25,689	WARNING worker.py:1245 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 618, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 659, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 625, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 629, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 578, in ray._raylet.execute_task.function_executor
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/_private/function_manager.py", line 609, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/worker_group.py", line 26, in __execute
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 498, in end_training
    output = session.finish()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/session.py", line 102, in finish
    func_output = self.training_thread.join()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 94, in join
    raise self.exc
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 87, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "a2d2_code/train-ray.py", line 123, in train_func
    "pytorch/vision:v0.10.0", args.network, pretrained=False, num_classes=args.classes
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/hub.py", line 362, in load
    repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/hub.py", line 162, in _get_cache_or_reload
    _validate_not_a_forked_repo(repo_owner, repo_name, branch)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/hub.py", line 124, in _validate_not_a_forked_repo
    with urlopen(url) as r:
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: rate limit exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 759, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 580, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 714, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1854, in ray._raylet.CoreWorker.store_task_outputs
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 361, in serialize
    return self._serialize_to_msgpack(value)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 317, in _serialize_to_msgpack
    value = value.to_bytes()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/exceptions.py", line 22, in to_bytes
    serialized_exception=pickle.dumps(self),
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
TypeError: cannot serialize '_io.BufferedReader' object
An unexpected internal error occurred while the worker was executing a task.
(the same warning and traceback are repeated three more times, once for each of the other workers)

(BaseWorkerMixin pid=16881) 2022-01-11 20:50:25,681	ERROR worker.py:431 -- SystemExit was raised from the worker
(the worker then prints the same traceback as above)

Yep, for CPU training you just have to specify use_gpu=False.
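
For example, here is a minimal sketch with the Ray 1.9 Trainer API (num_workers is an arbitrary choice here), assuming the default behavior where the torch backend picks gloo when use_gpu=False:

from ray.train import Trainer

# CPU-only training: no GPUs are requested, and the process group
# should fall back to the gloo backend instead of nccl.
trainer = Trainer(backend="torch", num_workers=4, use_gpu=False)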

It seems like the Torch Hub rate-limiting error you are seeing is a known issue with PyTorch 1.9 (HTTP Error 403: rate limit exceeded when loading model · Issue #4156 · pytorch/vision · GitHub), but it works with torch 1.8.
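
One way to sidestep the torch.hub GitHub API call entirely (not discussed in this thread; a hypothetical sketch that assumes the network is one of torchvision's segmentation models, e.g. fcn_resnet50) is to build the model from the locally installed torchvision:

import torchvision.models.segmentation as seg_models

def build_model(network: str, num_classes: int):
    # Instantiate the model directly from torchvision rather than torch.hub,
    # so no GitHub API request (and therefore no rate limit) is involved.
    model_fn = getattr(seg_models, network)
    return model_fn(pretrained=False, num_classes=num_classes)

model = build_model("fcn_resnet50", num_classes=21)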

Regarding EC2, Ray Train should work there, and I have personally run multi-node GPU training successfully on EC2 with the Deep Learning AMI.

+1 to what Amog has already said. While our tests and examples do run on EC2, they may feel excessive as a starting point. Here’s a really simple example that you should be able to use to confirm that Ray Train works on EC2 (with GPUs)!

  1. torch-multi-gpu.yaml:
cluster_name: torch-multi-gpu

max_workers: 2

provider:
    type: aws
    region: us-west-1

auth:
    ssh_user: ubuntu

available_node_types:
    2_gpu_node: 
        min_workers: 1
        max_workers: 2
        node_config:
            InstanceType: g3.8xlarge
            ImageId: latest_dlami
        resources: {}

head_node_type: 2_gpu_node


setup_commands:
    - pip install -U ray torch 
  2. gpu-example.py:
import ray
import ray.train as train
from ray.train import Trainer
import torch

from ray.train.torch import TorchConfig

def train_func():
    # Setup model.
    model = torch.nn.Linear(1, 1)
    model = train.torch.prepare_model(model)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    # Setup data.
    input = torch.randn(1000, 1)
    labels = input * 2
    dataset = torch.utils.data.TensorDataset(input, labels)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
    dataloader = train.torch.prepare_data_loader(dataloader)

    # Train.
    for _ in range(5):
        for X, y in dataloader:
            pred = model(X)
            loss = loss_fn(pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    print(model.state_dict())

ray.init(address="auto")
trainer = Trainer(backend=TorchConfig(), num_workers=4, use_gpu=True)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
~ ray up torch-multi-gpu.yaml
...
~ ray submit torch-multi-gpu.yaml gpu-example.py
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 13.56.76.180
Fetched IP: 13.56.76.180
2022-01-11 23:50:33,462	INFO worker.py:843 -- Connecting to existing Ray cluster at address: 10.0.1.144:6379
2022-01-11 23:50:33,605	INFO trainer.py:172 -- Trainer logs will be logged in: /home/ubuntu/ray_results/train_2022-01-11_23-50-33
(BaseWorkerMixin pid=44058) 2022-01-11 23:50:37,722	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=44057) 2022-01-11 23:50:37,723	INFO torch.py:67 -- Setting up process group for: env:// [rank=1, world_size=4]
(BaseWorkerMixin pid=5629, ip=10.0.1.134) 2022-01-11 23:50:37,721	INFO torch.py:67 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=5628, ip=10.0.1.134) 2022-01-11 23:50:37,735	INFO torch.py:67 -- Setting up process group for: env:// [rank=2, world_size=4]
2022-01-11 23:50:38,746	INFO trainer.py:178 -- Run results will be logged in: /home/ubuntu/ray_results/train_2022-01-11_23-50-33/run_001
(BaseWorkerMixin pid=5629, ip=10.0.1.134) 2022-01-11 23:50:38,788	INFO torch.py:239 -- Moving model to device: cuda:1
(BaseWorkerMixin pid=5628, ip=10.0.1.134) 2022-01-11 23:50:38,788	INFO torch.py:239 -- Moving model to device: cuda:0
(BaseWorkerMixin pid=44058) 2022-01-11 23:50:38,862	INFO torch.py:239 -- Moving model to device: cuda:0
(BaseWorkerMixin pid=44057) 2022-01-11 23:50:38,782	INFO torch.py:239 -- Moving model to device: cuda:1
(BaseWorkerMixin pid=44057) 2022-01-11 23:50:42,098	INFO torch.py:242 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=5629, ip=10.0.1.134) 2022-01-11 23:50:42,097	INFO torch.py:242 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=5628, ip=10.0.1.134) 2022-01-11 23:50:42,085	INFO torch.py:242 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=44058) 2022-01-11 23:50:42,126	INFO torch.py:242 -- Wrapping provided model in DDP.
(BaseWorkerMixin pid=5629, ip=10.0.1.134) OrderedDict([('module.weight', tensor([[0.7112]], device='cuda:1')), ('module.bias', tensor([0.3216], device='cuda:1'))])
(BaseWorkerMixin pid=5628, ip=10.0.1.134) OrderedDict([('module.weight', tensor([[0.7112]], device='cuda:0')), ('module.bias', tensor([0.3216], device='cuda:0'))])
(BaseWorkerMixin pid=44058) OrderedDict([('module.weight', tensor([[0.7112]], device='cuda:0')), ('module.bias', tensor([0.3216], device='cuda:0'))])
(BaseWorkerMixin pid=44057) OrderedDict([('module.weight', tensor([[0.7112]], device='cuda:1')), ('module.bias', tensor([0.3216], device='cuda:1'))])
Shared connection to 13.56.76.180 closed.

I’ve also used this as an opportunity to create a GitHub issue to upstream this into our docs: [train] add quickstart cluster tutorial · Issue #21541 · ray-project/ray · GitHub

OK, good to know, thanks.
On AWS-managed PyTorch containers in SageMaker I’m basically blocked: with 1.9 I get the error above, and with 1.8 it stays silent for 30 minutes with the GPUs idle. I’ll dig deeper into it and let you know what I find.

Thanks, super useful!

Out of curiosity, do you have Ray Train PyTorch GPU tests on G4 and G5? Those are more recent instances that are popular for training (very cost-efficient); G3 is a bit old.

Yep, at the very least our per-commit tests run on g4dn.12xlarge instances.

For PyTorch 1.8, is the behavior the same if you do CPU-only training, or if you use gloo instead of nccl?

For specifying gloo, you would create a TorchConfig object and pass it into your Trainer like so:

from ray.train import Trainer
from ray.train.torch import TorchConfig

trainer = Trainer(backend=TorchConfig(...), num_workers=2)
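
A more concrete sketch, assuming the Ray 1.9 TorchConfig fields (backend selects the torch.distributed process-group backend; num_workers and use_gpu are arbitrary values for a CPU-only test):

from ray.train import Trainer
from ray.train.torch import TorchConfig

# Force the gloo process-group backend and keep the run on CPU.
trainer = Trainer(backend=TorchConfig(backend="gloo"), num_workers=2, use_gpu=False)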

Trying gloo right now.

By the way, when I run htop during Ray Train with gloo I see a lot of “Ray:IDLE” processes; what are those?

Do we need ray.init()? It’s not mentioned here: Ray Train User Guide — Ray v1.9.2

I can’t manage to run with gloo, even locally on the Notebook instance (where NCCL works). It takes forever (15 minutes idle with no logs), even with batch size 1 and a single iteration, and with the CPU idle.

Oh, actually I just got a result; I guess GPUs make a bigger speed difference than expected vs. CPU on segmentation! Let me try in the remote container now.

Same problem as with NCCL in the SageMaker PyTorch DLC: gloo + Ray is silent for 10 minutes. Not a single log, not even the

(BaseWorkerMixin pid=30232) 2022-01-12 13:31:23,744	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=1]
(BaseWorkerMixin pid=30232) Using cache found in /home/ec2-user/.cache/torch/hub/pytorch_vision_v0.9.1
(BaseWorkerMixin pid=30232) worker rank 0 is using GPU cpu
(BaseWorkerMixin pid=30232) In epoch 0 learning rate: 0.0100000000

that I should be getting after a few seconds… SageMaker doesn’t seem too friendly to Ray Train yet :frowning: I’ll try to get the attention of colleagues on this.

OK, so here are the results of my tests:

Test 1: gloo on the CPUs of a c5.9xlarge instance, PyTorch 1.8.1:

There are 40 minutes between the python script.py invocation and the appearance of the Ray Train logs in CloudWatch ((BaseWorkerMixin pid=199) 2022-01-12 14:01:53,841 INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=4], etc.). After a while it fails with several cryptic errors that, again, didn’t happen on the Notebook instance:

(BackendExecutor pid=221) OSError: [Errno 39] Directory not empty: 'gradle_scripts'

A bit lower I see a
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete

A bit lower I see
(BackendExecutor pid=221) OSError: [Errno 39] Directory not empty: 'videos'

and

(BackendExecutor pid=221) OSError: [Errno 39] Directory not empty: 'android'

Note that this is a fresh, ephemeral EC2 instance, so I have no clue what those directories refer to.

Test 2: gloo on the CPUs of a G4 instance, PyTorch 1.8.1:

Same problem. The Ray Train logs take 30 minutes to appear in CloudWatch after the container logs arrive, and then training errors out with a weird, giant stack trace, including:

(BackendExecutor pid=228) OSError: [Errno 39] Directory not empty: '/root/.cache/torch/hub/vision-0.9.1/'

(BackendExecutor pid=228) OSError: [Errno 39] Directory not empty: 'internal'

(BackendExecutor pid=228) FileNotFoundError: [Errno 2] No such file or directory: '.coveragerc'

etc etc