RaySGD PyTorch fail: "TypeError: can't pickle SSLContext objects"


I have a simple PyTorch for loop working fine on a single GPU of an ml.g4dn.12xlarge SageMaker Notebook instance (4x NVIDIA T4).

I’m following the Ray tutorial to scale it over 4 GPUs:

from ray.util.sgd.v2 import Trainer

trainer = Trainer(backend="torch", num_workers=4)
results = trainer.run(train_function)

I see 4 processes starting but quickly get this error: “TypeError: can’t pickle SSLContext objects”

Traceback (most recent call last):
  File "train.py", line 289, in <module>
    results = trainer.run(train_single)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 238, in run
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 511, in __init__
    train_func, checkpoint, checkpoint_strategy, run_dir=run_dir)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 526, in _start_training
    lambda: self._executor.start_training(
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 537, in _run_with_error_handling
    return func()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 531, in <lambda>
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/backends/backend.py", line 435, in start_training
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/worker_group.py", line 267, in execute_single_async
    func, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 118, in remote
    return self._remote(args, kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 408, in _start_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 160, in _remote
    return invocation(args, kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 154, in invocation
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 916, in _actor_method_call
    list_args, name, num_returns, self._ray_actor_method_cpus)
  File "python/ray/_raylet.pyx", line 1525, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 1530, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.prepare_args
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 348, in serialize
    return self._serialize_to_msgpack(value)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 328, in _serialize_to_msgpack
    self._serialize_to_pickle5(metadata, python_objects)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 288, in _serialize_to_pickle5
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 285, in _serialize_to_pickle5
    value, protocol=5, buffer_callback=writer.buffer_callback)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 580, in dump
    return Pickler.dump(self, obj)
TypeError: can't pickle SSLContext objects

Does this sound familiar to anyone? How can I use Ray to distribute PyTorch training code that works well on one GPU?

Hey @Lacruche can you do a pip install pickle5 and see if that works for you?

Also, are you running this on SageMaker? There have been some pickling issues with using Ray on SageMaker in the past; see "HyperParameter search in sagemaker" (Issue #13005, huggingface/transformers, GitHub).

Thanks, let me try that when I’m back on this project. And yes, it was on SageMaker.

Just a pip install pickle5? What is it expected to do? I installed it and restarted my notebook but still get the TypeError: can’t pickle SSLContext objects.

Here is what I believe happens:

Outside of train_function I define my dataset class and instantiate it; the dataset itself creates a boto3 resource with boto3.resource("s3"). When Ray Train launches the data-parallel training, it serializes the training function, along with the objects it references, to send it to the workers, and that includes the dataset instance. Unfortunately, SSLContext objects (which I believe boto3 creates to talk to AWS) cannot be pickled (python - How to pickle a ssl.SSLContext object - Stack Overflow).

So I moved the dataset class instantiation to the train_func (so that different boto3 resources are created in each Ray worker), and the problem disappeared.
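
For anyone hitting the same error, here is a minimal sketch of that change using only the standard library, so it runs without Ray or AWS credentials. S3LikeDataset and can_pickle are hypothetical names made up for this illustration, a bare ssl.SSLContext stands in for whatever SSL state boto3 holds, and plain pickle stands in for the cloudpickle-based serialization Ray uses, so treat it as an approximation of the real setup rather than a reproduction of it.

```python
import pickle
import ssl


class S3LikeDataset:
    """Hypothetical stand-in for a dataset that keeps a network client."""

    def __init__(self):
        # A bare SSLContext plays the role of the SSL state inside a boto3 resource.
        self.ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        self.keys = ["a", "b", "c", "d"]

    def __len__(self):
        return len(self.keys)


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False if pickling fails."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# Before: the dataset is built on the driver, so it has to be serialized
# before it can reach the workers, and that fails.
driver_side_dataset = S3LikeDataset()
print("driver-side dataset picklable:", can_pickle(driver_side_dataset))


# After: the dataset is built inside the training function, so each worker
# creates its own client and nothing unpicklable has to be shipped.
def train_func():
    dataset = S3LikeDataset()
    return len(dataset)


print("items seen by train_func:", train_func())
```

In the real script the same idea applies: pass only picklable things (bucket names, keys, config values) into the training function and create the boto3 resource and dataset there.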

Happy if someone from the Ray team can confirm the above hypothesis - though not urgent as my problem is solved.

Ah great diagnosis! Yes moving the dataset instantiation to inside the train func sounds right!
