RaySGD PyTorch fail: "TypeError: can't pickle SSLContext objects"

Lacruche · October 25, 2021, 7:24pm

Hi,

I have a simple PyTorch training loop that works fine on a single GPU of an ml.g4dn.12xlarge SageMaker Notebook instance (4x NVIDIA T4).

I’m following a Ray tutorial to scale it across the 4 GPUs:

from ray.util.sgd.v2 import Trainer

trainer = Trainer(backend="torch", num_workers=4)
trainer.start()
results = trainer.run(train_function)
trainer.shutdown()

I see 4 processes start, but I quickly get this error: “TypeError: can’t pickle SSLContext objects”

Traceback (most recent call last):
  File "train.py", line 289, in <module>
    results = trainer.run(train_single)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 238, in run
    run_dir=self.latest_run_dir,
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 511, in __init__
    train_func, checkpoint, checkpoint_strategy, run_dir=run_dir)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 526, in _start_training
    lambda: self._executor.start_training(
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 537, in _run_with_error_handling
    return func()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/trainer.py", line 531, in <lambda>
    latest_checkpoint_id=latest_checkpoint_id
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/backends/backend.py", line 435, in start_training
    checkpoint=checkpoint_dict))
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/sgd/v2/worker_group.py", line 267, in execute_single_async
    func, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 118, in remote
    return self._remote(args, kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 408, in _start_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 160, in _remote
    return invocation(args, kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 154, in invocation
    num_returns=num_returns)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/actor.py", line 916, in _actor_method_call
    list_args, name, num_returns, self._ray_actor_method_cpus)
  File "python/ray/_raylet.pyx", line 1525, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 1530, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.prepare_args
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 348, in serialize
    return self._serialize_to_msgpack(value)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 328, in _serialize_to_msgpack
    self._serialize_to_pickle5(metadata, python_objects)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 288, in _serialize_to_pickle5
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/serialization.py", line 285, in _serialize_to_pickle5
    value, protocol=5, buffer_callback=writer.buffer_callback)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 580, in dump
    return Pickler.dump(self, obj)
TypeError: can't pickle SSLContext objects

Does this sound familiar to anyone? How can I use Ray to distribute PyTorch training code that works well on a single GPU?

amogkam · December 6, 2021, 6:24pm

Hey @Lacruche, can you run pip install pickle5 and see if that works for you?

Also, are you running this on SageMaker? There have been some pickling issues with Ray on SageMaker in the past; see "HyperParameter search in sagemaker", Issue #13005 in huggingface/transformers on GitHub.

Lacruche · December 10, 2021, 8:16am

Thanks, let me try that when I’m back on this project. And yes, it was on SageMaker.

Lacruche · January 6, 2022, 11:21pm

Just a pip install pickle5? What is it expected to do? I installed it and restarted my notebook, but I still get the TypeError: can’t pickle SSLContext objects.

Lacruche · January 6, 2022, 11:36pm

Here is what I believe happens:

Outside of the train_function, I define my dataset class, which itself creates a boto3 resource with boto3.resource("s3"). When Ray Train launches the data-parallel training, it ships the training function to each worker and tries to serialize the dataset object along with it. Unfortunately, SSLContext objects (which I believe boto3 creates to talk to AWS) are known to have issues with pickle serialization (see the Stack Overflow question "python - How to pickle a ssl.SSLContext object").
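
The failure described above can be reproduced without Ray or boto3. The sketch below is only an illustration under assumptions: S3Dataset and pickling_error are hypothetical names, and a bare ssl.SSLContext stands in for whatever connection state boto3.resource("s3") is believed to hold internally.

```python
import pickle
import ssl


class S3Dataset:
    """Hypothetical stand-in for a dataset that opens a network client in __init__."""

    def __init__(self, bucket):
        self.bucket = bucket
        # Assumption: stands in for boto3.resource("s3"), which is believed
        # to keep an SSLContext somewhere inside its HTTP machinery.
        self.ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)


def pickling_error(obj):
    """Return the message of the error raised while pickling obj, or None on success."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError) as exc:
        return str(exc)
    return None


# A bare SSLContext cannot be pickled, so any object that holds one fails too.
context_error = pickling_error(ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT))
dataset_error = pickling_error(S3Dataset("my-example-bucket"))

# Plain configuration data pickles without trouble.
config_error = pickling_error({"bucket": "my-example-bucket"})

print("SSLContext:", context_error)
print("dataset:   ", dataset_error)
print("config:    ", config_error)
```

The exact wording of the error message differs between Python versions, but in each case it names SSLContext as the object that could not be pickled.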

So I moved the dataset class instantiation into the train_func (so that each Ray worker creates its own boto3 resource), and the problem disappeared.
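
A minimal sketch of that fix, again without Ray or boto3 so it runs anywhere. S3Dataset, the config dictionary, and the bucket name are hypothetical, and the pickle round trip only imitates sending arguments to a worker. How the config actually reaches the training function depends on the Ray version in use.

```python
import pickle
import ssl


class S3Dataset:
    """Hypothetical stand-in for a dataset that opens a network client in __init__."""

    def __init__(self, bucket):
        self.bucket = bucket
        # Assumption: stands in for boto3.resource("s3").
        self.ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)


def train_func(config):
    # The dataset is built here, inside the worker, so its SSLContext
    # is created locally and never has to be serialized.
    dataset = S3Dataset(config["bucket"])
    return {
        "bucket": dataset.bucket,
        "has_context": isinstance(dataset.ssl_context, ssl.SSLContext),
    }


# Only plain, picklable data crosses the process boundary.
config = {"bucket": "my-example-bucket"}
shipped_config = pickle.loads(pickle.dumps(config))

result = train_func(shipped_config)
print(result)
```

The design choice is to pass only the information needed to rebuild the dataset (here a bucket name) and let each worker open its own connection.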

Happy if someone from the Ray team can confirm the above hypothesis - though not urgent as my problem is solved.

amogkam · January 7, 2022, 5:30pm

Ah, great diagnosis! Yes, moving the dataset instantiation inside the train func sounds right!
