Ray Train creates TypeError: 'generator' object is not subscriptable

Hi,

I have a PyTorch script that runs fine on a single GPU, and I’m trying to parallelize it with Ray Train.

I followed the tutorial Ray Train: Distributed Deep Learning — Ray v1.9.1 and added a train_loader = train.torch.prepare_data_loader(train_loader) call and a train.torch.prepare_model(model) call.
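
Roughly, the relevant part of my train_func looks like this (a simplified sketch of my script; the dataset, model, batch size, and epoch count are placeholders for the real ones):

import torch
from torch.utils.data import DataLoader
from ray import train
import ray.train.torch  # makes train.torch.prepare_* available

def train_func():
    # built exactly as in the single-GPU version of the script
    train_loader = DataLoader(a2d2_dataset, batch_size=batch_size, shuffle=True)
    model = build_segmentation_model()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # the two lines added following the Ray Train tutorial
    train_loader = train.torch.prepare_data_loader(train_loader)
    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batch in train_loader:
            inputs = batch[0].to(device)   # the line that now fails (line 163 in the trace below)
            masks = batch[1].to(device)
            ...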

But now I’m getting this:

(BackendExecutor pid=38836)   File "a2d2_code/train_ray.py", line 163, in train_func
(BackendExecutor pid=38836)     inputs = batch[0].to(device)
(BackendExecutor pid=38836) TypeError: 'generator' object is not subscriptable

I’m surprised: how did batch become a generator? In my plain PyTorch script this line worked fine. (I need that [0] because my dataset’s __getitem__ returns the pair torch.div(image, 255), mask.type(torch.int64).)
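
For reference, the dataset’s __getitem__ is essentially this (simplified; _load_pair stands in for my actual loading code):

import torch
from torch.utils.data import Dataset

class A2D2Dataset(Dataset):
    ...
    def __getitem__(self, idx):
        # _load_pair (placeholder) returns a uint8 image tensor and an integer mask tensor
        image, mask = self._load_pair(idx)
        return torch.div(image, 255), mask.type(torch.int64)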

Does Ray’s train.torch.prepare_data_loader(train_loader) transform the behavior of the data pipeline, and does it require specific caution or a contract in how we write datasets, data loaders, and training loops?

Hmm, the observed behavior is indeed strange; prepare_data_loader shouldn’t modify the underlying data.

One thing that stands out to me in the Python trace is that the error occurs in BackendExecutor, whereas train_func is expected to be executed by BaseWorkerMixin. Could you share more of the trace, or the script that was executed?


Thanks! I restarted my machine, and for some reason the problem disappeared…
I now have another problem :slight_smile: Ray Train silent for 7 min

Actually @matthewdeng, when I use num_workers=1 the problem comes back; full trace below:

(BaseWorkerMixin pid=48664) 2022-01-07 15:44:27,887	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=1]
2022-01-07 15:44:28,137	INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-07_15-44-23/run_001
(BaseWorkerMixin pid=48664) Downloading: "https://github.com/pytorch/vision/archive/v0.9.1.zip" to /home/ec2-user/.cache/torch/hub/v0.9.1.zip
(BaseWorkerMixin pid=48664) 2022-01-07 15:44:33,424	INFO torch.py:239 -- Moving model to device: cuda:0
(BaseWorkerMixin pid=48664) In epoch 0 learning rate: 0.0100000000
Traceback (most recent call last):
  File "a2d2_code/train_ray.py", line 221, in <module>
    trainer.run(train_func)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 281, in run
    for intermediate_result in iterator:
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 651, in __next__
    self._finish_training)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 620, in _run_with_error_handling
    return func()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 721, in _finish_training
    return ray.get(self._backend_executor_actor.finish_training.remote())
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::BackendExecutor.finish_training() (pid=48669, ip=172.16.59.122, repr=<ray.train.backend.BackendExecutor object at 0x7fc1688c6550>)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 507, in finish_training
    results = self.get_with_failure_handling(futures)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 526, in get_with_failure_handling
    success, failed_worker_indexes = check_for_failure(remote_values)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 42, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(TypeError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=48664, ip=172.16.59.122, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7f896b256dd0>)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/worker_group.py", line 26, in __execute
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 498, in end_training
    output = session.finish()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/session.py", line 102, in finish
    func_output = self.training_thread.join()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 94, in join
    raise self.exc
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 87, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "a2d2_code/train_ray.py", line 170, in train_func
    inputs = batch[0].to(device)
TypeError: 'generator' object is not subscriptable
CPU times: user 324 ms, sys: 70.5 ms, total: 394 ms
Wall time: 27.8 s

When running this with just 1 worker, what happens if you remove the prepare_data_loader step?

Generally this should be equivalent to just running your original train_func in a separate process.

Then I get a different error (one that I didn’t have without Ray Train):

(BaseWorkerMixin pid=6809) 2022-01-07 17:07:20,564	INFO torch.py:67 -- Setting up process group for: env:// [rank=0, world_size=1]
2022-01-07 17:07:21,057	INFO trainer.py:178 -- Run results will be logged in: /home/ec2-user/ray_results/train_2022-01-07_17-07-16/run_001
(BaseWorkerMixin pid=6809) Using cache found in /home/ec2-user/.cache/torch/hub/pytorch_vision_v0.9.1
(BaseWorkerMixin pid=6809) 2022-01-07 17:07:21,880	INFO torch.py:239 -- Moving model to device: cuda:0
(BaseWorkerMixin pid=6809) In epoch 0 learning rate: 0.0100000000
Traceback (most recent call last):
  File "a2d2_code/train-ray.py", line 211, in <module>
    results = trainer.run(train_func)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 281, in run
    for intermediate_result in iterator:
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 651, in __next__
    self._finish_training)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 620, in _run_with_error_handling
    return func()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 721, in _finish_training
    return ray.get(self._backend_executor_actor.finish_training.remote())
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::BackendExecutor.finish_training() (pid=6793, ip=172.16.59.122, repr=<ray.train.backend.BackendExecutor object at 0x7f85cbf9de50>)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 507, in finish_training
    results = self.get_with_failure_handling(futures)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 526, in get_with_failure_handling
    success, failed_worker_indexes = check_for_failure(remote_values)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 42, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=6809, ip=172.16.59.122, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7fe9a4080690>)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/worker_group.py", line 26, in __execute
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 498, in end_training
    output = session.finish()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/session.py", line 102, in finish
    func_output = self.training_thread.join()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 94, in join
    raise self.exc
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 87, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "a2d2_code/train-ray.py", line 168, in train_func
    outputs = model(inputs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torchvision/models/segmentation/_utils.py", line 19, in forward
    features = self.backbone(x)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torchvision/models/_utils.py", line 63, in forward
    x = module(x)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 396, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

@matthewdeng I also tried to replace

inputs = batch[0]
masks = batch[1]

with inputs, masks = batch (I use this dataset class here, which returns the two objects inputs, masks).

But then I get a device type mismatch:

Traceback (most recent call last):
  File "a2d2_code/train-ray.py", line 213, in <module>
    results = trainer.run(train_func)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 281, in run
    for intermediate_result in iterator:
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 651, in __next__
    self._finish_training)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 620, in _run_with_error_handling
    return func()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/trainer.py", line 721, in _finish_training
    return ray.get(self._backend_executor_actor.finish_training.remote())
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::BackendExecutor.finish_training() (pid=14314, ip=172.16.59.122, repr=<ray.train.backend.BackendExecutor object at 0x7ff17e0ab110>)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 507, in finish_training
    results = self.get_with_failure_handling(futures)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 526, in get_with_failure_handling
    success, failed_worker_indexes = check_for_failure(remote_values)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 42, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::BaseWorkerMixin._BaseWorkerMixin__execute() (pid=14251, ip=172.16.59.122, repr=<ray.train.worker_group.BaseWorkerMixin object at 0x7fd478e49890>)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/worker_group.py", line 26, in __execute
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/backend.py", line 498, in end_training
    output = session.finish()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/session.py", line 102, in finish
    func_output = self.training_thread.join()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 94, in join
    raise self.exc
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/ray/train/utils.py", line 87, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "a2d2_code/train-ray.py", line 187, in train_func
    val_loss = CE(outputs["out"], masks)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1048, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 2693, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 2390, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #2 'target' in call to _thnn_nll_loss2d_forward

Are you sure we need to delete the code that copies tensors to the device when using Ray Train, as done here?

@Lacruche the problem is that in prepare_data_loader we wrap your dataloader to automatically move the data to the right device so that you don’t have to: ray/torch.py at master · ray-project/ray · GitHub.
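
Conceptually, the wrapper does something like this (an illustrative sketch only, not Ray’s actual implementation):

class DeviceMovingDataLoader:
    """Illustrative only: iterates a wrapped DataLoader and moves each batch to a target device."""

    def __init__(self, data_loader, device):
        self._data_loader = data_loader
        self._device = device

    def __len__(self):
        return len(self._data_loader)

    def __iter__(self):
        for batch in self._data_loader:
            # move each tensor in the (inputs, targets) pair onto the device
            yield tuple(t.to(self._device) for t in batch)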

However, this wrapped data loader is only iterable and not indexable. It should be pretty straightforward to add a __getitem__ method to the wrapped dataloader to support this. I will make a PR for this.

Alternatively, you can skip prepare_data_loader entirely. But if you do, you are responsible for moving the data to the correct device and adding the DistributedSampler in your training function yourself.
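
For example, roughly (a minimal sketch; the dataset, batch size, and the rest of the loop are placeholders):

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from ray import train
import ray.train.torch

def train_func():
    # shard the dataset across the Ray Train workers yourself
    sampler = DistributedSampler(my_dataset)
    loader = DataLoader(my_dataset, batch_size=batch_size, sampler=sampler)

    device = train.torch.get_device()   # the device assigned to this worker
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)        # so each epoch gets a different shuffle
        for inputs, masks in loader:
            inputs, masks = inputs.to(device), masks.to(device)
            ...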

Sorry, ignore what I said; I misunderstood your code.

You are iterating through the DataLoader, but there is a bug in our _WrappedDataLoader where we return a generator instead of a tuple as intended: ray/torch.py at master · ray-project/ray · GitHub. So indexing into the batch will fail. I will make a PR to fix this, thanks for pointing this out!
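
A minimal illustration of the difference (independent of Ray):

import torch

inputs, masks = torch.zeros(2, 3), torch.zeros(2)

batch = (t for t in (inputs, masks))        # a generator, which is what the buggy wrapper yields
# batch[0]                                  # -> TypeError: 'generator' object is not subscriptable
first, second = batch                       # unpacking still works, since it only iterates

batch = tuple(t for t in (inputs, masks))   # what was intended: a tuple
first = batch[0]                            # indexing works again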


I changed how I extract the data from the batch object (thanks Antoni Yard1 Baum for the help on Slack), and I now manually move the data to the right GPU:

BEFORE (not working)

train_loader = train.torch.prepare_data_loader(train_loader)
inputs = batch[0]
masks = batch[1]

AFTER (working)

train_loader = train.torch.prepare_data_loader(train_loader, move_to_device=False)
device = train.torch.get_device()

inputs, masks = batch
inputs = inputs.to(device)
masks = masks.to(device)
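
Putting it together, the working loop now follows this pattern (simplified; the model, loss, and optimizer are the ones from my original script):

from ray import train
import ray.train.torch

def train_func():
    loader = train.torch.prepare_data_loader(train_loader, move_to_device=False)
    model = train.torch.prepare_model(model)
    device = train.torch.get_device()

    for epoch in range(num_epochs):
        for batch in loader:
            inputs, masks = batch                    # unpack instead of indexing
            inputs, masks = inputs.to(device), masks.to(device)

            outputs = model(inputs)
            loss = criterion(outputs["out"], masks)  # torchvision segmentation models return a dict
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()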

Great, that works as well!