Thank you @kai for the suggestions.
- What happens if you only use one worker (and one CPU per worker)?
Can you guide me with setting the parameters for this? Should the trainer be modified as shown below, run with `python3 train_tuner.py --num_workers 1 --num_gpus 4`?
```python
trainer = Trainer(
    "torch",
    num_workers=args.num_workers,
    use_gpu=True,
    resources_per_worker={"GPU": args.num_gpus, "CPU": 1},
)
```
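If I understand the one-worker suggestion correctly, this is the minimal change I would try (just a sketch, assuming I should also drop to a single GPU per worker; please correct me if that is not what you meant):

```python
# python3 train_tuner.py --num_workers 1 --num_gpus 1
trainer = Trainer(
    "torch",
    num_workers=1,                              # a single training worker
    use_gpu=True,
    resources_per_worker={"GPU": 1, "CPU": 1},  # one CPU (and one GPU) per worker
)
```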
- What happens if you use `amp=False`?
I have revised the gist: Updated gist with print statements. Now there is an error when I set `amp=True` and another error when I set `amp=False`.
Case 1: `amp=True`

```
(TrainTrainable pid=143785)   File "train_tuner.py", line 168, in train_tuner
(TrainTrainable pid=143785)     model = train.torch.prepare_model(model)
(TrainTrainable pid=143785)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 596, in prepare_model
(TrainTrainable pid=143785)     return get_accelerator(TorchAccelerator).prepare_model(
(TrainTrainable pid=143785)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 122, in prepare_model
(TrainTrainable pid=143785)     assert not hasattr(model, "__getstate__")
```
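For case 1, the assert apparently only checks that the model does not already expose a `__getstate__` attribute before Ray wraps it, so a diagnostic I could add right before `prepare_model` (no behavior change, just to see what triggers the assert) would be:

```python
# Check whether the unwrapped model already defines __getstate__,
# which is what the assert inside prepare_model is complaining about.
print(hasattr(model, "__getstate__"), type(model).__mro__)
model = train.torch.prepare_model(model)
```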
Case 2: `amp=False`

```
(BackendExecutor pid=291030)   File "train_tuner.py", line 193, in train_tuner
(BackendExecutor pid=291030)     x, y = next(iter(trainloader))
(BackendExecutor pid=291030)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 513, in __next__
(BackendExecutor pid=291030)     self._wait_for_batch(next_batch)
(BackendExecutor pid=291030)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 495, in _wait_for_batch
(BackendExecutor pid=291030)     i.record_stream(curr_stream)
(BackendExecutor pid=291030) AttributeError: 'list' object has no attribute 'record_stream'
(TrainTrainable pid=290838)     x, y = next(iter(trainloader))
(TrainTrainable pid=290838)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 513, in __next__
(TrainTrainable pid=290838)     self._wait_for_batch(next_batch)
(TrainTrainable pid=290838)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 495, in _wait_for_batch
(TrainTrainable pid=290838)     i.record_stream(curr_stream)
(TrainTrainable pid=290838) AttributeError: 'list' object has no attribute 'record_stream'
```
Focusing on the `amp=False` case first, I added debugging statements:
```python
# before ray wrapping
x, y = next(iter(trainloader))
print(type(x), type(y))
print(x.shape, y)

# required for ray train
trainloader = train.torch.prepare_data_loader(trainloader)
valloader = train.torch.prepare_data_loader(valloader)

# after ray wrapping
x, y = next(iter(trainloader))
print(type(x), type(y))
print(x.shape, y)
```
Output of the print statements in the `amp=False` case (the error itself is shown above):
```
(BaseWorkerMixin pid=261004) <class 'torch.Tensor'> <class 'list'>  # (input type, label type)
(BaseWorkerMixin pid=261004) torch.Size([128, 1, 100, 270]) # input
(BaseWorkerMixin pid=261004) [{'labels': tensor([0, 0, 0, 0]), 'sinusoid': tensor([[0.0200, 0.1924, 0.8860],
(BaseWorkerMixin pid=261004) [0.3400, 0.2136, 0.9099],
(BaseWorkerMixin pid=261004) [0.5400, 0.2116, 0.9133],
(BaseWorkerMixin pid=261004) [0.9800, 0.2248, 0.9295]])}, {'labels': tensor([]), 'sinusoid': tensor([], size=(0, 3))}, {'labels': tensor([]), 'sinusoid': tensor([], size=(0, 3))}, {'labels': tensor([0]), 'sinusoid': tensor([[0.3700, 0.4236, 0.9242]])}, {'labels': tensor([0, 1]), 'sinusoid': tensor([[0.1000, 0.2037, 0.8158],
(BaseWorkerMixin pid=261004) [0.4200, 0.9570, 0.3725]])}, {'labels': tensor([0]), 'sinusoid': tensor([[0.7100, 0.3406, 0.8934]])}, {'labels': tensor([1]), 'sinusoid': tensor([[0.8700, 0.9206, 0.0691]])}, {'labels': tensor([1]), 'sinusoid': tensor([[0.5200, 0.7303, 0.2487]])}, {'labels': tensor([1]), 'sinusoid': tensor([[0.4400, 0.9757, 0.0673]])}, {'labels': tensor([]), 'sinusoid': tensor([], size=(0, 3))}, {'labels': tensor([]), 'sinusoid': tensor([], size=(0, 3))}, {'labels': tensor([0, 0, 0]), 'sinusoid': tensor([[0.4100, 0.2464, 0.8929],
```
Looking into the source code https://github.com/ray-project/ray/blob/029517a037b1219423ab45af79db1e9296bc39c7/python/ray/train/torch.py#L486, it seems both the images and the labels are expected to be tensors, but in my case the inputs are tensors while the labels are a list of dicts of tensors, as shown above.
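As a possible workaround I am considering the following (only a sketch, assuming `prepare_data_loader` accepts `move_to_device` and that `train.torch.get_device()` is the right way to get the worker's device; please correct me if this is not the intended usage):

```python
# Skip Ray's automatic device transfer, since the labels are a list of dicts
# rather than a single tensor, and move the batch to the device manually.
trainloader = train.torch.prepare_data_loader(trainloader, move_to_device=False)
valloader = train.torch.prepare_data_loader(valloader, move_to_device=False)

device = train.torch.get_device()
for x, y in trainloader:
    x = x.to(device)
    y = [{k: v.to(device) for k, v in target.items()} for target in y]
    # ... rest of the training step ...
```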