Thank you @kai for the suggestions.
- What happens if you only use one worker (and one CPU per worker)?
Can you guide me with setting the parameters for this? Should the trainer be modified as shown below, run with `python3 train_tuner.py --num_workers 1 --num_gpus 4`?
```python
trainer = Trainer(
    "torch",
    num_workers=args.num_workers,
    use_gpu=True,
    resources_per_worker={"GPU": args.num_gpus, "CPU": 1},
)
```
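If I understand the one-worker suggestion correctly, this is the minimal change I would try (just a sketch, assuming I should also drop to a single GPU per worker; please correct me if that is not what you meant):

```python
# python3 train_tuner.py --num_workers 1 --num_gpus 1
trainer = Trainer(
    "torch",
    num_workers=1,                              # a single training worker
    use_gpu=True,
    resources_per_worker={"GPU": 1, "CPU": 1},  # one CPU (and one GPU) per worker
)
```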
- What happens if you use `amp=False`?
I have revised the gist: Updated gist with print statements. Now there is an error when I set `amp=True` and another error when I set `amp=False`.
Case 1: `amp=True`

```
(TrainTrainable pid=143785)   File "train_tuner.py", line 168, in train_tuner
(TrainTrainable pid=143785)     model = train.torch.prepare_model(model)
(TrainTrainable pid=143785)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 596, in prepare_model
(TrainTrainable pid=143785)     return get_accelerator(TorchAccelerator).prepare_model(
(TrainTrainable pid=143785)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 122, in prepare_model
(TrainTrainable pid=143785)     assert not hasattr(model, "__getstate__")
```
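For case 1, the assert apparently only checks that the model does not already expose a `__getstate__` attribute before Ray wraps it, so a diagnostic I could add right before `prepare_model` (no behavior change, just to see what triggers the assert) would be:

```python
# Check whether the unwrapped model already defines __getstate__,
# which is what the assert inside prepare_model is complaining about.
print(hasattr(model, "__getstate__"), type(model).__mro__)
model = train.torch.prepare_model(model)
```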
Case 2: `amp=False`

```
(BackendExecutor pid=291030)   File "train_tuner.py", line 193, in train_tuner
(BackendExecutor pid=291030)     x, y = next(iter(trainloader))
(BackendExecutor pid=291030)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 513, in __next__
(BackendExecutor pid=291030)     self._wait_for_batch(next_batch)
(BackendExecutor pid=291030)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 495, in _wait_for_batch
(BackendExecutor pid=291030)     i.record_stream(curr_stream)
(BackendExecutor pid=291030) AttributeError: 'list' object has no attribute 'record_stream'
(TrainTrainable pid=290838)     x, y = next(iter(trainloader))
(TrainTrainable pid=290838)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 513, in __next__
(TrainTrainable pid=290838)     self._wait_for_batch(next_batch)
(TrainTrainable pid=290838)   File "/home/deepkapha/anaconda3/envs/bop/lib/python3.8/site-packages/ray/train/torch.py", line 495, in _wait_for_batch
(TrainTrainable pid=290838)     i.record_stream(curr_stream)
(TrainTrainable pid=290838) AttributeError: 'list' object has no attribute 'record_stream'
```
Focusing on the `amp=False` case first, I added debugging statements:
```python
# before ray wrapping
x, y = next(iter(trainloader))
print(type(x), type(y))
print(x.shape, y)

# required for ray train
trainloader = train.torch.prepare_data_loader(trainloader)
valloader = train.torch.prepare_data_loader(valloader)

# after ray wrapping
x, y = next(iter(trainloader))
print(type(x), type(y))
print(x.shape, y)
```
Output of the print statements in the `amp=False` case (the error itself is shown above):
```
(BaseWorkerMixin pid=261004) <class 'torch.Tensor'> <class 'list'>  # (input type, label type)
(BaseWorkerMixin pid=261004) torch.Size([128, 1, 100, 270]) # input
(BaseWorkerMixin pid=261004) [{'labels': tensor([0, 0, 0, 0]), 'sinusoid': tensor([[0.0200, 0.1924, 0.8860],
(BaseWorkerMixin pid=261004) [0.3400, 0.2136, 0.9099],
(BaseWorkerMixin pid=261004) [0.5400, 0.2116, 0.9133],
(BaseWorkerMixin pid=261004) [0.9800, 0.2248, 0.9295]])}, {'labels': tensor([]), 'sinusoid': tensor([], size=(0, 3))}, {'labels': tensor([]), 'sinusoid': tensor([], size=(0, 3))}, {'labels': tensor([0]), 'sinusoid': tensor([[0.3700, 0.4236, 0.9242]])}, {'labels': tensor([0, 1]), 'sinusoid': tensor([[0.1000, 0.2037, 0.8158],
(BaseWorkerMixin pid=261004) [0.4200, 0.9570, 0.3725]])}, {'labels': tensor([0]), 'sinusoid': tensor([[0.7100, 0.3406, 0.8934]])}, {'labels': tensor([1]), 'sinusoid': tensor([[0.8700, 0.9206, 0.0691]])}, {'labels': tensor([1]), 'sinusoid': tensor([[0.5200, 0.7303, 0.2487]])}, {'labels': tensor([1]), 'sinusoid': tensor([[0.4400, 0.9757, 0.0673]])}, {'labels': tensor([]), 'sinusoid': tensor([], size=(0, 3))}, {'labels': tensor([]), 'sinusoid': tensor([], size=(0, 3))}, {'labels': tensor([0, 0, 0]), 'sinusoid': tensor([[0.4100, 0.2464, 0.8929],
```
Looking into the source code https://github.com/ray-project/ray/blob/029517a037b1219423ab45af79db1e9296bc39c7/python/ray/train/torch.py#L486, it seems both the images and the labels are expected to be tensors, but in my case the inputs are tensors while the labels are a list of dicts of tensors, as shown above.
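As a possible workaround I am considering the following (only a sketch, assuming `prepare_data_loader` accepts `move_to_device` and that `train.torch.get_device()` is the right way to get the worker's device; please correct me if this is not the intended usage):

```python
# Skip Ray's automatic device transfer, since the labels are a list of dicts
# rather than a single tensor, and move the batch to the device manually.
trainloader = train.torch.prepare_data_loader(trainloader, move_to_device=False)
valloader = train.torch.prepare_data_loader(valloader, move_to_device=False)

device = train.torch.get_device()
for x, y in trainloader:
    x = x.to(device)
    y = [{k: v.to(device) for k, v in target.items()} for target in y]
    # ... rest of the training step ...
```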