I am using ray.tune to run ImpalaTrainer trainables. For some reason, gradient descent becomes unbearably slow when using remote workers. However, local workers (ray.init(local_mode=True)) do not seem to have this problem. The chart below shows two identical runs, one with local_mode=True and one with loca…

I plotted the computation graph and reduced the complexity of my backward pass. Replacing out = torch.cat([gnn_out[b, node_idx[b]] for b in range(batch)]) with gnn_out[torch.arange(B, device=flat.device), node_idx.squeeze()] reduces the gradient time by a factor of 10. However, running ray w…
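For readers who want to see the two indexing styles side by side, here is a minimal sketch. It assumes gnn_out has shape [B, N, F] and node_idx has shape [B, 1]; those shapes, the function names, and the use of squeeze(-1) instead of squeeze() are illustrative assumptions, not taken from the original model.

```python
import torch


def gather_loop(gnn_out, node_idx):
    # One indexing op per batch element, then a concat: the autograd graph
    # grows with the batch size.
    batch = gnn_out.shape[0]
    return torch.cat([gnn_out[b, node_idx[b]] for b in range(batch)])


def gather_vectorized(gnn_out, node_idx):
    # A single advanced-indexing op: row b picks node node_idx[b].
    # squeeze(-1) (an assumption) keeps the batch dimension even when B == 1.
    B = gnn_out.shape[0]
    rows = torch.arange(B, device=gnn_out.device)
    return gnn_out[rows, node_idx.squeeze(-1)]


if __name__ == "__main__":
    torch.manual_seed(0)
    gnn_out = torch.randn(4, 6, 8, requires_grad=True)  # assumed [B, N, F]
    node_idx = torch.randint(0, 6, (4, 1))              # assumed [B, 1]
    looped = gather_loop(gnn_out, node_idx)
    vectorized = gather_vectorized(gnn_out, node_idx)
    print(torch.equal(looped, vectorized))
```

Both functions select the same rows; the vectorized form just records one node in the computation graph instead of one per batch element.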

To anyone else who runs into this issue: I seem to have solved the problem by decreasing the number of workers from 10. I have ample memory free, so it isn’t a disk swapping issue.

The issue is not the gradients or the computation graph, but the loss computation. I’ve narrowed the problem down to torch_policy.py line 434: loss_out = force_list(self._loss(self, self.model, self.dist_class, train_batch)). This occurs both with and without vtrace. …

Just to follow up in case others run into this issue: the issue seems to be with PyTorch, probably due to the GPU scheduler. I’ve found that at some point, the models will simply experience a 10-100x increase in forward pass time. I’ve dropped into the debugger when the slowdown occurred and fed zeros t…

Very cool, thanks so much for digging into this @smorad and finding the bug on the torch end! And for updating the posts with the links @Bam4d !

@sven1977 I have done some more looking into this and I have come up with a workaround that may be the solution to this issue. Basically, what I think is happening is that the PyTorch GPU scheduler needs to interrupt a thread to either read/write/execute GPU instructions. And to do this the thread needs…

time.sleep(0) should yield without adding any additional delays: https://stackoverflow.com/a/790246 I think adding this before the return with a comment explaining why would be an ideal solution. If the queue is full, then the thread should yield its quantum to a thread that can actually make use o…
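As a rough illustration of the time.sleep(0) idea, here is a standard-library sketch of a producer that yields its time slice when a bounded queue is full. This is not RLlib's learner-thread code; try_enqueue and run_demo are hypothetical names made up for this example.

```python
import queue
import threading
import time


def try_enqueue(q, item):
    """Hypothetical helper: enqueue if there is room, otherwise yield."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        # Give up the rest of this thread's quantum so another thread
        # (here, the consumer) can run. sleep(0) requests no extra delay.
        time.sleep(0)
        return False


def run_demo(n_items=50, maxsize=2):
    q = queue.Queue(maxsize=maxsize)
    consumed = []

    def consumer():
        for _ in range(n_items):
            consumed.append(q.get())

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n_items):
        # Retry until the item fits; each failed attempt yields first.
        while not try_enqueue(q, i):
            pass
    worker.join()
    return consumed
```

Calling run_demo() returns the items in the order they were produced. The point of the design is that a full queue makes the producer step aside instead of spinning and holding the interpreter.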

Ah, this is great, I did not know this :smiley: I’ll try this out and see if it works. If it works, I’ll link the PR here.

Very slow gradient descent on remote workers

Bam4d May 3, 2021, 12:44pm 7

@sven1977 this is the same issue as in this discussion: [RLlib] Ray trains extremely slow when learner queue is full
