Very slow gradient descent on remote workers

smorad · March 16, 2021, 1:29pm

I am using ray.tune to run ImpalaTrainer trainables. For some reason, gradient descent becomes unbearably slow when using remote workers. However, running with local workers, i.e. ray.init(local_mode=True), does not seem to have this problem. The chart below shows two identical runs launched with ray.tune, one with local_mode=True and one with local_mode=False:

Looking at the relative wall-clock time, it seems like the grad time starts reasonably but keeps increasing on each run:

Config:

base_model = {
    "custom_model": RayObsGraph,
    "custom_model_config": {
    },
    "max_seq_len": 8,
}
rnn_model = {
    "use_lstm": True,
    "max_seq_len": 8,
    "lstm_cell_size": 16
}
    "ray": {
        "env_config": {"dim": 4, "max_items": 4, "max_queries": 4},
        # These are rllib/ray specific
        "framework": "torch",
        "model": grid_search([base_model, rnn_model]),
        "num_workers": 16,
        # Total GPU usage: num_gpus (trainer proc) + num_gpus_per_worker (workers)
        "num_cpus_per_worker": 2,
        "num_envs_per_worker": 1,
        # this corresponds to the number of learner GPUs used,
        # not the total used for the environments/rollouts
        "num_gpus": 1,
        # Size of batches (in timesteps) placed in the learner queue
        "rollout_fragment_length": 16,
        # Total number of timesteps to train per batch
        "train_batch_size": 512,
        "lr": 0.0001,
        "env": RecallEnv.__name__,
    },

I use a custom TorchModelV2 model that utilizes state, so I’m wondering if this has to do with computing gradients from past states.

smorad · March 17, 2021, 5:44pm

I plotted the computation graph and reduced the complexity of my backward pass. Replacing
out = torch.cat([gnn_out[b, node_idx[b]] for b in range(batch)])
with
gnn_out[torch.arange(B, device=flat.device), node_idx.squeeze()]
reduces the gradient time by a factor of 10. However, running ray with ray.init(local_mode=True) is still much faster.
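For readers who want to check that the two indexing forms are interchangeable, here is a minimal sketch. The shapes (`B`, `N`, `F`) and the random inputs are assumptions for illustration; only `gnn_out` and `node_idx` are names from the post, and the real model may use different layouts.

```python
import torch

# Illustrative sizes (assumptions): batch, nodes per graph, features per node.
B, N, F = 4, 5, 3
gnn_out = torch.randn(B, N, F, requires_grad=True)
node_idx = torch.randint(0, N, (B, 1))  # one selected node per batch element

# Loop version: one indexing op per batch element plus a concatenation,
# so the autograd graph grows with the batch size.
looped = torch.cat([gnn_out[b, node_idx[b]] for b in range(B)])

# Vectorized version: a single advanced-indexing op for the whole batch.
vectorized = gnn_out[torch.arange(B, device=gnn_out.device), node_idx.squeeze(-1)]

# Both forms select the same (B, F) slice of node features.
assert looped.shape == vectorized.shape
assert torch.equal(looped, vectorized)
```

The vectorized form records one node in the backward graph instead of one per batch element, which is consistent with the reduction in gradient time reported above.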

Logit computation graph for posterity:


smorad · April 1, 2021, 1:50pm

To anyone else who runs into this issue: I seem to have solved the problem by decreasing the number of workers from 10. I have ample memory free, so it isn’t a disk swapping issue.

smorad · April 5, 2021, 6:12pm

The issue is not the gradients or the computation graph, but the loss computation. I’ve narrowed the problem down to this call in torch_policy.py, line 434:

loss_out = force_list(
            self._loss(self, self.model, self.dist_class, train_batch))

This occurs both with and without vtrace.

Looking at A3CTorchPolicy, the slowdown occurs in model.from_batch(train_batch), where it calls model.__call__ with the sample batch. I will investigate further.

eoakes · April 5, 2021, 9:21pm

@sven1977 please take a look

smorad · May 3, 2021, 12:12pm

Just to follow up in case others run into this issue: the issue seems to be with pytorch, probably due to the GPU scheduler. I’ve found that at some point, the models will simply experience a 10-100x increase in forward pass time. I dropped into the debugger when the slowdown occurred and fed zeros through the network to verify this. Flushing the torch GPU cache, upgrading to torch-1.8.2, and various other approaches do not appear to fix the issue.
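A check along these lines can be sketched as follows. This is not the code used in the post: `time_forward`, the `torch.nn.Linear` placeholder model, and the input shape are all assumptions for illustration. The `torch.cuda.synchronize()` calls matter on GPU because kernels launch asynchronously, so timing without them can under-report the forward pass.

```python
import time

import torch


def time_forward(model, example_input, repeats=10):
    """Return the average wall-clock time (seconds) of one forward pass."""
    with torch.no_grad():
        if example_input.is_cuda:
            torch.cuda.synchronize()  # finish pending GPU work before timing
        start = time.perf_counter()
        for _ in range(repeats):
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()  # wait for the timed kernels to finish
        end = time.perf_counter()
    return (end - start) / repeats


# Placeholder model and all-zero input (assumptions, not the thread's model).
model = torch.nn.Linear(8, 4)
zeros = torch.zeros(32, 8)
elapsed = time_forward(model, zeros)
print(f"average forward pass: {elapsed * 1e3:.3f} ms")
```

Running the same measurement before and after the slowdown appears gives a direct comparison that does not depend on the rollout data.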

I’ve found this issue only occurs if the ray trainers get more rollouts than they can process. If the rollout queue is quickly emptied and remains empty most of the time, this issue does not seem to occur.

TL;DR If you run into this issue, either decrease the number of workers/envs to reduce the rate at which rollouts are produced, or make your model more efficient.
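As a rough sketch of the first option, the snippet below lowers `num_workers` in a config shaped like the one at the top of the thread. The helper names and the target value of 4 are hypothetical, and the per-round estimate is a back-of-the-envelope assumption (every worker returns one fragment per round), not a measurement.

```python
# Keys taken from the config earlier in the thread; values as posted there.
original = {
    "num_workers": 16,
    "num_envs_per_worker": 1,
    "rollout_fragment_length": 16,
    "train_batch_size": 512,
}


def with_fewer_workers(config, num_workers):
    """Return a copy of the config with a lower rollout worker count."""
    throttled = dict(config)
    throttled["num_workers"] = num_workers
    return throttled


def steps_per_sampling_round(config):
    """Rough timesteps produced when every worker returns one fragment."""
    return (
        config["num_workers"]
        * config["num_envs_per_worker"]
        * config["rollout_fragment_length"]
    )


throttled = with_fewer_workers(original, num_workers=4)  # 4 is illustrative
print(steps_per_sampling_round(original), steps_per_sampling_round(throttled))
```

Under these assumptions the learner receives a quarter as many timesteps per round, which makes it more likely that the rollout queue drains between rounds.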

Bam4d · May 3, 2021, 12:44pm

@sven1977 this is the same issue as in this discussion: [RLlib] Ray trains extremely slow when learner queue is full

sven1977 · May 4, 2021, 7:23am

Very cool, thanks so much for digging into this @smorad and finding the bug on the torch end! And for updating the posts with the links @Bam4d !

Bam4d · May 29, 2021, 9:57am

@sven1977 I have done some more looking into this and I have come up with a workaround that may be the solution to this issue.

Basically, what I think is happening is that the pytorch GPU scheduler needs to interrupt a thread to read, write, or execute GPU instructions, and to do this the thread needs to be in an interruptible state. In the rllib code, the Python thread actually spinlocks (not interruptible) if the queue is full, for example in ray/concurrency_ops.py (ray-project/ray, master):

    def __call__(self, x: Any) -> Any:
        try:
            self.queue.put_nowait(x)
        except queue.Full:
            return _NextValueNotReady()
        return x

The trick I found that works is to put a tiny “sleep(0.0001)” in the except block. This actually sleeps and releases the thread at the OS level (which is interruptible), and the problem just disappears. I think doing a put with a timeout might also work.

I don’t know if there are many more places where spinlocks are being used to wait on queues, but this is generally bad practice except in real-time-critical code (which this is not).

I’m not entirely sure whether this causes the workers to “run slower”, though. There would need to be a backpressure mechanism to avoid memory leaks where the environments just keep creating more and more data.

I’ll have more of a play with this today and see if I can get a PR out.

smorad · May 29, 2021, 6:51pm

time.sleep(0) should yield without adding any additional delays: https://stackoverflow.com/a/790246

I think adding this before the return with a comment explaining why would be an ideal solution. If the queue is full, then the thread should yield its quantum to a thread that can actually make use of the computation.

Bam4d · May 31, 2021, 9:29am

Ah this is great, I did not know this

I’ll try this out and see if it works.
If it works, I’ll link the PR here.

Bam4d · May 31, 2021, 3:33pm

So the sleep(0) does not work but a 1ms sleep works perfectly.

The only issue is that the sample throughput is calculated incorrectly, as the timer does not take into account that if the queues are full there is practically no wait time for subsequent batches. I’ve treated this as a different bug, so I have not included a fix for it here.

github.com/ray-project/ray

[rllib] Remove bad spinlocks to allow pytorch GPU scheduler to interrupt.

ray-project:master ← Bam4d:spinlocks

opened 03:22PM - 31 May 21 UTC

Bam4d

+11 -4

## Why are these changes needed?

Discussion of this bug and the fix can be found here: https://discuss.ray.io/t/very-slow-gradient-descent-on-remote-workers/1278 and here: https://discuss.ray.io/t/rllib-ray-trains-extremely-slow-when-learner-queue-is-full/289/6

Summary: pytorch GPU scheduler breaks and defaults to using CPU if sample queues in the IMPALA implementation are full. When the sample queues are full, the worker threads spinlock, which does not allow the [thread to be interrupted and used in the GPU scheduler](https://stackoverflow.com/questions/60086108/forward-pass-gets-10000x-slower-after-iterating-for-a-while).

Fix: Instead of spinlocking, introduce a small sleep (which allows the thread to be interrupted). This also stops wasting CPU time spinning. By default, if the `_NextValueNotReady(Exception)` class is used to mark queues that are full, then a small sleep is executed. This small wait fixes the bug and keeps the GPU running at full capacity!

@smorad I did try `sleep(0)`, however this did not seem to solve the issue. I think the small wait is a good thing in this case, as it allows the CPU to rest, and at this point there are already 16 samples per queue ready to go, so it will not slow down learning at all.

@sven1977 Please note: This does actually introduce a bug in that the "sample_throughput" calculation only measures the throughput through the "ConcatBatches" part of the pipeline, which is tiny if the queues are full. Therefore the sample throughput time is completely miscalculated, as it does not need to wait at all for new batches.

## Related issue number

Issues are in discuss.ray.io listed above.

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

sven1977 · June 1, 2021, 2:40pm

Awesome, this has been merged!

sven1977 · June 1, 2021, 2:41pm

Thanks @Bam4d and @smorad for your invaluable help here!

sven1977 · June 8, 2021, 7:37am

Actually, @Bam4d @smorad, this PR caused our TD3 tests (agents/ddpg/tests/test_td3.py) to time out very often now.

Try this one here:

github.com/ray-project/ray

[RLlib] Fix PR 16162: Having added sleep to `_NextValueNotReady` causes TD3 tests to become flakey.

ray-project:master ← sven1977:fix_16162_spinlock_td3_problem

opened 07:54AM - 08 Jun 21 UTC

sven1977

+4 -8

This PR offers a fix for PR 16162's problem of causing TD3 tests to fail/timeout often. PR 16162 added a sleep to `_NextValueNotReady`, which causes TD3 tests to become flakey.

- This PR removes that sleep from the c'tor of _NextValueNotReady, but keeps the sleep in place inside IMPALA's learner_thread such that the original issue should remain fixed.
- See also the discussion here: https://discuss.ray.io/t/very-slow-gradient-descent-on-remote-workers/1278/9

## Why are these changes needed?

## Related issue number

## Checks

- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

I think the sleep should only be added to learner_thread-using algos (e.g. IMPALA), not in general to the _NextValueNotReady exception.

@smorad @Bam4d , could you let me know, whether you are still seeing the slowdown even with this new PR? This also seems to fix TD3 flakiness/timeouts, which was probably due to the added sleep.
