WandbLoggerCallback easily hits wandb API rate limits

I’m using the WandbLoggerCallback with Ray Tune and RLlib. On bigger Ray clusters, running many experiments in parallel, this easily hits the rate limit on the wandb backend API, due to too many requests coming in at once.

The generic advice from wandb is to call wandb.log() less often, which I am already doing by making the reporting interval in RLlib longer. But at this point I’m down to logging only 100 times total per experiment; I can’t really go any lower and still get useful learning curves out of my experiments.
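For context, this is roughly my setup. Treat it as a sketch: the import paths and the reporting option have moved between Ray versions, and `"my-project"` and the interval value are placeholders.

```python
# Sketch of a Tune + RLlib run with the wandb callback and a longer
# reporting interval (import paths differ across Ray versions).
from ray import air, tune
from ray.air.integrations.wandb import WandbLoggerCallback

tuner = tune.Tuner(
    "PPO",
    param_space={
        "env": "CartPole-v1",
        # Report (and hence call wandb.log) at most once per minute.
        "min_time_s_per_iteration": 60,  # assumption: available in your RLlib version
    },
    run_config=air.RunConfig(
        callbacks=[WandbLoggerCallback(project="my-project")],  # placeholder project
    ),
)
results = tuner.fit()
```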

So my question: Is there a way to just sync less often? I’m perfectly happy to get my data to wandb with a delay, e.g. for the client to batch 10 reporting intervals together instead of sending each individually, so long as the data for each individual interval is still in there. Is this possible? Thank you!
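To make the batching idea concrete, here is a small pure-Python sketch of the behavior I mean. The `BatchingLogger` class is hypothetical, not a real wandb API: it buffers several intervals and syncs them all in one go, so each interval’s data still arrives, just later.

```python
# Hypothetical client-side batching: accumulate results from several
# reporting intervals and flush them together in one sync.
class BatchingLogger:
    def __init__(self, flush_every=10):
        self.flush_every = flush_every
        self.buffer = []   # (step, metrics) pairs not yet sent
        self.synced = []   # stands in for data that reached the backend

    def log(self, step, metrics):
        self.buffer.append((step, metrics))
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        # One "connection" sends every buffered interval, so nothing is
        # dropped; the data is only delayed.
        self.synced.extend(self.buffer)
        self.buffer.clear()

logger = BatchingLogger(flush_every=10)
for step in range(25):
    logger.log(step, {"reward": float(step)})

# 25 log calls produce just 2 syncs; 5 entries wait for the next flush.
print(len(logger.synced), len(logger.buffer))  # 20 5
```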

Hey @mgerstgrasser, unfortunately it looks like W&B does not provide an API to control the syncing interval, or to batch multiple results on the client side and then log them with a single wandb.log call.

From their docs, wandb is supposed to detect when it’s being rate limited and automatically back off and retry: Limits & Performance - Documentation.

I would contact the W&B team or post on their forum if this exponential backoff is not working for you.

@amogkam I’ve managed to talk to W&B support in the meantime, and it seems I had misunderstood their documentation. Their client already does exactly what it should: if it hits the rate limit, it sends data less frequently. I had interpreted their exponential backoff description to mean that the client would still only send one log entry at a time (and so accumulate a backlog if you log faster than their API rate limit), but their support tells me the client always sends all the accumulated log entries the next time it connects to their backend.

So all good. :slight_smile:

Wanted to chime in here. Every call to wandb.log is actually queued up before it’s sent to the server. At the beginning of an experiment we stream batched data every couple of seconds, and after a few minutes we stream once every 30 seconds.

Rate limits happen when a large number of concurrent experiments stream data at the same time. This can be more pronounced if each experiment is very short, since we stream more often in that case.

As @mgerstgrasser correctly described, when a client receives a 429 we enter an exponential backoff, which never blocks any of your scripts and only results in data being a bit more delayed in getting to the server.
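For readers unfamiliar with the pattern, a generic exponential-backoff-with-jitter schedule looks like the sketch below. The base, cap, and jitter strategy are illustrative assumptions, not W&B’s actual constants.

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=6):
    """Wait times (seconds) before each retry after a 429 response.

    Uses "full jitter": each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so retries spread out and the
    average wait doubles with each consecutive rate-limited attempt.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

# Six randomized waits whose upper bounds grow 1s, 2s, 4s, 8s, 16s, 32s.
print(backoff_delays())
```

Because the waiting happens in the background uploader rather than in the training process, the script itself never blocks; the logged data simply arrives later.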

I’d be curious to understand what scenario is actually causing requests to be rate limited in this case, but regardless we hope the impact is minimal.

@vanpelt Thank you so much for clarifying. I think what confused me is this page in the wandb docs: Limits & Performance - Documentation. There it says “This rate allows you to run approximately 15 processes in parallel” and “If you need to run more than 15 processes in parallel send an email to …” - but it sounds like you’re perfectly fine running more than that, if you’re happy for your data to be synced less frequently in each process! In hindsight I now see the “without being throttled” qualifier there, but I missed it on my first read.

As for my scenario, it’s really just a large number of experiments in parallel, as you say. If I’m writing a paper, I often need to run an algorithm plus several baselines, on multiple domains, with multiple seeds. That often multiplies out to well over a hundred experiments. And I often like to run one final batch of everything after I’ve fixed all the bugs and finalized my code, so I’ll submit all of them in one go to either a Slurm cluster (running Ray) or directly to a Ray cluster.