Clashing trial_ids / run_id in WandbLoggerCallback

I have now run into situations a couple of times where I had two or more runs with clashing trial_ids, which are also used to set the run ID on wandb, where it needs to be unique. Both times this happened, I was starting a lot of runs at the same time on the same cluster (e.g. through a big tune.run() call with 100+ trials). Is this a known issue by any chance?

If I’m just running a single experiment at a time I can work around this by manually setting a run ID when creating the wandb logger callback, but if I’m e.g. doing a Ray Tune run with multiple trials, I can’t do that (since I only create the callback once).

Hey @mgerstgrasser, thanks for raising this issue!

What sort of cluster did you encounter this error on, and how many concurrent trials were you running?

Looks like trial IDs are generated from UUID1:

    @classmethod
    def generate_id(cls):
        return str(uuid.uuid1().hex)[:8]

The first eight hex characters of a UUID1 are generated from the timestamp. So, if two trials are created at the same time, they’ll have the same ID.
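
For illustration, here’s a quick check (my own snippet, not from the Ray codebase) showing that the 8-character prefix is literally the low 32 bits of the UUID1 timestamp, with the clock sequence and node fields only appearing later in the string:

    import uuid

    # The first 8 hex characters of a UUID1 are "time_low": the low 32 bits of
    # its 100-nanosecond timestamp. The clock sequence and node (MAC) fields
    # only appear later, so two trials created at the same instant, even on
    # different machines, can share the same 8-character prefix.
    u = uuid.uuid1()
    print(u.hex[:8])                  # e.g. '1a2b3c4d', derived from the clock
    print(format(u.time_low, "08x"))  # identical: the prefix is just time_low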

@kai @xwjiang2010 what do you think about switching to UUID4? If it’s cheap and supported, we could also do something like node_index + cpu_index + uuid1().hex[:6].

Ah, I hadn’t looked into uuid1 in detail, but that would explain it!

To your question: I’ve encountered this on two separate clusters. One is a Slurm cluster, where I was submitting a number of Slurm jobs at the same time, each of them starting its own Ray instance (by just calling ray.init() in Python). The other was an autoscaling Ray cluster on AWS, where I submitted jobs using the Ray Jobs API. In both cases it’s quite possible for jobs to start at the exact same time, even if they are submitted with a small delay between them: on Slurm, another big job might finish, suddenly freeing up 40 CPU cores, and the scheduler might then start 40 of my jobs on that machine at the exact same time. Similarly with autoscaling: it takes a minute for a new instance to become available, and then a number of jobs might start on that machine at the same time, even if I left a few seconds between submissions. If uuid1 is purely based on the timestamp (or even timestamp and node ID), I can see how this would make collisions very plausible.

I was submitting around 100 jobs at a time, which is not unusual for me - I often submit dozens or around 100 jobs at once. I’ve noticed the collision at least three times, but there may have been more instances I didn’t notice, because this fails silently in a very bad way (see below).

I agree that switching to uuid4 could make sense here. With node_index + cpu_index + uuid1().hex[:6] I think we would need to make sure that this is really always unique. In particular, would cpu_index be the index of the physical core, or the logical core that this particular Ray instance is pinned on? What happens if two Ray instances are running on the same machine?

The other thing I would suggest changing is how wandb.init() is called. We may not be able to rule out collisions with certainty, so they need to be handled here. Right now, the default behavior of wandb is that if you call wandb.init() twice with the same id argument and without setting the resume kwarg, then the run on wandb will be overwritten (!) - leading to potential data loss. Even worse, if two runs concurrently log to the same run ID, you can end up with a mix of data from both in the same run on wandb. And finally, worst of all, if the sequence of events is just right, you can end up with the config logged from one run, but data from another!

Imagine for instance you have a “good” algorithm “G” that always logs performance 100, but has some set-up to do and hence a delay between when wandb.init() is called and when wandb.log() is first called; and a “bad” algorithm “B” with performance 0, but no delay between init and log. Then the following sequence of events can happen, where G and B are two separate processes for the good and bad algorithm, and G starts just before B does:

    G calls wandb.init(id="123", config={"algorithm": "good"})
       -> Run 123 created with config "algorithm": "good"
    B calls wandb.init(id="123", config={"algorithm": "bad"})
       -> Run 123 config overwritten to "algorithm": "bad"    <---- !!!!
    # no delay for B:
    B calls wandb.log({"performance": 0})  # implicitly step=0
       -> Run 123, step 0 on wandb set to performance: 0
    # G after a delay:
    G calls wandb.log({"performance": 100})  # implicitly step=0
       -> Run 123, step 0 on wandb overwritten to performance: 100   <----- !!!!

Now on wandb you have a run which says it’s from the bad algorithm but with performance 100. That is problematic, obviously - imagine you’re comparing algorithms G and B and you conclude that B is great! You can easily reproduce the above using two Jupyter notebooks executing cells in that order, but it’s not purely theoretical. I’ve had this happen in practice.

I’ve raised this separately with wandb (@vanpelt, tagging you here as well). But I think Ray should also handle this. The behavior can be changed with the resume argument in wandb.init() (wandb docs). As per above, just passing in resume=False as Ray does right now is equivalent to setting resume=None and leads to the overwriting behavior above. I think as the default, setting resume="never" and wrapping the call to wandb.init() in a try-except block would be much safer. wandb.init() would fail if a run with the same ID already exists, and Ray could handle that (e.g. call it again with a different ID). If we also wanted to handle resuming trials, we would need to handle that separately and set the resume argument to wandb.init() accordingly.
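
To make this concrete, here’s a minimal sketch of what I mean (a hypothetical helper, not Ray’s actual callback code; the exact exception wandb raises on a clash may depend on the wandb version):

    import uuid
    import wandb

    def init_wandb_safely(trial_id, config):
        # resume="never" makes wandb refuse to attach to an existing run with
        # this id instead of silently overwriting it.
        try:
            return wandb.init(id=trial_id, config=config, resume="never")
        except Exception:
            # On a clash, retry once with a fresh random id rather than
            # clobbering the existing run.
            return wandb.init(id=uuid.uuid4().hex[:8], config=config, resume="never")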

Hey @mgerstgrasser, thanks for the detailed writeup! I’ve responded to your questions below.

With node_index + cpu_index + uuid1().hex[:6] I think we would need to make sure that this is really always unique. In particular, would cpu_index be the index of the physical core, or the logical core that this particular Ray instance is pinned on? What happens if two Ray instances are running on the same machine?

Good questions.

I imagined cpu_index would represent a logical core, and I assumed that two Ray instances wouldn’t run on the same machine.

That said, I didn’t really think this suggestion through. Using UUID4 is probably a simpler and more effective solution.
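
For reference, here’s roughly what that would look like (illustrative sketch only, not the actual change):

    import uuid

    class Trial:  # minimal stand-in for Ray Tune's Trial class, for illustration
        @classmethod
        def generate_id(cls):
            # uuid4 draws from os.urandom(), so the 8-character prefix is random
            # rather than timestamp-derived, making same-instant clashes
            # extremely unlikely.
            return uuid.uuid4().hex[:8]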

Right now, the default behavior of wandb is that if you call wandb.init() twice with the same id argument and without setting the resume kwarg, then the run on wandb will be overwritten (!) - leading to potential data loss. Even worse, if two runs concurrently log to the same run ID, you can end up with a mix of data from both in the same run on wandb. And finally, worst of all, if the sequence of events is just right, you can end up with the config logged from one run, but data from another!

Yeah, this’d be bad.

I think as the default, setting resume="never" and wrapping the call to wandb.init() in a try-except block would be much safer.

I’ll open a PR to set resume="never" by default.

wandb.init() would fail if a run with the same ID already exists, and Ray could handle that (e.g. call it again with a different ID).

Rather than handling duplicate IDs in the W&B integration, I’ll update the generate_id algorithm to guarantee unique IDs.

Rather than handling duplicate IDs in the W&B integration, I’ll update the generate_id algorithm to guarantee unique IDs.

Actually, I discussed this with my colleague: to guarantee unique IDs, we’d need to add a lot of complexity to the system. So, I think we’re going to go with UUID4 for now.

That said, I’ll still set resume='never' by default.

@mgerstgrasser I’ve replied to the suggested changes to the default resume behavior in [Tune] Don't overwrite W&B runs by default by bveeramani · Pull Request #29916 · ray-project/ray · GitHub. But for future reference, I wanted to reply here as well:

I don’t think changing the default resume behavior is the correct way to go here, as it would effectively disallow users from stopping and resuming experiments, i.e. they would by default run into an error when trying to continue their experiments.

The ID clash you experienced seems to be quite rare and should be mitigated by the change to uuid4. If not, you can always pass resume="never" in your wandb config key in the Ray Tune config to manually override this setting.

When changing default settings, we try to optimize for the common use cases, and here the current default is probably better than blocking resumes. An option would be resume="auto", but we’ll need a bit more background on its behavior before we make a decision there.

Curious to hear your thoughts!

Thank you, @bveeramani for looking into this! I’m really glad we figured this out. I was honestly beginning to question my sanity over the last few weeks because I kept seeing things on wandb that did not make sense. I apologise in case I came across a bit strongly earlier. This has definitely been one of the more insidious bugs I’ve faced. I even looked into how trial IDs were generated some six weeks ago, saw that it was using uuid, and immediately dismissed it as something to worry about, because everyone knows that uuid conflicts are practically impossible…

Also thank you @kai for looking into this. I’ll reply regarding the resume behavior on the Github PR so we have that all in the same place. Might take me a bit to get to it. I think I already have some data for the different resume behaviors, but need to find it. Short answer is that I think the callback needs to handle resuming runs explicitly to get the right behavior.

Just a couple more thoughts on trial IDs though:

  • With uuid4(), could there ever be a situation where a global RNG seed is set in a way that accidentally makes this deterministic? I’ve seen one occasion where people wanted reproducible experiments but couldn’t make it work with config["seed"], so they resorted to calling all sorts of RNG seeding functions globally using a seed value set through os.environ, after which e.g. random.random() effectively became deterministic. It seems uuid4() uses os.urandom(), which from a quick google can’t be seeded and should be safe (see the quick check after this list), but I wanted to mention it in case there are scenarios I’m not aware of where it might not be truly random.

  • An alternative would be to call wandb.init() without setting an id manually at all; wandb would then auto-generate one (guaranteed to be unique) and return it from the wandb.init() call. That would have the advantage that we don’t have to worry about clashes at all, but the disadvantage is that we would need to store the wandb run ID somewhere if we later wanted to interrupt and resume the run. I think that would be tricky - callbacks can’t store state into a checkpoint, can they? So probably not practical, but I wanted to mention it.
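
On the first point, here’s the quick check I had in mind (my own snippet, not from Ray), which suggests that seeding the global random module does not make uuid4() deterministic:

    import random
    import uuid

    # uuid4() draws its bytes from os.urandom(), which random.seed() does not
    # touch, so re-seeding the global RNG does not reproduce the same prefix.
    random.seed(0)
    a = uuid.uuid4().hex[:8]
    random.seed(0)
    b = uuid.uuid4().hex[:8]
    print(a, b, a == b)  # expected: two different prefixes, False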

Also, @kai, is this clash really such a rare thing? I’ve had it happen on two completely separate occasions on two different clusters now. The only thing I was doing that was maybe a little unusual is that I started each trial separately, either as a separate Slurm job or using the Ray Jobs API. I think if you just use a single Tune call for multiple trials (using either a hyperparameter search space or a list of Tune Experiment objects), then it seems Tune will stagger the trial starts, so there wouldn’t be two starting at the same time, and so no uuid1 clash. But submitting experiments separately can’t be that unusual; there are plenty of reasons why you might want to do that.