Ah, I hadn’t looked into `uuid1` in detail, but that would explain it!
To your question: I’ve encountered this on two separate clusters. One is a Slurm cluster, where I was submitting a number of Slurm jobs at the same time, each of them starting its own Ray instance (by just calling `ray.init()` in Python). The other was an autoscaling Ray cluster on AWS, where I submitted jobs using the Ray Jobs API. In both cases it’s quite possible for jobs to start at the exact same time, even if they are submitted with a small delay between them: on Slurm, another big job might finish, suddenly freeing up 40 CPU cores, and the scheduler might then start 40 of my jobs on that machine at the exact same time. Similarly with autoscaling: it takes a minute for a new instance to become available, and then a number of jobs might start on that machine at the same time, even if I left a few seconds between submitting them.

If `uuid1` is based purely on timestamp (or even timestamp and node ID), then I can see how this makes collisions very plausible. I was submitting around 100 jobs at a time, which is not unusual for me; I often submit dozens to around 100 jobs at once. I’ve noticed the collision at least three times, but there may have been more instances I didn’t notice, because this fails silently in a very bad way (see below).
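To put a rough number on it (my own back-of-the-envelope, not anything from Ray’s code): in CPython’s pure-Python path, `uuid1` combines a 100 ns timestamp, a MAC-derived node field that is identical for all processes on one machine, and a random 14-bit clock sequence (some platforms delegate to libuuid instead, which behaves better). If several processes land on the same timestamp tick, uniqueness rests on those 14 bits alone:

```python
import math

# Birthday bound: probability that at least two of n_jobs
# random `bits`-bit values coincide.
def collision_probability(n_jobs: int, bits: int) -> float:
    pairs = n_jobs * (n_jobs - 1) / 2
    return 1 - math.exp(-pairs / 2**bits)

print(f"{collision_probability(40, 14):.1%}")   # ~4.6%: 40 jobs sharing a tick
print(f"{collision_probability(100, 14):.1%}")  # ~26%: 100 jobs sharing a tick
```

Under that assumption, a batch of ~100 simultaneous starts has roughly a one-in-four chance of a collision, which would make seeing this a handful of times entirely unsurprising.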
I agree that switching to `uuid4` could make sense here. With `node_index + cpu_index + uuid1().hex[:6]` I think we would need to make sure that this is really always unique. In particular, would `cpu_index` be the index of the physical core, or of the logical core that this particular Ray instance is pinned to? And what happens if two Ray instances are running on the same machine?
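For comparison, a minimal sketch of the `uuid4` route (the helper name and suffix length here are mine, not Ray’s): `uuid4` is 122 random bits, so no node or CPU index is needed at all, and the only thing to watch is how aggressively the hex string is truncated.

```python
import uuid

# Hypothetical helper, not Ray's actual code: derive the trial suffix
# from uuid4 (122 random bits) instead of uuid1.
def make_trial_suffix(n_hex: int = 8) -> str:
    # n_hex hex chars keep 4 * n_hex bits of entropy; with 8 chars
    # (32 bits), ~100 concurrent jobs collide with probability ~1e-6.
    return uuid.uuid4().hex[:n_hex]
```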
The other thing I would suggest changing is how `wandb.init()` is called. We may not be able to rule out collisions with certainty, so they need to be handled here. Right now, the default behavior of wandb is that if you call `wandb.init()` twice with the same `id` argument and without setting the `resume` kwarg, the run on wandb will be overwritten (!), leading to potential data loss. Even worse, if two runs concurrently log to the same run ID, you can end up with a mix of data from both in the same run on wandb. And finally, to make things absolutely bad, if the sequence of events is just right, you can end up with the config logged from one run but data from another! Imagine for instance you have a “good” algorithm “G” that always logs performance 100, but has some set-up to do and hence a delay between when `wandb.init()` is called and when `wandb.log()` is called for the first time; and a “bad” algorithm “B” with performance 0, but no delay between init and log. Then the following sequence of events can happen, where G and B are two separate processes for the good and bad algorithm, and G starts just before B does:
```
G calls wandb.init(id="123", config={"algorithm": "good"})
  -> Run 123 created with config "algorithm": "good"
B calls wandb.init(id="123", config={"algorithm": "bad"})
  -> Run 123 config overwritten to "algorithm": "bad"          <---- !!!!
# no delay for B:
B calls wandb.log({"performance": 0})    # implicitly step=0
  -> Run 123, step 0 on wandb set to performance: 0
# G, after a delay:
G calls wandb.log({"performance": 100})  # implicitly step=0
  -> Run 123, step 0 on wandb overwritten to performance: 100  <---- !!!!
```
Now on wandb you have a run which says it’s from the bad algorithm but with performance 100. That is problematic, obviously - imagine you’re comparing algorithms G and B and you conclude that B is great! You can easily reproduce the above using two Jupyter notebooks executing cells in that order, but it’s not purely theoretical. I’ve had this happen in practice.
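For completeness, here is a self-contained sketch of that race (my own reproduction, not Ray code; it assumes a logged-in wandb environment and a throwaway project name, and the exact outcome depends on timing and possibly on the wandb version’s multiprocessing behavior):

```python
import time
from multiprocessing import Process

import wandb

def good():
    # G: init immediately, then a set-up delay before the first log.
    wandb.init(project="collision-demo", id="123",
               config={"algorithm": "good"})
    time.sleep(5)
    wandb.log({"performance": 100})  # implicitly step=0
    wandb.finish()

def bad():
    # B: starts just after G, no delay between init and log.
    time.sleep(1)
    wandb.init(project="collision-demo", id="123",
               config={"algorithm": "bad"})  # overwrites G's config
    wandb.log({"performance": 0})  # implicitly step=0
    wandb.finish()

if __name__ == "__main__":
    g, b = Process(target=good), Process(target=bad)
    g.start(); b.start()
    g.join(); b.join()
```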
I’ve raised this separately with wandb (@vanpelt, tagging you here as well). But I think Ray should also handle this. The behavior can be changed with the `resume` argument to `wandb.init()` (wandb docs). As per the above, just passing `resume=False` as Ray does right now is equivalent to setting `resume=None` and leads to the overwriting behavior described above. I think that setting `resume="never"` as the default and wrapping the call to `wandb.init()` in a `try`/`except` block would be much safer: `wandb.init()` would then fail if a run with the same ID already exists, and Ray could handle that (e.g. call it again with a different ID). If we also wanted to handle resuming trials, we would need to handle that separately and set the `resume` argument to `wandb.init()` accordingly.
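A minimal sketch of what that could look like (the helper name and retry policy are mine, and I’m catching the broad `wandb.Error` base class since I’m not certain which subclass a `resume="never"` conflict raises):

```python
import uuid

import wandb

# Illustrative only - not Ray's actual integration code.
def safe_wandb_init(run_id: str, max_attempts: int = 3, **kwargs):
    for _ in range(max_attempts):
        try:
            # resume="never" makes init fail if a run with this id
            # already exists, instead of silently overwriting it.
            return wandb.init(id=run_id, resume="never", **kwargs)
        except wandb.Error:
            run_id = uuid.uuid4().hex[:8]  # retry with a fresh id
    raise RuntimeError("could not create a wandb run with a unique id")
```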