Based on my understanding, a Ray worker node keeps many results of completed tasks in memory. If a worker node crashes, all those results are lost, and other worker nodes will re-execute the failed tasks.
In my implementation, I’m also using an external key-value store to persist each task’s result, where the key is the Ray task ID and the value is the task’s result.
My question is: when another worker retries a task that was previously completed on a crashed node, will the retried task have the same task ID as the original one? If the task IDs are consistent, I could leverage the external storage to achieve idempotence.
No. Retried tasks in Ray do not have the same task ID as the original attempt: each retry is assigned a unique task ID, so you cannot rely on task ID consistency to achieve idempotence with external storage. This behavior is not explicitly documented, but it has been confirmed by Ray core maintainers, is discussed in the community, and can be inferred from Ray’s design and test code, where each retry is treated as a new task attempt with a new ID (discuss.ray.io, task_manager_test.cc).
If you need idempotence, you should use your own deterministic key (e.g., a hash of task arguments) for external storage, not the Ray task ID. Would you like more detail on how to implement this pattern?
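As a rough illustration of the deterministic-key idea, here is a minimal sketch in plain Python. It assumes the task arguments are JSON-serializable, and it uses an in-memory dict as a stand-in for the external key-value store. The names `STORE`, `make_key`, and `run_idempotent` are hypothetical and are not part of Ray.

```python
import hashlib
import json

# Hypothetical stand-in for the external key-value store (e.g. Redis or a DB).
STORE = {}


def make_key(func_name, args, kwargs):
    """Build a deterministic key from the function name and its arguments.

    Assumes the arguments are JSON-serializable; anything else falls back to
    str(), which is only safe if that string is stable across processes.
    """
    payload = json.dumps(
        {"func": func_name, "args": args, "kwargs": kwargs},
        sort_keys=True,  # make the key independent of kwargs ordering
        default=str,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_idempotent(func, *args, **kwargs):
    """Return the stored result if one exists, otherwise compute and persist it."""
    key = make_key(func.__name__, args, kwargs)
    if key in STORE:
        return STORE[key]
    result = func(*args, **kwargs)
    STORE[key] = result
    return result


CALLS = []  # records real executions, for demonstration only


def square(x):
    CALLS.append(x)
    return x * x


first = run_idempotent(square, 4)   # executes square and stores the result
second = run_idempotent(square, 4)  # served from STORE, square is not re-run
```

In a Ray program, the same lookup-then-compute logic would presumably live inside the body of the `@ray.remote` function, so that a retried attempt on another worker finds the persisted result under the same key regardless of which ID Ray assigns to it. Note that the check and the write are not atomic here; if two attempts can run concurrently, the store would need a compare-and-set or similar primitive.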