A few questions about task scheduling and retry

Hello, I am new to Ray and have a few questions about task scheduling and retry:

  • If worker 1 owns task A, is task A always going to be scheduled onto a worker other than worker 1? (Technically an owner is also a worker, so it should be able to run a task?)
  • According to Ray’s whitepaper, when a task is scheduled, its owner will “send the task specification over gRPC to the leased worker”. What is the “task specification” specifically? Where is it stored, in-process of the owner?
  • Suppose worker 1 scheduled task A to worker 2 on the same node, and that node failed and task A also failed as a consequence. According to the documentation, "when a worker is executing a task, if the worker dies unexpectedly, either because the process crashed or because the machine failed, Ray will rerun the task (after a delay of several seconds) until either the task succeeds or the maximum number of retries is exceeded. " How/Where would Ray retrieve the specification of task A in order to rerun the task?

Any clarification would be greatly appreciated. Thanks in advance!

  1. Yes, this can happen in certain cases if the owner worker yields its CPU by calling ray.get() or ray.wait(). Otherwise, it would never happen since the worker is considered occupied while its task is running.
  2. The task spec is a protobuf (wrapper class here: ray/task_spec.cc at master · ray-project/ray · GitHub)
  3. In this case, the entire program is lost (assuming worker 1 is the “root node” of the program). You’d need the root worker to be a detached actor with auto restart enabled for recovery to work in this case (i.e., this is what Serve does).