Actor task fails running under Serve: is it normal to have this depth?

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.

2. Environment:

  • Ray version: 2.51.1
  • Python version: 3.10
  • OS: python:3.10-slim Docker image running on Ubuntu 24.04
  • Cloud/Infrastructure: GCP
  • Other libs/tools (if relevant): -

3. What happened vs. what you expected:

When a task fails in our system, I see an unusually high depth value in the log message:

```
task_manager.cc:1321: Task failed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={...}, task_id=..., task_name=..., job_id=01000000, num_args=6, num_returns=1, max_retries=0, depth=660407, attempt_number=0, actor_task_spec={actor_id=..., actor_caller_id=ffffffffffffffffbb5128cfbcad1549ce6532fa01000000, seq_no=5828, retry_exceptions=0}
```

As far as I can tell, nothing in our codebase would explain a call chain this long. Is there anything in Ray Serve that would explain this?

It seems likely that the `depth` field in Ray task logs represents the call-chain depth: it starts at 0 for the driver and is incremented each time a task or actor submits another task or actor. A very high depth (e.g., 660,407) is unusual and typically indicates deeply nested or recursive task/actor submissions; Ray Serve itself does not inherently create such deep chains unless your application logic does so (e.g., deployments calling each other in a loop or recursively) [common.proto, see `int64 depth = 27;`][task_spec.cc, see `stream << ", depth=" << GetDepth()`].
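
For illustration, here is a minimal sketch of how recursive submissions would grow that counter (the function name `chain` and the counts are made up, not anything from your app): each `.remote()` call made from inside a running task or actor is recorded one level deeper than its caller.

```python
import ray

ray.init()

@ray.remote
def chain(n):
    # Each recursive .remote() submission runs one level deeper than its
    # caller, so the internal depth counter grows by 1 per hop. A chain of
    # ~660,000 hops would produce depth values like the one in your log.
    if n == 0:
        return 0
    return ray.get(chain.remote(n - 1)) + 1

# The driver runs at depth 0; chain(5) submits 6 nested tasks (depths 1..6).
print(ray.get(chain.remote(5)))
```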

My understanding is that unless your code, or a library you use, is recursively submitting tasks or actors, this depth should not grow so large. There is no documented Ray Serve mechanism that would automatically cause such high depth values. Would you like a step-by-step breakdown of how depth is tracked, or suggestions for debugging where this might be happening in your code?
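
If you want to locate the runaway submission site, one hedged approach is to thread an explicit hop counter through your own calls and log whenever it exceeds what you expect. Note that, to my knowledge, Ray does not expose its internal depth counter through a public API, so the `hop` parameter, the `Worker` actor, and the threshold below are made-up application-level bookkeeping, not Ray features.

```python
import logging

import ray

logging.basicConfig(level=logging.WARNING)
MAX_EXPECTED_HOPS = 50  # hypothetical threshold; tune it to your application


@ray.remote
class Worker:
    def handle(self, payload, hop=0):
        # Log the task that first exceeds the expected nesting so you can
        # trace back to whichever caller keeps re-submitting work.
        if hop > MAX_EXPECTED_HOPS:
            logging.warning(
                "deep call chain (hop=%d) in task %s",
                hop,
                ray.get_runtime_context().get_task_id(),
            )
        # ... real work goes here; pass hop + 1 to any downstream .remote() calls
        return payload


if __name__ == "__main__":
    ray.init()
    w = Worker.remote()
    print(ray.get(w.handle.remote("payload", hop=0)))
```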

Hint: Mention @RunLLM in the post for followups.