AttributeError: 'NoneType' object has no attribute 'trace' in 2.54.0

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.54
  • Python version: 3.12
  • OS: Ubuntu 22.04
  • Cloud/Infrastructure: none
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
  • Actual:

Crash when moving from 2.53.0 to 2.54.0.

The following is a summary of everything I learned. Should I open a bug on GitHub, or is there a better solution?

_opentelemetry Serialization Asymmetry in Ray Client Mode (2.54.0+)

Component: ray/util/tracing/tracing_helper.py, ray/_private/function_manager.py, ray/actor.py

Versions affected: 2.54.0, 2.55.1 (and likely all subsequent)

Regression from: 2.53.0


Summary

When using Ray Client (ray.init("ray://…")) with the --tracing-startup-hook flag configured on the head node, actor __init__ crashes on the worker with:

AttributeError: 'NoneType' object has no attribute 'trace'

inside _resume_span in tracing_helper.py.


Root cause
Ray 2.54.0 made two related changes:

  1. function_manager.py now calls _inject_tracing_into_class(actor_class) on the worker side after loading the actor class from GCS.
  2. tracing_helper.py added a ray_tracing_wrapped marker so that worker-side re-injection skips already-wrapped methods (to prevent
    double-wrapping after the cloudpickle round-trip).

Together, these create a serialization asymmetry when the driver is a Ray Client (not a full cluster node):
  • On the driver, _make_actor calls _inject_tracing_into_class, which wraps actor methods with _resume_span closures and stamps them
    ray_tracing_wrapped = True.
  • The _resume_span closure captures two module-level names from tracing_helper:
    • _is_tracing_enabled — a function reference. cloudpickle serializes top-level functions by module reference, so on the worker it resolves to
      the worker’s live function, which returns True (the worker called _enable_tracing() via the startup hook).
    • _opentelemetry — a variable value. cloudpickle serializes this by value at pickle time. On a Ray Client driver, _enable_tracing() is never
      called, so this value is None.
  • On the worker, _inject_tracing_into_class sees ray_tracing_wrapped = True and skips re-wrapping. The stale closure is used as-is.
  • When __init__ runs, _resume_span evaluates _is_tracing_enabled() → True and _ray_trace_ctx is non-None (injected by _tracing_actor_creation on
    the driver because the cluster has tracing enabled), so it proceeds to evaluate _opentelemetry.trace → None.trace → crash.

In Ray 2.53.0, neither of these changes existed: function_manager.py did not call _inject_tracing_into_class on the worker, and there was no ray_tracing_wrapped guard, so the worker always re-wrapped with a fresh closure containing its own valid _opentelemetry.
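
The asymmetry described above can be modelled without Ray or cloudpickle. What follows is a minimal simulation, not Ray's code: FakeProxy, make_process, ship_wrapper, and run_on_worker are hypothetical stand-ins, and the by-value snapshot is written out by hand to mirror the pickling behaviour this post describes.

```python
from types import SimpleNamespace


class FakeProxy:
    """Hypothetical stand-in for an initialised tracing proxy."""
    trace = "tracer-api"


def make_process(tracing_enabled):
    """Model one process's copy of the tracing module state."""
    return SimpleNamespace(
        _opentelemetry=FakeProxy() if tracing_enabled else None,
        _is_tracing_enabled=lambda: tracing_enabled,
    )


def ship_wrapper(sender, receiver):
    """Model the described round-trip: the variable's value is snapshotted
    on the sender, the function is resolved again on the receiver."""
    otel_snapshot = sender._opentelemetry          # by value, at pickle time

    def resume_span():
        if not receiver._is_tracing_enabled():     # by reference, at call time
            return "tracing off"
        return otel_snapshot.trace                 # uses the stale snapshot

    return resume_span


driver = make_process(tracing_enabled=False)  # Ray Client driver, no tracing
worker = make_process(tracing_enabled=True)   # worker with the startup hook


def run_on_worker():
    """Run the driver-built wrapper on the worker and report the failure."""
    try:
        return ship_wrapper(driver, worker)()
    except AttributeError as exc:
        return str(exc)
```

In this model, run_on_worker() returns the same message as the reported AttributeError, while a wrapper built in a process that has tracing enabled (ship_wrapper(worker, worker)) resolves the tracer normally.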

What does NOT fix it
  • Upgrading opentelemetry-instrumentation-grpc: The worker pod logs show circular import errors from this package during startup, which look suspicious. These are a red herring: they are a nuisance from gRPC initialization order but do not affect _opentelemetry being None.
  • Calling _enable_tracing() on the driver after ray.init(): This sets _global_is_tracing_enabled = True on the driver, which activates _invocation_actor_class_remote_span (the wrapper around ActorClass.remote()). That wrapper does:
        span.set_attribute("ray.actor_id", result._ray_actor_id.hex())
    In Ray Client mode, result is a ClientActorRef, which has no _ray_actor_id attribute, so this is a second, different crash. _enable_tracing() therefore cannot be called on a Ray Client driver.
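
The second crash comes down to an unguarded attribute access. Below is a minimal sketch of a guarded version, assuming the only requirement is to skip the attribute when it is missing. FakeSpan, FakeActorId, FakeActorHandle, FakeClientActorRef, and record_actor_id are hypothetical stand-ins for illustration, not Ray's classes or functions.

```python
class FakeSpan:
    """Hypothetical span that only records attributes."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


class FakeActorId:
    def hex(self):
        return "abc123"


class FakeActorHandle:
    """Stand-in for a regular actor handle, which carries _ray_actor_id."""
    _ray_actor_id = FakeActorId()


class FakeClientActorRef:
    """Stand-in for the Ray Client result, which has no _ray_actor_id."""


def record_actor_id(span, result):
    # Guarded access: only set the attribute when the handle exposes an id.
    actor_id = getattr(result, "_ray_actor_id", None)
    if actor_id is not None:
        span.set_attribute("ray.actor_id", actor_id.hex())
    return span
```

With this guard, a client-style result leaves the span without a ray.actor_id attribute instead of raising, and a regular handle is recorded as before.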

Workaround
After ray.init("ray://…") succeeds on the driver, populate _opentelemetry directly without setting _global_is_tracing_enabled:

    import ray.util.tracing.tracing_helper as _th

    # Give the pickled _resume_span closure a valid proxy instead of None.
    if _th._opentelemetry is None:
        _th._opentelemetry = _th._OpenTelemetryProxy()

This ensures the pickled _resume_span closure carries a valid proxy instead of None, while leaving tracing disabled on the driver so
_invocation_actor_class_remote_span remains a no-op.

Proper fix suggestion
_inject_tracing_into_class should not serialize _opentelemetry by value into the closure. Instead, _resume_span should always read _opentelemetry
from the module at call time (i.e., reference the module, not close over the variable). Alternatively, _invocation_actor_class_remote_span
should guard against ClientActorRef before accessing _ray_actor_id, which would make calling _enable_tracing() on a Ray Client driver safe.

It seems likely that you have identified a real regression in Ray tracing with Ray Client mode in 2.54.0+ that is not present in 2.53.0. Your analysis matches the changes in how _inject_tracing_into_class and the ray_tracing_wrapped marker interact with closure serialization, leading to stale references to _opentelemetry and resulting in the observed crash. The workaround you provided (manually setting _opentelemetry on the driver) is consistent with the root cause, but is not a robust long-term solution. There does not appear to be a better solution documented in the data sources, and your suggestion for a proper fix (ensuring the closure does not capture _opentelemetry by value) aligns with best practices for serialization in distributed systems.

Given the depth of your investigation and the lack of an official fix or workaround in the documentation or codebase, it seems appropriate to open a GitHub issue with your findings. This will help the Ray maintainers track and address the regression. Be sure to include your summary, reproduction steps, and workaround, as these will be valuable for maintainers and other users encountering the same problem. Would you like more detail on how to structure the GitHub issue or on the relevant code paths involved?