1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.54
- Python version: 3.12
- OS: Ubuntu 22.04
- Cloud/Infrastructure: none
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected:
- Actual:
Crash after upgrading from 2.53.0 to 2.54.0.
Below is a summary of everything I learned. Should I open a bug on GitHub, or is there a better solution?
_opentelemetry Serialization Asymmetry in Ray Client Mode (2.54.0+)
Component: ray/util/tracing/tracing_helper.py, ray/_private/function_manager.py, ray/actor.py
Versions affected: 2.54.0, 2.55.1 (and likely all subsequent)
Regression from: 2.53.0
Summary
When using Ray Client (ray.init("ray://…")) with the --tracing-startup-hook flag configured on the head node, actor __init__ crashes on the
worker with:
AttributeError: 'NoneType' object has no attribute 'trace'
inside _resume_span in tracing_helper.py.
Root cause
Ray 2.54.0 made two related changes:
- function_manager.py now calls _inject_tracing_into_class(actor_class) on the worker side after loading the actor class from GCS.
- tracing_helper.py added a ray_tracing_wrapped marker so that worker-side re-injection skips already-wrapped methods (to prevent
double-wrapping after the cloudpickle round-trip).
Together these create a serialization asymmetry when the driver is a Ray Client (not a full cluster node):
- On the driver, _make_actor calls _inject_tracing_into_class, which wraps actor methods with _resume_span closures and stamps them
ray_tracing_wrapped = True.
- The _resume_span closure captures two module-level names from tracing_helper:
  - _is_tracing_enabled: a function reference. cloudpickle serializes top-level functions by module reference, so on the worker it resolves to
  the worker's live function, which returns True (the worker called _enable_tracing() via the startup hook).
  - _opentelemetry: a variable value. cloudpickle serializes this by value at pickle time. On a Ray Client driver, _enable_tracing() is never
  called, so this value is None.
- On the worker, _inject_tracing_into_class sees ray_tracing_wrapped = True and skips re-wrapping. The stale closure is used as-is.
- When __init__ runs, _resume_span evaluates _is_tracing_enabled() → True and _ray_trace_ctx is non-None (injected by _tracing_actor_creation on
the driver because the cluster has tracing enabled), so it proceeds to call _opentelemetry.trace → None.trace → crash.
In Ray 2.53.0 neither of these changes existed: function_manager.py did not call _inject_tracing_into_class on the worker, and there was no
ray_tracing_wrapped guard, so the worker always re-wrapped with a fresh closure containing its own valid _opentelemetry.
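The asymmetry described above can be illustrated without Ray or cloudpickle. The sketch below is a stdlib-only simulation, not cloudpickle's actual code path: it rebuilds a function from its code object with a globals dict in which one name is resolved on the receiving side (like a function pickled by reference) and the other is a snapshot taken on the sending side (like a variable pickled by value). All names here (resume_span, rebuild_on_worker, driver_globals, worker_globals) are hypothetical stand-ins, not Ray identifiers.

```python
import types


def resume_span():
    # Mirrors the shape of the failing check: a function lookup guards
    # an attribute access on a module-level variable.
    if is_tracing_enabled():
        return otel.trace
    return None


# Driver (Ray Client): tracing was never enabled, so the variable is None.
driver_globals = {"is_tracing_enabled": lambda: False, "otel": None}

# Worker: the startup hook enabled tracing and populated the variable.
worker_globals = {
    "is_tracing_enabled": lambda: True,
    "otel": types.SimpleNamespace(trace="worker-tracer"),
}


def rebuild_on_worker(func, by_reference):
    """Simulate shipping func from driver to worker (assumed model).

    Names listed in by_reference resolve to the worker's objects; every
    other global the function uses keeps the driver's snapshot.
    """
    new_globals = {}
    for name in func.__code__.co_names:
        if name in by_reference:
            new_globals[name] = worker_globals[name]
        elif name in driver_globals:
            new_globals[name] = driver_globals[name]
    return types.FunctionType(func.__code__, new_globals, func.__name__)


# Mixed resolution: live worker function, stale driver variable.
stale = rebuild_on_worker(resume_span, by_reference={"is_tracing_enabled"})
try:
    stale()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# A wrapper built entirely from worker state (the 2.53.0 behaviour) works.
fresh = types.FunctionType(resume_span.__code__, worker_globals, "resume_span")
fresh_result = fresh()

print(error_message)
print(fresh_result)
```

The mixed case fails with the same AttributeError as in the report, while the wrapper built only from worker-side state returns the worker's tracer.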
What does NOT fix it
Upgrading opentelemetry-instrumentation-grpc: The worker pod logs show circular import errors from this package during startup, which look
suspicious. These are a red herring — they are a nuisance from gRPC initialization order but do not affect _opentelemetry being None.
Calling _enable_tracing() on the driver after ray.init(): This sets _global_is_tracing_enabled = True on the driver, which activates
_invocation_actor_class_remote_span (the wrapper around ActorClass.remote()). That wrapper does:
span.set_attribute("ray.actor_id", result._ray_actor_id.hex())
In Ray Client mode, result is a ClientActorRef, which has no _ray_actor_id attribute — a second, different crash. So _enable_tracing() cannot be
called on a Ray Client driver.
Workaround
After ray.init("ray://…") succeeds on the driver, populate _opentelemetry directly without setting _global_is_tracing_enabled:
import ray.util.tracing.tracing_helper as _th
if _th._opentelemetry is None:
    _th._opentelemetry = _th._OpenTelemetryProxy()
This ensures the pickled _resume_span closure carries a valid proxy instead of None, while leaving tracing disabled on the driver so
_invocation_actor_class_remote_span remains a no-op.
Proper fix suggestion
_inject_tracing_into_class should not serialize _opentelemetry by value into the closure. Instead, _resume_span should always read _opentelemetry
from the module at call time (i.e., reference the module, not close over the variable). Alternatively, _invocation_actor_class_remote_span
should guard against ClientActorRef before accessing _ray_actor_id, which would make calling _enable_tracing() on a Ray Client driver safe.
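Both suggestions can be sketched in a few lines. This is a stdlib-only illustration under stated assumptions, not a patch against Ray: fake_tracing_helper stands in for ray.util.tracing.tracing_helper, and resume_span_late_bound and actor_id_hex_or_none are hypothetical names introduced only for this example.

```python
import importlib
import sys
import types

# Hypothetical stand-in for the tracing helper module; starts uninitialised.
helper = types.ModuleType("fake_tracing_helper")
helper._opentelemetry = None
sys.modules[helper.__name__] = helper


def resume_span_late_bound():
    # Option 1: look the module up by name at call time, so the value comes
    # from the process that runs the call rather than from a snapshot.
    mod = importlib.import_module("fake_tracing_helper")
    otel = mod._opentelemetry
    if otel is None:
        return None  # tracing not initialised here; skip the span
    return otel.trace


def actor_id_hex_or_none(result):
    # Option 2: tolerate handles that lack _ray_actor_id instead of
    # assuming every result exposes it.
    actor_id = getattr(result, "_ray_actor_id", None)
    return actor_id.hex() if actor_id is not None else None


before = resume_span_late_bound()
helper._opentelemetry = types.SimpleNamespace(trace="tracer")
after = resume_span_late_bound()

client_handle = types.SimpleNamespace()  # no _ray_actor_id attribute
regular_handle = types.SimpleNamespace(_ray_actor_id=b"\x01\x02")

print(before, after)
print(actor_id_hex_or_none(client_handle), actor_id_hex_or_none(regular_handle))
```

The late-bound lookup also degrades gracefully: when the module-level value is still None it returns early instead of dereferencing it, which is the check the crashing path lacks.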