Only 10 ray data actor is working with ray train and the rest is just idle

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.51.2
  • Python version: py3.9
  • OS:
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: All ray data actor is working
  • Actual: Only 10 ray data actor is working

Hi everyone,

I’m using Ray Data together with Ray Train for model training. Ray Data is responsible for data preprocessing. My expectation is that Ray Data workers can scale out horizontally freely. When GPUs are idle waiting for data, increasing the number of workers should speed up data generation.

However, during actual runs, I notice many Ray Data workers show 0% CPU utilization. Only around 10 workers are actively processing data in the entire cluster. I have tried tuning parameters like prefetch_size and other related settings, but none of these attempts worked.

Has anyone encountered this similar issue before? Any suggestions would be greatly appreciated!

# len(file_paths) is about 200000
ds = ray.data.from_items([{"file_spec": p} for p in file_paths],override_num_blocks=int(len(file_paths) / 100))

logger.info(f"Data actor max_size: {args.num_data_workers}")


ds = ds.flat_map(
    ObsDataReader,
    fn_constructor_args=(
        args.obs_buckets_path,
        args.game_martial_upscore,
        args.game_martial_upscore_value,
    ),
    compute=ray.data.ActorPoolStrategy(
        size=args.num_data_workers,
        max_tasks_in_flight_per_actor=3,
    ),
    resources={
        "num-slots": int(1 / args.num_data_workers_per_node * 1000) / 1000
    },
    label_selector={
        "device-type": "cpu",
    }
)

# ...

# train_loop_func
batch_size = 128
train_ds = ray.train.get_dataset_shard("train")
train_dataloader = train_ds.iter_torch_batches(
   batch_size=batch_size,
   # local_shuffle_buffer_size=batch_size*5,
   prefetch_batches=20,
   drop_last=True,
)
train_iterator = iter(train_dataloader)
for batch in train_iterator:
    # ...

This issue is often caused by resource contention between Ray Train (training workers) and Ray Data (data preprocessing workers). If all available CPUs are allocated to training workers, Ray Data cannot schedule enough actors for preprocessing, resulting in low CPU utilization and only a few active Ray Data workers. The recommended practice is to leave some CPUs unallocated to training so Ray Data can use them for preprocessing, which improves throughput and keeps GPUs fed with data. For example, if each node has 8 CPUs, allocate only 6 per training worker and leave 2 for Ray Data operations (source, source).

Additionally, ensure that your Ray Data parallelism (number of partitions/blocks) is set high enough to utilize all available workers, but not so high that it exceeds available resources and causes excessive task queuing (source). Would you like more detail on how to tune these parameters or example code for your setup?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM I’m sure that every actor have enough resource. For example, I just allocate two actors on each 16c32g node.

If each node has 16 CPUs and 32GB RAM, and you only allocate two actors per node, each actor should have ample resources. However, if you still see only ~10 Ray Data actors active cluster-wide, the bottleneck may be due to Ray Data’s internal scheduling, backpressure from downstream consumers (e.g., training workers not consuming data fast enough), or insufficient parallelism in the data pipeline (e.g., too few data blocks or partitions) (source).

To diagnose further, check if the number of data blocks matches or exceeds the number of actors, and ensure downstream stages (like training) are not limiting throughput. Also, verify that Ray Data is not scaling down actors due to backpressure or idle time. Would you like step-by-step guidance on how to check and adjust these settings?

Sources:

Hint: Mention @RunLLM in the post for followups.

I have used override_num_blocks to reset the blocks. as len(file_paths)

is about 0.2M items, I should have many blocks, which is much larger than the num of actors.

@RunLLM

Setting override_num_blocks=int(len(file_paths) / 100) with 200,000 items should indeed create ~2,000 blocks, which is much more than the number of actors. If you still see only a small number of actors working, the issue is likely not with block count, but with Ray Data’s scheduling, backpressure, or downstream consumption rate (source, source).

Check if downstream operators (e.g., training or further data processing) are consuming data fast enough, as Ray Data may scale down active actors if there is backpressure or idle time. Would you like more detail on how to monitor and address backpressure or actor scaling in Ray Data?

Sources:

Hint: Mention @RunLLM in the post for followups.

In fact, there is a weird thing, some actor tasks have stucked and I don’t know why.

some actor task is still running even if my __call__ exit.

@RunLLM

If some Ray actor tasks remain running even after your __call__ method exits, it may be due to lingering background threads, unfinished async tasks, or Ray Data’s internal task scheduling/backpressure mechanisms. Actors can appear “stuck” if downstream consumers are slow, if there are resource deadlocks, or if the actor’s main thread is blocked by something other than your __call__ logic (source, source).

To debug, use the Ray Dashboard’s “Stack Trace” feature for the stuck actor to see where its threads are blocked. This can reveal if the actor is waiting on I/O, blocked on a queue, or stuck in a Ray internal call. Would you like step-by-step instructions on interpreting the stack trace or further debugging tips?

Sources:

Hint: Mention @RunLLM in the post for followups.

Process 912: ray::MapWorker(FlatMap(ObsDataReader)).submit
Python v3.9.19 (/root/miniconda3/bin/python3.9)

Thread 912 (idle): "MainThread"
    main_loop (ray/_private/worker.py:1042)
    <module> (ray/_private/workers/default_worker.py:322)
Thread 1122 (idle): "PythonGCThread"
    wait (threading.py:312)
    wait (threading.py:581)
    run (ray/_private/gc_collect_manager.py:34)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 1162 (idle): "ThreadPoolExecutor-0_0"
    _worker (concurrent/futures/thread.py:81)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 2547 (idle): "Thread-5"
    wait (threading.py:316)
    wait (threading.py:581)
    poll_uss (ray/data/_internal/util.py:1629)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)

@RunLLM this is my stacktrace

Your stack trace shows all threads are idle, with the main thread in main_loop of Ray’s worker, and no user code running. This suggests the actor is not actively processing a task and is likely waiting for new work or blocked by Ray Data’s internal scheduling or backpressure (source). The presence of poll_uss in a background thread is normal for Ray’s memory monitoring.

If your __call__ has exited and the actor remains in this state, it’s likely not a bug but a result of Ray’s actor lifecycle: the actor process stays alive, waiting for new tasks, unless explicitly terminated. If you see many such idle actors and suspect a leak or resource waste, consider using max_calls or explicit actor shutdown (source). Would you like more detail on actor lifecycle management or how to force actor cleanup?

Sources:

Hint: Mention @RunLLM in the post for followups.

Task execution event stats:

Global stats: 342 total (4 active)Queueing time: mean = 6212.52ms, max = 690507.08ms, min = 0.00ms, total = 2124681.08msExecution time: mean = 2077.62ms, total = 710546.74msEvent stats:CoreWorker.CheckSignal - 324 total (1 active), Execution time: mean = 0.00ms, total = 0.90ms, Queueing time: mean = 2192.96ms, max = 690507.08ms, min = 0.01ms, total = 710519.42msCoreWorker.HandlePushTaskActor - 9 total (1 active), Execution time: mean = 0.06ms, total = 0.51ms, Queueing time: mean = 134089.95ms, max = 690398.40ms, min = 0.01ms, total = 1206809.58msCoreWorker.ArgWaitComplete - 7 total (2 active, 1 running), Execution time: mean = 101436.99ms, total = 710058.95ms, Queueing time: mean = 29621.72ms, max = 202089.81ms, min = 0.01ms, total = 207352.06msPeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 0.01ms, total = 0.01ms, Queueing time: mean = 0.00ms, max = 0.00ms, min = 0.00ms, total = 0.00msCoreWorker.HandlePushTask - 1 total (0 active), Execution time: mean = 486.38ms, total = 486.38ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms

this is my actor system stats log @RunLLM

Your event stats show extremely high queueing times (mean >6s, max >11min) for task execution, especially in CoreWorker.CheckSignal and CoreWorker.HandlePushTaskActor. This indicates severe scheduling or resource contention: tasks are waiting a long time before execution, likely due to insufficient available resources, backpressure, or blocked dependencies (source).

The high ArgWaitComplete execution time suggests tasks are also waiting on arguments (possibly data dependencies or object store delays). This pattern is typical when the cluster is overloaded, or when upstream tasks are not producing data fast enough. Would you like more detail on how to diagnose resource bottlenecks or optimize task scheduling?

Sources:

Hint: Mention @RunLLM in the post for followups.

so how can I solve it? @RunLLM

To solve this, ensure your cluster has enough available resources (CPUs, memory, object store) for both Ray Data and Ray Train, and avoid over-allocating resources to training workers. Also, increase the number of Ray Data actors only if the cluster can support them, and repartition your dataset to match the desired parallelism (source, source).

Monitor the Ray Dashboard for resource utilization and backpressure. If queueing times remain high, reduce the number of concurrent tasks or actors, or increase cluster resources. Make sure downstream consumers (e.g., training) are not bottlenecked, and consider tuning batch sizes or prefetch settings. Would you like step-by-step instructions for resource tuning or dashboard monitoring?

Sources:

Hint: Mention @RunLLM in the post for followups.

so sad, hard to solve :frowning:

so maybe the reason is that my gpu object storage memory is full ?

@RunLLM

Yes, if your GPU object store memory (or overall node memory) is nearly full (e.g., 87.7% as shown), this can cause Ray tasks to queue, slow down, or even get stuck due to memory pressure and backpressure in the pipeline. When object store memory is full, Ray cannot create or fetch new objects efficiently, leading to high queueing times and idle actors (source, source).

To resolve this, reduce memory usage by lowering parallelism, increasing batch granularity, or scaling up your cluster with more memory. Also, monitor object store usage in the Ray Dashboard and consider enabling object spilling to disk if not already configured. Would you like more detail on configuring object spilling or memory management?

Sources:

Hint: Mention @RunLLM in the post for followups.

but my gpu trainer is actively consuming. How can I track the real object occupied in gpu worker node?

@RunLLM

To track real object store usage on a GPU worker node, use the Ray Dashboard’s “Object Store” tab or run ray memory on the node to see which objects are occupying memory and their sizes (source). For programmatic access, you can parse the output of ray memory or use Prometheus metrics like ray_object_store_memory if you have metrics enabled (source).

Would you like more detail on using these tools or example code for programmatic monitoring?

Sources:

Hint: Mention @RunLLM in the post for followups.