If each node has 16 CPUs and 32GB RAM, and you only allocate two actors per node, each actor should have ample resources. However, if you still see only ~10 Ray Data actors active cluster-wide, the bottleneck may be due to Ray Data’s internal scheduling, backpressure from downstream consumers (e.g., training workers not consuming data fast enough), or insufficient parallelism in the data pipeline (e.g., too few data blocks or partitions) (source).
To diagnose further, check if the number of data blocks matches or exceeds the number of actors, and ensure downstream stages (like training) are not limiting throughput. Also, verify that Ray Data is not scaling down actors due to backpressure or idle time. Would you like step-by-step guidance on how to check and adjust these settings?
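As a starting point for the block-count check, here is a rough sketch of the arithmetic in plain Python. It assumes you have already collected the inputs yourself, for example the block count from `ds.materialize().num_blocks()` and the CPU totals from `ray.cluster_resources()`. The helper `estimate_active_actors`, the 8-node cluster size, and the 8-CPUs-per-actor setting are hypothetical values chosen for illustration, not anything Ray provides or that your cluster necessarily uses.

```python
from dataclasses import dataclass


@dataclass
class PoolEstimate:
    usable_actors: int    # upper bound on actors that can be busy at once
    limiting_factor: str  # which cap produced that bound


def estimate_active_actors(num_nodes, cpus_per_node, cpus_per_actor,
                           num_blocks, requested_actors):
    """Rough upper bound on concurrently busy actors (hypothetical helper).

    An actor can only be busy if it has at least one block to process,
    so the block count caps the number of active actors regardless of
    how many CPUs the cluster has.
    """
    caps = {
        "blocks": num_blocks,
        "cpus": num_nodes * int(cpus_per_node // cpus_per_actor),
        "pool_size": requested_actors,
    }
    limiting = min(caps, key=caps.get)  # smallest cap wins
    return PoolEstimate(usable_actors=caps[limiting], limiting_factor=limiting)


# Assumed scenario: 8 nodes x 16 CPUs, two actors per node (8 CPUs each),
# a requested pool of 16 actors, but only 10 blocks in the dataset.
est = estimate_active_actors(num_nodes=8, cpus_per_node=16, cpus_per_actor=8,
                             num_blocks=10, requested_actors=16)
print(est.usable_actors, est.limiting_factor)  # 10 blocks
```

If the limiting factor comes out as `blocks`, increasing the block count (for instance with `ds.repartition(n)`) is worth trying before looking at backpressure. If it comes out as `cpus` or `pool_size` and you still see fewer active actors than the bound, downstream backpressure becomes the more likely explanation.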