Yes. If your GPU object store memory (or overall node memory) is nearly full (e.g., 87.7% as shown), Ray tasks can queue, slow down, or even get stuck because of memory pressure and backpressure in the pipeline. When object store memory is full, Ray has to wait for space to be freed or spilled before it can create or fetch new objects, which leads to high queueing times and idle actors (see Sources below).
To resolve this, reduce memory usage by lowering parallelism, processing data in finer-grained (smaller) batches, or scaling up your cluster with more memory. Also monitor object store usage in the Ray Dashboard. Object spilling to disk is enabled by default in recent Ray versions, so check that the spill directory sits on a disk with enough free space and throughput, and point it elsewhere if not. Would you like more detail on configuring object spilling or memory management?
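As a starting point, the two suggestions above (lower parallelism, control where objects spill) could be sketched roughly as follows. This is a minimal sketch, not a tuned setup: the helper names `spilling_init_kwargs` and `run_bounded`, the default `max_in_flight=4`, and the `/mnt/spill` path in the usage comment are all hypothetical. It assumes you start Ray locally with `ray.init()`; `_system_config` only takes effect when a new cluster is started, not when connecting to an existing one.

```python
import json


def spilling_init_kwargs(spill_dir, object_store_bytes=None):
    """Build keyword arguments for ray.init() that spill objects to spill_dir."""
    kwargs = {
        "_system_config": {
            # Ray expects the spilling config as a JSON string.
            "object_spilling_config": json.dumps(
                {"type": "filesystem", "params": {"directory_path": spill_dir}}
            )
        }
    }
    if object_store_bytes is not None:
        # Optional explicit cap on the object store size, in bytes.
        kwargs["object_store_memory"] = object_store_bytes
    return kwargs


def run_bounded(remote_fn, items, max_in_flight=4):
    """Run remote_fn over items with at most max_in_flight tasks pending.

    Results come back in completion order, not input order.
    """
    import ray  # imported lazily so the helper above works without Ray installed

    in_flight, results = [], []
    for item in items:
        if len(in_flight) >= max_in_flight:
            # Block until one task finishes before submitting another,
            # so pending results cannot pile up in the object store.
            done, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(done))
        in_flight.append(remote_fn.remote(item))
    results.extend(ray.get(in_flight))
    return results


# Hypothetical usage (path and sizes are placeholders):
#   import ray
#   ray.init(**spilling_init_kwargs("/mnt/spill", 4 * 1024**3))
#   outputs = run_bounded(my_remote_fn, my_items, max_in_flight=8)
```

Capping in-flight tasks trades some throughput for a bounded number of pending results, which is usually the quickest way to check whether object store pressure is what is causing the queueing.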
Sources:
- https://github.com/ray-project/ray/issues/62093
- https://discuss.ray.io/t/memory-ram-not-being-released-by-ray/7213