Memory not released to default levels: `ray::IDLE` processes not released

High CPU usage by ray::IDLE processes is often caused by underlying issues in Ray’s worker or event loop implementation, or by misreporting in monitoring tools. Sometimes, the ray::IDLE process is not truly idle but is mislabeled, or it may be busy with background tasks such as gRPC polling, event handling, or memory management. This can also occur if there are bugs or inefficiencies in Ray’s thread or process management, especially under heavy load or with certain Ray Serve configurations (Ray Discourse: ray::IDLE using 80% of CPU).

If you observe this, check for recent Ray issues or updates, and consider profiling the process using tools like py-spy or the Ray Dashboard’s CPU flame graph to see what the process is actually doing. Sometimes, this is a known bug and may be fixed in a newer Ray release (Ray Discourse: ray::IDLE using 80% of CPU).
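
For example, a minimal profiling sketch, assuming py-spy is installed on the node and you are allowed to attach to the process (12345 is a placeholder PID taken from ps or the Ray Dashboard):

# One-shot dump of every thread's current stack in the suspect process.
py-spy dump --pid 12345

# Live sampling view showing where the process actually spends CPU time.
py-spy top --pid 12345

If the stacks show gRPC polling, event-loop work, or memory management rather than user code, that matches the "busy background work" explanation above.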

@RunLLM What other processes, like metric collection or deployment pinning, might also consume CPU resources? Can we control their CPU usage with some constants or parameters? If so, provide a list of those parameters

It seems likely that Ray background processes such as metric collection, event aggregation, and deployment pinning can consume CPU resources, especially under heavy load or with frequent reporting. You can control their CPU usage with several Ray configuration parameters:

  • metrics_report_interval_ms: Sets how often metrics are exported (default: 10000 ms). Lowering the frequency reduces CPU usage (ray_config_def.h).
  • metrics_report_batch_size: Controls the batch size for metrics export (default: 10000). Smaller batches may increase CPU usage due to more frequent exports (ray_config_def.h).
  • enable_metrics_collection: Enables or disables metrics collection (default: true). Disabling it reduces CPU usage (ray_config_def.h).
  • enable_open_telemetry: Switches metrics backend to OpenTelemetry (default: false). May affect CPU usage depending on backend (ray_config_def.h).
  • io_context_event_loop_lag_collection_interval_ms: Sets the interval for monitoring event loop lag (default: 10000 ms) (ray_config_def.h).
  • RAY_SERVE_METRICS_EXPORT_INTERVAL_MS: Controls how often Ray Serve exports cached metrics (default: 100 ms) (constants.py).
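
As a quick illustration, these settings can also be overridden through environment variables set before the Ray processes start. The RAY_<option> prefix convention below is an assumption based on how ray_config_def.h options are commonly overridden, so verify it against your Ray version:

# Assumed RAY_<option> overrides for ray_config_def.h settings.
export RAY_enable_metrics_collection=0        # turn metrics collection off
export RAY_metrics_report_interval_ms=30000   # or report less frequently
ray start --head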

These parameters can be set via Ray’s --system-config or environment variables. Would you like more detail on how to set these in your deployment?

@RunLLM

These parameters can be set via Ray’s --system-config or environment variables. Would you like more detail on how to set these in your deployment?

Yes

Certainly! To set these parameters in your Ray deployment, you can use the --system-config flag when starting Ray, or set the corresponding environment variables before launching Ray processes.

1. Using --system-config (recommended for most parameters):

Add the --system-config flag to your ray start command, passing a JSON string with your desired settings. For example:

ray start --head --system-config='{
  "metrics_report_interval_ms": 30000,
  "metrics_report_batch_size": 5000,
  "enable_metrics_collection": false,
  "enable_open_telemetry": false,
  "io_context_event_loop_lag_collection_interval_ms": 60000
}'

(ray_config_def.h)

2. Using environment variables (for some Ray Serve and other parameters):

Set environment variables before starting Ray. For example, in your shell or Dockerfile:

export RAY_SERVE_METRICS_EXPORT_INTERVAL_MS=1000

(constants.py)
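
A related sketch: Serve controller and replica processes are launched by the raylet, so they generally inherit the environment that ray start was run with. Setting the variable inline on each node (shown here for the head) is therefore another option; adjust to your launch method:

# Make the variable visible to Serve processes started on this node.
RAY_SERVE_METRICS_EXPORT_INTERVAL_MS=1000 ray start --head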

3. In Ray cluster YAML (for autoscaler-managed clusters):

Add the system config to your head_start_ray_commands (and to worker_start_ray_commands, if you need it on worker nodes as well):

head_start_ray_commands:
  - ray stop
  - ray start --head --system-config='{"metrics_report_interval_ms":30000,"enable_metrics_collection":false}'
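
To roll the change out on an autoscaler-managed cluster, re-apply the YAML afterwards (a sketch; cluster.yaml is a placeholder filename, and re-running the start commands will restart Ray on the nodes):

# Apply the updated cluster config; -y skips the confirmation prompt.
ray up cluster.yaml -y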

Would you like a full YAML example or more details on a specific parameter?

@RunLLM How can I estimate the CPU usage of each deployment replica or remote function?

To estimate CPU usage of each deployment replica or remote function, use the Ray Dashboard’s Metrics view, which provides per-component CPU usage graphs for each Actor (including Serve replicas) and Task. You can also access these metrics programmatically via Prometheus by scraping the ray_component_cpu_percentage metric, which reports CPU usage per process/component (Ray observability docs, Ray metrics example).

For Serve deployments, filter by deployment or replica in the dashboard or Grafana. For remote functions, use the Actors view or query Prometheus for the relevant process. This allows you to identify which replica or function is consuming the most CPU over time.
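
As a quick check without Grafana, you can scrape a node's metrics endpoint directly. This sketch assumes the metrics export port is 8080 (set it explicitly with ray start --metrics-export-port=8080 if needed), and the label values in the query are illustrative:

# Dump the per-component CPU metric for processes on this node.
curl -s http://localhost:8080/metrics | grep ray_component_cpu_percentage

# In Prometheus/Grafana, a query along these lines narrows it to Serve replicas
# (the Component label values here are illustrative):
#   ray_component_cpu_percentage{Component=~"ray::ServeReplica.*"}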
