Memory not released to default levels: `ray::IDLE` processes not released

High CPU usage by ray::IDLE processes is often caused by underlying issues in Ray’s worker or event loop implementation, or by misreporting in monitoring tools. Sometimes, the ray::IDLE process is not truly idle but is mislabeled, or it may be busy with background tasks such as gRPC polling, event handling, or memory management. This can also occur if there are bugs or inefficiencies in Ray’s thread or process management, especially under heavy load or with certain Ray Serve configurations (Ray Discourse: ray::IDLE using 80% of CPU).

If you observe this, check for recent Ray issues or updates, and consider profiling the process using tools like py-spy or the Ray Dashboard’s CPU flame graph to see what the process is actually doing. Sometimes, this is a known bug and may be fixed in a newer Ray release (Ray Discourse: ray::IDLE using 80% of CPU).
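
For example, a minimal profiling sketch, assuming py-spy is installed on the node and you are allowed to attach to the process (12345 is a placeholder PID taken from ps or the Ray Dashboard):

# One-shot dump of every thread's current stack in the suspect process.
py-spy dump --pid 12345

# Live sampling view showing where the process actually spends CPU time.
py-spy top --pid 12345

If the stacks show gRPC polling, event-loop work, or memory management rather than user code, that matches the "busy background work" explanation above.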

@RunLLM What other processes, like metric collection or deployment pinning, might also consume CPU resources? Can we control their CPU usage with some constants or parameters? If so, provide a list of those parameters

It seems likely that Ray background processes such as metric collection, event aggregation, and deployment pinning can consume CPU resources, especially under heavy load or with frequent reporting. You can control their CPU usage with several Ray configuration parameters:

  • metrics_report_interval_ms: Sets how often metrics are exported (default: 10000 ms). Lowering the frequency reduces CPU usage (ray_config_def.h).
  • metrics_report_batch_size: Controls the batch size for metrics export (default: 10000). Smaller batches may increase CPU usage due to more frequent exports (ray_config_def.h).
  • enable_metrics_collection: Enables or disables metrics collection (default: true). Disabling it reduces CPU usage (ray_config_def.h).
  • enable_open_telemetry: Switches metrics backend to OpenTelemetry (default: false). May affect CPU usage depending on backend (ray_config_def.h).
  • io_context_event_loop_lag_collection_interval_ms: Sets the interval for monitoring event loop lag (default: 10000 ms) (ray_config_def.h).
  • RAY_SERVE_METRICS_EXPORT_INTERVAL_MS: Controls how often Ray Serve exports cached metrics (default: 100 ms) (constants.py).
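
As a quick illustration, these settings can also be overridden through environment variables set before the Ray processes start. The RAY_<option> prefix convention below is an assumption based on how ray_config_def.h options are commonly overridden, so verify it against your Ray version:

# Assumed RAY_<option> overrides for ray_config_def.h settings.
export RAY_enable_metrics_collection=0        # turn metrics collection off
export RAY_metrics_report_interval_ms=30000   # or report less frequently
ray start --head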

These parameters can be set via Ray’s --system-config or environment variables. Would you like more detail on how to set these in your deployment?

@RunLLM

These parameters can be set via Ray’s --system-config or environment variables. Would you like more detail on how to set these in your deployment?

Yes

Certainly! To set these parameters in your Ray deployment, you can use the --system-config flag when starting Ray, or set the corresponding environment variables before launching Ray processes.

1. Using --system-config (recommended for most parameters):

Add the --system-config flag to your ray start command, passing a JSON string with your desired settings. For example:

ray start --head --system-config='{
  "metrics_report_interval_ms": 30000,
  "metrics_report_batch_size": 5000,
  "enable_metrics_collection": false,
  "enable_open_telemetry": false,
  "io_context_event_loop_lag_collection_interval_ms": 60000
}'

(ray_config_def.h)

2. Using environment variables (for some Ray Serve and other parameters):

Set environment variables before starting Ray. For example, in your shell or Dockerfile:

export RAY_SERVE_METRICS_EXPORT_INTERVAL_MS=1000

(constants.py)
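
A related sketch: Serve controller and replica processes are launched by the raylet, so they generally inherit the environment that ray start was run with. Setting the variable inline on each node (shown here for the head) is therefore another option; adjust to your launch method:

# Make the variable visible to Serve processes started on this node.
RAY_SERVE_METRICS_EXPORT_INTERVAL_MS=1000 ray start --head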

3. In Ray cluster YAML (for autoscaler-managed clusters):

Add the system config to your head_start_ray_commands (and to worker_start_ray_commands, if you need it on worker nodes as well):

head_start_ray_commands:
  - ray stop
  - ray start --head --system-config='{"metrics_report_interval_ms":30000,"enable_metrics_collection":false}'
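
To roll the change out on an autoscaler-managed cluster, re-apply the YAML afterwards (a sketch; cluster.yaml is a placeholder filename, and re-running the start commands will restart Ray on the nodes):

# Apply the updated cluster config; -y skips the confirmation prompt.
ray up cluster.yaml -y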

Would you like a full YAML example or more details on a specific parameter?

@RunLLM How can I estimate the CPU usage of each deployment replica or remote function?

To estimate CPU usage of each deployment replica or remote function, use the Ray Dashboard’s Metrics view, which provides per-component CPU usage graphs for each Actor (including Serve replicas) and Task. You can also access these metrics programmatically via Prometheus by scraping the ray_component_cpu_percentage metric, which reports CPU usage per process/component (Ray observability docs, Ray metrics example).

For Serve deployments, filter by deployment or replica in the dashboard or Grafana. For remote functions, use the Actors view or query Prometheus for the relevant process. This allows you to identify which replica or function is consuming the most CPU over time.
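
As a quick check without Grafana, you can scrape a node's metrics endpoint directly. This sketch assumes the metrics export port is 8080 (set it explicitly with ray start --metrics-export-port=8080 if needed), and the label values in the query are illustrative:

# Dump the per-component CPU metric for processes on this node.
curl -s http://localhost:8080/metrics | grep ray_component_cpu_percentage

# In Prometheus/Grafana, a query along these lines narrows it to Serve replicas
# (the Component label values here are illustrative):
#   ray_component_cpu_percentage{Component=~"ray::ServeReplica.*"}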
