[Roadmap] Ray Q3 2025

Hello everyone! :waving_hand: I’m excited to share what we have planned for Q3 2025 for Ray. I’ll keep this post updated as features get merged and rolled out.

Goal: Deliver foundational reliability, performance, and DX improvements across Ray Core, Data, Train, LLM, Serve, RL, Observability, Technical Content, and KubeRay.

Ray Core

Reliability & Fault Tolerance

  • Improve system stability under node and network failures, including making RPCs tolerant to transient errors

  • Add robust support for preemptible instances

Scheduling & Performance

  • Introduce label-based scheduling for finer-grained resource control

  • Implement GPU objects with RDMA transfer support for high-performance GPU data handling

Developer Experience

  • Introduce ActorMesh for simplified interaction with groups of actors

  • Improve static typing across the codebase to enhance developer productivity

  • Address outstanding technical debt in core worker components

Ecosystem Integrations

  • Provide official support for reinforcement learning libraries like veRL, OpenRLHF, and ROLL

Ray Data

Reliability

  • Ensure workloads complete successfully despite cluster failures

Performance

  • Enhance training ingest pipelines with advanced sampling and caching support

Connectors

Usability

  • Schema UDFs

  • Enhanced internal query planning

Ray Train

API

  • Finalize Train v2 API

Performance

  • Implement asynchronous checkpointing

LLM

Goal: Run large models (e.g., DeepSeek) at scale via vLLM on Ray Serve:

  • Prefill disaggregation

  • Large-scale data parallelism (DP)

  • Custom request routing

  • Elastic expert parallelism

Performance & Efficiency

  • Implement prefill disaggregation to optimize performance for large-context models

  • Develop an intelligent, KV cache-aware router with a pluggable architecture

  • Implement data-parallel (DP) attention within Ray Serve

Operations

  • Publish updated performance benchmarks

Ecosystem

  • Support SkyRL for reinforcement learning from human feedback (RLHF) workloads

Ray Serve

Serving Flexibility

  • Custom auto‑scaling and routing patterns

  • Async inference support

  • MCP server patterns

  • Integrate label-based scheduling

Observability

  • Enhanced tracing support

RLlib

  • Ray RL V2 stack general availability (GA)

  • Algorithm composability enhancements

Observability

API Release

  • Public launch of unified event export API

Optimization

  • Refactor internals to leverage new export API

Technical Content

  • New technical templates

  • More examples & deep‑dives

KubeRay

Upgrades

  • Productionize the incremental upgrade feature for seamless cluster updates

Hardware Support

  • Streamline support for diverse accelerators, including multiple GPU types, Dynamic Resource Allocation (DRA), and MIG

Autoscaling

  • Continue to improve the functionality and reliability of Autoscaler V2

We love hearing from the community! If there’s a feature you’d like to see in Ray, let us know by filing a feature request or commenting here. We also have a roadmap discussion on GitHub if you prefer to chat there.
Thank you for supporting Ray!
