1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.20
- Python version: 3.10
- OS: Linux
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
Expected:
I have been training on Ray 2.20 with this YAML config for over a year now (we are still on the old ModelV2 API) and everything has been stable.
Actual:
This Monday (June 2nd) I tried, as usual, to kick off a new experiment on my existing Ray cluster from last week. All the workers finished their setup and then immediately went away, right at the point where they should have been added to the cluster. The logs say "The version of grpcio doesn’t follow Ray’s requirement". I then tried to spin up a fresh cluster on Ray 2.20, and similarly the head node no longer even shows up in the cluster dashboard, although the EC2 instance is still running and I can still run ray down with the YAML config to shut everything down.
Since nothing on my side has changed, I looked into this grpcio error and found that grpcio released a new version that my new nodes are picking up, and I suspect that is the issue. However, I couldn’t solve it just by pinning an older grpcio version; Ray seems to pull it in behind the scenes no matter what, and I couldn’t get pip install ray --no-deps to work properly even after manually installing every dependency I could find.
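For reference, this is roughly the kind of pin I tried in setup_commands (grpcio 1.62.2 is just an example of an older release I picked, not a version I have confirmed Ray 2.20 actually wants):

setup_commands:
    # install an older grpcio first, hoping pip's default "only-if-needed" upgrade strategy leaves it alone
    - conda activate p310 && pip install "grpcio==1.62.2"
    - conda activate p310 && pip install -U pyarrow==16.0.0 "ray[all]==2.20.0" boto3==1.36.8 func_timeout gymnasium numpy==1.26.4 torch==2.5.1 tensorboard pygame==2.5.2 dm_tree scikit-image lz4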
Does this look like a Ray issue? Is there another library that’s pulling it in?
Here is roughly the YAML I use:
cluster_name: hy1
max_workers: 50
upscaling_speed: 1.0
idle_timeout_minutes: 6
provider:
    type: aws
    region: us-west-1
    availability_zone: us-west-1a,us-west-1b
    cache_stopped_nodes: False
auth:
    ssh_user: ubuntu
    ssh_private_key:
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
            ImageId:
            KeyName: ray-cluster
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 140
                      VolumeType: gp3
        resources: {"CPU": 0}
    16_cpu_learner:
        min_workers: 0
        max_workers: 1
        node_config:
            InstanceType: m7i.4xlarge
            ImageId:
            KeyName: ray-cluster
            InstanceMarketOptions: {}
    48_cpu_worker:
        min_workers: 0
        max_workers: 20
        node_config:
            InstanceType: c7i.12xlarge
            ImageId:
            KeyName: ray-cluster
            InstanceMarketOptions: {}
head_node_type: ray.head.default
cluster_synced_files: []
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands:
    - sudo apt update
    - sudo apt install python3-pip -y
    - conda create -n p310 -y python=3.10
setup_commands:
    - conda activate p310 && pip install -U pyarrow==16.0.0 "ray[all]==2.20.0" boto3==1.36.8 func_timeout gymnasium numpy==1.26.4 torch==2.5.1 tensorboard pygame==2.5.2 dm_tree scikit-image lz4
head_setup_commands: []
worker_setup_commands:
    - mkdir -p /home/ubuntu/workspace/game/Build/Linux/
head_start_ray_commands:
    - conda activate p310 && ray stop
    - conda activate p310 && ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
worker_start_ray_commands:
    - conda activate p310 && ray stop
    - conda activate p310 && ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
I’m also working on migrating our stack to Ray 2.40 with the new API (at least the Ray dashboard still works properly there). But I’m hitting a worker hanging issue: a worker node appears to have finished its setup yet stays stuck in the ‘setting up’ phase, so the whole cluster ends up in a loop of adding and removing nodes due to autoscaling. Maybe something in worker_start_ray_commands no longer works properly? (This setup has been fine for the entire past year.)
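In the meantime, for the 2.40 hang, this is what I’ve been watching on the head node to see the add/remove loop (assuming the default Ray log layout under /tmp/ray):

# autoscaler's view of pending/failed nodes
ray status
# autoscaler monitor log, where node launch/setup errors show up
tail -f /tmp/ray/session_latest/logs/monitor.log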