Ray 2.20 now pulls a grpcio package that doesn't match its own requirement

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.20
  • Python version: 3.10
  • OS: Linux
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
    I have been training on Ray 2.20 with this YAML config for over a year now (we are still on the old ModelV2 API), and everything has been stable. I expected the workers to come up and join the cluster as usual.

  • Actual:
    This Monday (June 2nd) I tried, as usual, to kick off a new experiment on my existing Ray cluster from last week. All the workers finished their setup and then immediately went away as soon as they were ready to be added to the cluster. The log says the version of grpcio doesn't follow Ray's requirement.

    I also tried to spin up a new cluster on Ray 2.20, and similarly the head node no longer even shows up in the cluster dashboard. The EC2 instance is still running, though, and running ray down with this config still shuts everything down.

    Since nothing changed on my side, I looked into the grpc error and found that grpcio published a new release that my new nodes are picking up, and I suspect that is the issue. However, I couldn't solve it simply by requesting an older grpcio version: Ray seems to pull the new one in behind my back no matter what, and I couldn't get pip install ray --no-deps to work properly even after manually installing every dependency I could find (rough sketch of what I tried below).
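
    For reference, what I tried looked roughly like this (the grpcio version below is only a placeholder for the older releases I tested, not one I know Ray 2.20 accepts):

    # attempt 1: pin an older grpcio alongside ray (placeholder version)
    conda activate p310 && pip install -U "ray[all]==2.20.0" "grpcio==1.66.2"

    # attempt 2: install ray without its dependencies, then add them by hand
    conda activate p310 && pip install --no-deps "ray[all]==2.20.0"
    # ...followed by manually installing every dependency I could find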

Does this look like a Ray issue? Is there any other library that could be pulling it in?
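
In case it helps narrow this down, here is how I have been checking which installed packages declare a dependency on grpcio on a node (plain pip, nothing custom):

conda activate p310 && pip show grpcio
# the "Required-by:" line lists the installed packages that depend on grpcio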

Here is roughly the YAML I use:

cluster_name: hy1

max_workers: 50
upscaling_speed: 1.0
idle_timeout_minutes: 6

provider:
   type: aws
   region: us-west-1
   availability_zone: us-west-1a,us-west-1b
   cache_stopped_nodes: False

auth:
   ssh_user: ubuntu
   ssh_private_key: 


available_node_types:
   ray.head.default:
       node_config:
           InstanceType: m5.large
           ImageId: 
           KeyName: ray-cluster
           BlockDeviceMappings:
               - DeviceName: /dev/sda1
                 Ebs:
                     VolumeSize: 140
                     VolumeType: gp3
       resources: {"CPU": 0}

   16_cpu_learner:
       min_workers: 0
       max_workers: 1
       node_config:
           InstanceType: m7i.4xlarge
           ImageId: 
           KeyName: ray-cluster
           InstanceMarketOptions: {}

   48_cpu_worker:
       min_workers: 0
       max_workers: 20
       node_config:
           InstanceType: c7i.12xlarge
           ImageId: 
           KeyName: ray-cluster
           InstanceMarketOptions: {}


head_node_type: ray.head.default

cluster_synced_files: []

rsync_exclude:
   - "**/.git"
   - "**/.git/**"

rsync_filter:
   - ".gitignore"

initialization_commands:
   - sudo apt update
   - sudo apt install python3-pip -y
   - conda create -n p310 -y python=3.10

setup_commands:
   - conda activate p310 && pip install -U pyarrow==16.0.0 "ray[all]==2.20.0" boto3==1.36.8 func_timeout gymnasium numpy==1.26.4 torch==2.5.1 tensorboard pygame==2.5.2 dm_tree scikit-image lz4

head_setup_commands: []

worker_setup_commands: 
   - mkdir -p /home/ubuntu/workspace/game/Build/Linux/

head_start_ray_commands:
   - conda activate p310 && ray stop
   - conda activate p310 && ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 

worker_start_ray_commands:
   - conda activate p310 && ray stop
   - conda activate p310 && ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
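
For what it's worth, after a worker finishes the setup_commands above I confirm what actually ended up in the p310 env with plain version checks like these before the node tries to join:

conda activate p310
python -c "import ray; print(ray.__version__)"
python -c "import grpc; print(grpc.__version__)"   # grpcio's import name is grpc
pip show grpcio | head -n 2                        # Name and Version as pip sees them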

I'm also working on migrating our stack to Ray 2.40 with the new API, since at least the Ray dashboard still functions properly there. But I've hit a worker-hanging issue: a worker node appears to have finished its setup yet stays stuck in the 'setting up' phase, which leaves the whole cluster in a loop of adding and removing nodes due to autoscaling. Maybe something in worker_start_ray_commands no longer works properly? (This setup has been fine for the entire past year.) While debugging this I've been watching the autoscaler with the commands below.
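
These are just the standard Ray CLI commands and the default log location; the config file name is a placeholder for my 2.40 cluster YAML, so names and paths may differ on other setups:

# from my workstation: tail the autoscaler output for the 2.40 cluster
ray monitor my_240_cluster.yaml

# or, on the head node itself:
ray status                                         # autoscaler summary of active/pending nodes
tail -f /tmp/ray/session_latest/logs/monitor.out   # autoscaler log on the head node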