1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.20
- Python version: 3.10
- OS: Linux
- Cloud/Infrastructure: AWS
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
Expected:
I have been training on Ray 2.20 with this YAML config for over a year now (we are still on the old ModelV2 API) and everything has been stable.
Actual:
This Monday (June 2nd) I tried, as usual, to kick off a new experiment on my existing Ray cluster from last week. All the workers finished their setup and then immediately went away, right at the point where they should have been added to the cluster. The logs say "The version of grpcio doesn’t follow Ray’s requirement". I then tried to spin up a fresh cluster on Ray 2.20, and similarly the head node no longer even shows up in the cluster dashboard, although the EC2 instance is still running and I can still run ray down with the YAML config to shut everything down.
Since nothing on my side has changed, I looked into this grpcio error and found that grpcio released a new version that my new nodes are picking up, and I suspect that is the issue. However, I couldn’t solve it just by pinning an older grpcio version; Ray seems to pull it in behind the scenes no matter what, and I couldn’t get pip install ray --no-deps to work properly even after manually installing every dependency I could find.
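For reference, this is roughly the kind of pin I tried in setup_commands (grpcio 1.62.2 is just an example of an older release I picked, not a version I have confirmed Ray 2.20 actually wants):

setup_commands:
    # install an older grpcio first, hoping pip's default "only-if-needed" upgrade strategy leaves it alone
    - conda activate p310 && pip install "grpcio==1.62.2"
    - conda activate p310 && pip install -U pyarrow==16.0.0 "ray[all]==2.20.0" boto3==1.36.8 func_timeout gymnasium numpy==1.26.4 torch==2.5.1 tensorboard pygame==2.5.2 dm_tree scikit-image lz4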
Does this look like a Ray issue? Is there another library that’s pulling it in?
Here is roughly the YAML I use:
cluster_name: hy1
max_workers: 50
upscaling_speed: 1.0
idle_timeout_minutes: 6
provider:
    type: aws
    region: us-west-1
    availability_zone: us-west-1a,us-west-1b
    cache_stopped_nodes: False
auth:
    ssh_user: ubuntu
    ssh_private_key:
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
            ImageId:
            KeyName: ray-cluster
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 140
                      VolumeType: gp3
        resources: {"CPU": 0}
    16_cpu_learner:
        min_workers: 0
        max_workers: 1
        node_config:
            InstanceType: m7i.4xlarge
            ImageId:
            KeyName: ray-cluster
            InstanceMarketOptions: {}
    48_cpu_worker:
        min_workers: 0
        max_workers: 20
        node_config:
            InstanceType: c7i.12xlarge
            ImageId:
            KeyName: ray-cluster
            InstanceMarketOptions: {}
head_node_type: ray.head.default
cluster_synced_files: []
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands:
    - sudo apt update
    - sudo apt install python3-pip -y
    - conda create -n p310 -y python=3.10
setup_commands:
    - conda activate p310 && pip install -U pyarrow==16.0.0 "ray[all]==2.20.0" boto3==1.36.8 func_timeout gymnasium numpy==1.26.4 torch==2.5.1 tensorboard pygame==2.5.2 dm_tree scikit-image lz4
head_setup_commands: []
worker_setup_commands:
    - mkdir -p /home/ubuntu/workspace/game/Build/Linux/
head_start_ray_commands:
    - conda activate p310 && ray stop
    - conda activate p310 && ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
worker_start_ray_commands:
    - conda activate p310 && ray stop
    - conda activate p310 && ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
I’m also working on migrating our stack to Ray 2.40 with the new API (at least the Ray dashboard still works properly there). But I’m hitting a worker hanging issue: a worker node appears to have finished its setup yet stays stuck in the ‘setting up’ phase, so the whole cluster ends up in a loop of adding and removing nodes due to autoscaling. Maybe something in worker_start_ray_commands no longer works properly? (This setup has been fine for the entire past year.)
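In the meantime, for the 2.40 hang, this is what I’ve been watching on the head node to see the add/remove loop (assuming the default Ray log layout under /tmp/ray):

# autoscaler's view of pending/failed nodes
ray status
# autoscaler monitor log, where node launch/setup errors show up
tail -f /tmp/ray/session_latest/logs/monitor.log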