We are running Ray on AWS EC2 (Ubuntu Linux) and trying to integrate CloudWatch into our setup. However, after updating our cluster YAML and running ray up, the Ray head node is unable to bring up any worker nodes. The workers remain stuck in a pending state with the message: “waiting-for-ssh”.
Interestingly, as soon as we remove the CloudWatch configuration from the provider section and rerun ray up, everything connects without any issues.
Has anyone encountered this issue before? We’d love to hear from those who have successfully implemented CloudWatch in their Ray cluster. Any insights or solutions would be greatly appreciated. Thanks!
Since it connects fine once CloudWatch is removed, I’m guessing the problem is in the integration between CloudWatch <> Ray <> AWS?
Do the IAM roles attached to your Ray nodes have the permissions needed to talk to the CloudWatch service? Mostly the create/write permissions that allow nodes to send logs and metrics to CloudWatch.
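To make the IAM question concrete, here is a rough sketch of how you could sanity-check a policy document for the write permissions mentioned above. The list of required actions is my assumption of a reasonable minimum (they are real IAM action names, but your agent config may need more or fewer), and `missing_actions` is just a hypothetical helper, not part of Ray or boto3. It only looks at Allow statements and ignores Deny, NotAction, conditions, and resource scoping, so treat a clean result as a hint rather than proof.

```python
from fnmatch import fnmatchcase

# Assumed minimum set of actions for shipping logs and metrics to CloudWatch;
# adjust this to whatever your CloudWatch agent configuration actually uses.
REQUIRED_ACTIONS = [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "cloudwatch:PutMetricData",
]


def _as_list(value):
    """IAM accepts a single string or a list for Action; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def missing_actions(policy_document, required=REQUIRED_ACTIONS):
    """Return the required actions that no Allow statement in the policy covers."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as a dict
        statements = [statements]

    allowed_patterns = []
    for stmt in statements:
        if stmt.get("Effect") == "Allow":
            allowed_patterns.extend(_as_list(stmt.get("Action", [])))

    # IAM action names are case-insensitive and may contain * wildcards.
    return [
        action
        for action in required
        if not any(
            fnmatchcase(action.lower(), pattern.lower())
            for pattern in allowed_patterns
        )
    ]


# Made-up example policy, only for illustration.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["logs:Create*", "logs:PutLogEvents"],
            "Resource": "*",
        },
    ],
}

print(missing_actions(example_policy))  # → ['cloudwatch:PutMetricData']
```

You would paste in the JSON of the policy attached to the role your head and worker nodes use, and see whether anything comes back as missing.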
Are there any network or firewall restrictions that would block traffic to CloudWatch?
Does the security group used by your Ray nodes allow SSH access, and is the key pair working?
Are there any error messages or is it just stuck with waiting-for-ssh?
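For the SSH question, a quick way to separate "port 22 is blocked" from "SSH is reachable but something else is wrong" is a plain TCP check run from the head node. This is only a sketch: `ssh_port_open` is a hypothetical helper name, and the IP address in the usage comment is a placeholder for one of your pending workers.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (placeholder private IP of a pending worker):
# print(ssh_port_open("10.0.1.23"))
```

If this returns False, I would look at the security group rules first. If it returns True, the port is reachable and the problem is more likely the key pair, the SSH user, or something in the node setup. Note that it only tests TCP reachability, not whether the key is accepted.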
Also, is there any code we can use to reproduce this issue? Can you paste your updated YAML config (make sure you redact any sensitive info, though)?
Here are a few other folks who have run into this issue (albeit not with CloudWatch specifically); maybe they can help you debug. Sorry I wasn’t more help!