Ray / gRPC Ambiguous Error Message

I’m not sure this is what fixed the problem, but I prepended `export AUTOSCALER_MAX_NUM_FAILURES=inf;` to my Ray head start commands in the raycluster.yaml file (as suggested here: Cluster Deployment Guide — Ray 1.12.0), and after that my job finished successfully. I also upgraded Ray from 1.11.0 to 1.12.0 at the same time, so I can’t say which of the two changes made the difference.
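For anyone trying the same thing, here is a rough sketch of what the change looks like. The key name `headStartRayCommands` and the `ray start` flags are assumptions based on a typical Ray operator raycluster.yaml, not a copy of my actual file; keep whatever your own file already has and add only the `export` prefix.

```yaml
# Fragment of raycluster.yaml (assumed layout, not a complete file).
headStartRayCommands:
  - ray stop
  # The only change is the `export AUTOSCALER_MAX_NUM_FAILURES=inf;` prefix.
  # The rest of the line is a placeholder for your existing start command.
  - export AUTOSCALER_MAX_NUM_FAILURES=inf; ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0
```

The `export` has to sit on the same command line as `ray start` (or otherwise run in the same shell) so that the variable is set in the environment of the process Ray launches.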

The Ray operator logs for my last failed job, the one that ended with this gRPC error message, contained autoscaler failure / error messages, so there’s some reason to believe this setting was the fix.