Crash with grpcio error, but version matches Ray's requirement

I'm running jobs on a Slurm Ray cluster, and they crash randomly. The exact same jobs have also finished successfully.
pip freeze | grep grpcio
pip freeze | grep ray

I was getting the same error with grpcio==1.59.0 as well. I then read the requirements and downgraded the version, but the problem still happens, although it feels less frequent.

When they do crash I get the following error:

(raylet) The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent/424238335.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here


Never mind. It looks like it was an OOM issue on some of the nodes and had nothing to do with grpcio.
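
For anyone who lands here with the same raylet message, here is a minimal sketch of one way to check a node for OOM kills by scanning kernel log text. The function name `find_oom_lines` and the patterns are assumptions for illustration: the patterns are based on typical Linux OOM-killer messages, and the exact wording varies between kernel versions. It is not taken from Ray or Slurm.

```python
import re

# Assumed patterns: common wording of Linux OOM-killer messages.
# Exact wording differs between kernel versions, so adjust as needed.
OOM_PATTERN = re.compile(r"out of memory|oom-kill|invoked oom-killer", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines that look like OOM-killer messages (hypothetical helper)."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Example usage on a node (reading the kernel log may require elevated privileges):
#
#   import subprocess
#   out = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
#   for line in find_oom_lines(out.splitlines()):
#       print(line)
```

If this prints matches with timestamps close to the crash, memory pressure on that node is a more likely cause than the grpcio version.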