Using Ray/RLlib on an HPC

What’s the protocol for using Ray/RLlib on an HPC?

I don’t need different workers spun up by the HPC talking to each other. Instead, I am just planning to run my script many times using resources requisitioned from the cluster, but each run should be a completely independent experiment using, say, 8 CPU cores and a GPU.

My understanding is that once SLURM finds me the resources I ask for, running a command inside that allocation behaves as if it were a single computer. Is that incorrect?

I got the following error when logged directly into one of the GPU nodes and attempting to run code that works on a personal computer. Also note that in the ray.init call I already have include_dashboard=False:

/fs01/home/aadharna/mapo/v2_rllib/c4_splitnet_shaped_selfplay.py:50: DeprecationWarning: The module `ray.air.callbacks.wandb` has been moved to `ray.air.integrations.wandb` and the old location will be deprecated soon. Please adjust your imports to point to the new location. Example: Do a global search and replace `ray.air.callbacks.wandb` with `ray.air.integrations.wandb`.
  from ray.air.callbacks.wandb import WandbLoggerCallback
2023-08-23 16:55:23,137 ERROR services.py:1169 -- Failed to start the dashboard
2023-08-23 16:55:23,717 ERROR services.py:1194 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-08-23 16:55:23,896 ERROR services.py:1238 --
The last 20 lines of /tmp/ray/session_2023-08-23_16-54-26_533740_24277/logs/dashboard.log (it contains the error message from the dashboard):
    from opencensus.common.transports import sync
  File "/h/aadharna/.local/lib/python3.9/site-packages/opencensus/common/transports/sync.py", line 16, in <module>
    from opencensus.trace import execution_context
  File "/h/aadharna/.local/lib/python3.9/site-packages/opencensus/trace/__init__.py", line 15, in <module>
    from opencensus.trace.span import Span
  File "/h/aadharna/.local/lib/python3.9/site-packages/opencensus/trace/span.py", line 32, in <module>
    from opencensus.trace import status as status_module
  File "/h/aadharna/.local/lib/python3.9/site-packages/opencensus/trace/status.py", line 15, in <module>
    from google.rpc import code_pb2
  File "/pkgs/anaconda39/lib/python3.9/site-packages/google/rpc/code_pb2.py", line 47, in <module>
    _descriptor.EnumValueDescriptor(
  File "/h/aadharna/condaenvs/pytorch2/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
2023-08-23 16:55:49,325 INFO worker.py:1553 -- Started a local Ray instance.
[2023-08-23 16:56:13,568 E 24277 6563] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:  .Please see `raylet.out` for more details.

As the error says, you need to correct the installed protobuf version. The current Ray release, for example, requires protobuf==3.19.6, as stated in its requirements files.
If you run a command on a node that is managed by SLURM, by default the Ray cluster will consist of that node and nothing else. In this case, an RLlib Algorithm or Tune experiment running on that node will not interfere with other nodes.

You can, however, attach other nodes to the same cluster. As soon as that happens, you don’t have control over when/where workers are scheduled (unless you take additional steps around defining custom resources).
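
For the single-node case described above, here is a minimal sketch of what each job's script might look like (the 8 CPUs / 1 GPU are just the numbers from the question; adjust them to whatever the sbatch request asks for):

    import ray

    # Pin Ray to exactly the resources SLURM allocated to this job, so the
    # local cluster never tries to look beyond the current node.
    ray.init(
        num_cpus=8,
        num_gpus=1,
        include_dashboard=False,  # no dashboard needed for batch jobs
    )

    # ... build and run the RLlib Algorithm / Tune experiment here ...

    ray.shutdown()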

So if I just submit a single sbatch file to SLURM for each experiment, then they’ll all be disjoint, and the ray.init call in each script will only see the resources that SLURM has given that particular job. That’s great to know.

I just did a manual pip install -U protobuf==3.19.6, but that error still persists, and when I check pip list it’s still showing protobuf above that version.
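
One thing that might help here (a guess based on the paths in the traceback, which mix ~/.local, the system Anaconda, and a conda env): check which protobuf Python actually imports, since a user-site or system install can shadow the version you just pinned:

    # Print the protobuf version and file path that Python actually picks up.
    # If the path points at ~/.local or the system Anaconda rather than your
    # conda env, that install is shadowing the pinned 3.19.6.
    import google.protobuf
    print(google.protobuf.__version__)
    print(google.protobuf.__file__)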

I’ll reach out to the cluster people and see how they recommend I fix my environment. Once everything is up and running, I’ll close this Discuss post.

Thanks!

Just to make sure: even if you are working on a cluster, you need to make sure that the Python environment on every node is set up appropriately. If you have two nodes in a Ray cluster and one has protobuf==3.19.6 installed while the other has protobuf==3.21.0 installed, one of them will error out.
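
This isn't needed for single-node runs, but if you ever do attach several nodes, a quick sanity check along these lines (just a sketch; it assumes the multi-node cluster is already running and the driver connects with address="auto") can confirm that every node imports the same protobuf version:

    import ray

    ray.init(address="auto", include_dashboard=False)

    @ray.remote
    def protobuf_version():
        # Runs on whichever node Ray schedules it; report that node's
        # hostname together with the protobuf version it imports.
        import socket
        import google.protobuf
        return socket.gethostname(), google.protobuf.__version__

    # Launch a handful of tasks so they spread across the nodes.
    print(set(ray.get([protobuf_version.remote() for _ in range(16)])))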

The plan is for every Ray cluster to be independent, and for all of them to use the same Python environment once that's set up correctly. I'm not planning to do any multi-node things; each Ray cluster will be a single node with whatever resources SLURM has provided for that run.

But I’ve passed that along in my talks with the cluster support people and will keep it in mind.

Okay. So, after updating to the latest version available from pip, we think it's the dashboard that's causing issues.

However, you can see here that when I call ray.init, I am passing in include_dashboard=False.

What is the correct way to handle this and how can I turn this off?

[screenshot of the ray.init call showing include_dashboard=False]

From one of the admins: “If the dashboard is trying to bind to an IP/port on the compute node, then chances are it will not be allowed since you do not have root privileges.”