What’s the protocol for using Ray/RLlib on an HPC cluster?
I don’t need different workers spun up by the HPC talking to each other. Instead, I just plan to run my script many times using resources requisitioned from the cluster, with each run being a completely independent experiment using, say, 8 CPU cores and a GPU.
It is my understanding that once Slurm allocates the resources I ask for, running a command inside that allocation is effectively like running on a single computer. Is that incorrect?
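For context, the Ray setup inside a single allocation looks roughly like this (a sketch; include_dashboard=False is what I actually pass, and the CPU/GPU counts simply mirror the 8 cores and 1 GPU I request per job):

```python
import ray

# One self-contained Ray instance per Slurm job; no multi-node cluster.
# Resource counts mirror the per-job Slurm request (8 CPUs, 1 GPU).
ray.init(
    include_dashboard=False,
    num_cpus=8,
    num_gpus=1,
)
```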
I got the following error when logged directly into one of the GPU nodes and attempting to run code that works on my personal computer. Note that, as shown in the sketch above, the ray.init call already has include_dashboard=False:
/fs01/home/aadharna/mapo/v2_rllib/c4_splitnet_shaped_selfplay.py:50: DeprecationWarning: The module `ray.air.callbacks.wandb` has been moved to `ray.air.integrations.wandb` and the old location will be deprecated soon. Please adjust your imports to point to the new location. Example: Do a global search and replace `ray.air.callbacks.wandb` with `ray.air.integrations.wandb`.
from ray.air.callbacks.wandb import WandbLoggerCallback
2023-08-23 16:55:23,137 ERROR services.py:1169 -- Failed to start the dashboard
2023-08-23 16:55:23,717 ERROR services.py:1194 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-08-23 16:55:23,896 ERROR services.py:1238 --
The last 20 lines of /tmp/ray/session_2023-08-23_16-54-26_533740_24277/logs/dashboard.log (it contains the error message from the dashboard):
from opencensus.common.transports import sync
File "/h/aadharna/.local/lib/python3.9/site-packages/opencensus/common/transports/sync.py", line 16, in <module>
from opencensus.trace import execution_context
File "/h/aadharna/.local/lib/python3.9/site-packages/opencensus/trace/__init__.py", line 15, in <module>
from opencensus.trace.span import Span
File "/h/aadharna/.local/lib/python3.9/site-packages/opencensus/trace/span.py", line 32, in <module>
from opencensus.trace import status as status_module
File "/h/aadharna/.local/lib/python3.9/site-packages/opencensus/trace/status.py", line 15, in <module>
from google.rpc import code_pb2
File "/pkgs/anaconda39/lib/python3.9/site-packages/google/rpc/code_pb2.py", line 47, in <module>
_descriptor.EnumValueDescriptor(
File "/h/aadharna/condaenvs/pytorch2/google/protobuf/descriptor.py", line 796, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
2023-08-23 16:55:49,325 INFO worker.py:1553 -- Started a local Ray instance.
[2023-08-23 16:56:13,568 E 24277 6563] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: .Please see `raylet.out` for more details.
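In case it’s relevant, I understand workaround 2 from the protobuf message above to mean setting the environment variable before anything imports protobuf-generated modules, roughly like this (a sketch; I have not yet verified it on the cluster):

```python
import os

# Per workaround 2 in the error above: force the pure-Python protobuf
# implementation before Ray (and its opencensus/google.rpc dependencies)
# are imported. Slower, but sidesteps the "generated code is out of date"
# TypeError from the dashboard process.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

import ray  # must be imported after the env var is set

ray.init(include_dashboard=False, num_cpus=8, num_gpus=1)
```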