@Alex thanks for coming back to this. I am struggling a lot with running my workloads on the cloud, especially on multi-node clusters.
The custom machines I tried are custom-8-32768 (8 vCPUs, 32 GB memory) and the one mentioned in the topic, custom-16-53248 (16 vCPUs, 52 GB memory). The former does not throw an error with a single GPU, while the latter throws an error even with a single GPU; both throw an error when the GPU count is set larger than 1. The relevant node_config is sketched below.
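For reference, my node_config follows the pattern of Ray's example-gpu-docker.yaml; roughly like this (the project ID is a placeholder, not my exact file):

```yaml
node_config:
  machineType: custom-16-53248      # 16 vCPUs, 52 GB memory; errors even with 1 GPU
  guestAccelerators:
    - acceleratorType: projects/<project_id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
      acceleratorCount: 1           # raising this above 1 also errors on the custom-8 machine
  scheduling:
    - onHostMaintenance: TERMINATE  # GCP requires this for instances with GPUs
```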
I also tried using several GPU workers instead, but there I ran into major stability difficulties:
- It takes a very long time (sometimes up to an hour) to set up the worker nodes; they cycle between setting-up, launching, and waiting-for-ssh, while the head node comes up smoothly.
- Even two small neural networks (three conv layers and three FC layers each) train very slowly: after more than 8 hours I checked via TensorBoard and my two PPO runs had only reached 4k and 12k timesteps, and I had already paid >70€.
- Syncing to a bucket takes a very long time (and in RL it is important to keep many checkpoints and also videos to monitor learning).
- If nodes are preemptible, a run sometimes just errors out instead of automatically restarting on a new node (see the sketch after this list).
- Setting up a replacement node again takes forever whenever one is preempted.
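For context on the preemptible point above, my worker section follows Ray's GCP examples; a rough sketch (node type name and machine shape are illustrative, not my exact values):

```yaml
available_node_types:
  ray_worker_gpu:
    min_workers: 2
    node_config:
      machineType: custom-8-32768       # worker shape; illustrative
      scheduling:
        - preemptible: true             # workers run on preemptible VMs
          onHostMaintenance: TERMINATE  # required by GCP for GPU instances
```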
Altogether I have lost a lot of time and money so far, and I am a little reluctant to keep using Ray on clusters (I am also missing more guidelines on cloud computing and workload distribution).
I ran `ray up -y --verbose <my.yaml>` and this is the output I got:

```
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
Cluster: gpu-docker
2022-04-25 10:18:58,456 INFO util.py:335 -- setting max workers for head node type to 0
Checking GCP environment settings
2022-04-25 10:19:02,792 INFO config.py:485 -- _configure_key_pair: Private key not specified in config, using /home/simon/.ssh/ray-autoscaler_gcp_us-west1_forex-328615_ubuntu_0.pem
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Acquiring an up-to-date head node
Traceback (most recent call last):
File "/home/simon/git-projects/ray-experiments/.venv/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2269, in main
return cli()
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
return f(*args, **kwargs)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/scripts/scripts.py", line 1158, in up
create_or_update_cluster(
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 276, in create_or_update_cluster
get_or_create_head_node(
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 714, in get_or_create_head_node
provider.create_node(head_node_config, head_node_tags, 1)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 47, in method_with_retries
return method(self, *args, **kwargs)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 174, in create_node
resource.create_instances(base_config, labels, count)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 280, in create_instances
operations = [
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 281, in <listcomp>
self.create_instance(base_config, labels, wait_for_operation=False)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 511, in create_instance
self.resource.instances()
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
return wrapped(*args, **kwargs)
File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/googleapiclient/http.py", line 937, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/forex-328615/zones/us-west1-b/instances?alt=json returned "[1-8] vCpus can be used along with 1 accelerator cards of type 'nvidia-tesla-k80' in an instance.". Details: "[{'message': "[1-8] vCpus can be used along with 1 accelerator cards of type 'nvidia-tesla-k80' in an instance.", 'domain': 'global', 'reason': 'badRequest'}]">
```
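If I read that 400 correctly, GCP caps a single nvidia-tesla-k80 card at 8 vCPUs, which matches what I observed with the two machine types. So presumably a combination like the following would be accepted (a sketch inferred from the error message, not something I have verified):

```yaml
node_config:
  machineType: custom-8-32768       # 8 vCPUs stays within the [1-8] vCPU limit per K80 card
  guestAccelerators:
    - acceleratorType: projects/<project_id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
      acceleratorCount: 1           # presumably 16 vCPUs would require 2 cards
  scheduling:
    - onHostMaintenance: TERMINATE
```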
Thank you for taking a look, Alex. I really appreciate it.