Multiple GPU head node on GCP

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hi guys,

I am running my RLlib experiments via tune.run() on GCP, using the autoscaler script provided in autoscaler/gcp (the one with GPU support). I run only on a single node (the head node), and what I would like to do is attach multiple GPUs to that head node.

However, setting resources: {"CPU": 16, "GPU": 2} under ray-head-gpu gives me the following error:

googleapiclient.errors.HttpError: 
<HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/forex-328615/zones/us-west1-b/instances?alt=json returned 
"[1-8] vCpus can be used along with 1 accelerator cards of type 'nvidia-tesla-k80' in an instance.". 
Details: "[{'message': "[1-8] vCpus can be used along with 1 accelerator cards of type 'nvidia-tesla-k80' in an instance.", 'domain': 'global', 'reason': 'badRequest'}]">

I use the custom machine type custom-16-53248.

Is it even possible to run the head node with multiple GPUs attached?

Best,
Simon

@Alex, do you have an idea?

@Lars_Simon_Zehnder do you mind sharing more details about the custom machine type you’re using?

In case this is a bug in Ray, just a few additional questions:

  1. Does it work if you don’t explicitly set the resources field? (Those will be autodetected anyway; I suspect it’s unrelated, but I want to double-check.)
  2. Do you mind rerunning ray up --verbose cluster.yaml, which may provide additional details?
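
One more thing worth double-checking: the resources field only tells Ray's scheduler what to expect; the number of GPUs GCE actually attaches is controlled by guestAccelerators in node_config. A sketch of what the head node type could look like for two K80s (node type name and machine type taken from this thread, project path is a placeholder; adapt to your config):

```yaml
available_node_types:
  ray-head-gpu:
    # What Ray's scheduler sees:
    resources: {"CPU": 16, "GPU": 2}
    node_config:
      machineType: custom-16-53248
      # What GCE actually attaches. For K80s, GCE caps vCPUs at 8 per
      # attached GPU, so a 16-vCPU machine needs acceleratorCount: 2.
      guestAccelerators:
        - acceleratorType: projects/<project-id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
          acceleratorCount: 2
      scheduling:
        # GPU instances cannot live-migrate, so this is required:
        - onHostMaintenance: TERMINATE
```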

@Alex thanks for coming back to this. I am struggling a lot with running my workloads in the cloud, especially on multi-node clusters.

The custom machines I tried were a custom-8-32768 (8 cores, 32GB memory) and the one mentioned in the topic, a custom-16-53248 (16 cores, 52GB memory). The former does not throw an error with a single GPU, while the latter throws an error even with a single GPU. Both throw an error when the GPU count is set larger than 1.
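
That pattern matches the 400 error, which quotes GCE's limit of 8 vCPUs per attached K80. Assuming the limit scales linearly with GPU count (as GCP's GPU docs describe), here is a throwaway helper to sanity-check a custom machine shape before launching (max_vcpus_for_k80 and check_machine are my own names, not part of Ray or gcloud):

```shell
#!/bin/sh
# Sanity-check a custom machine's vCPU count against GCE's documented
# per-GPU vCPU limit for nvidia-tesla-k80 (8 vCPUs per attached K80).
# Double-check the limit for your zone before relying on this.

max_vcpus_for_k80() {
  # usage: max_vcpus_for_k80 <gpu_count>  -> prints the max vCPUs allowed
  echo $(( $1 * 8 ))
}

check_machine() {
  # usage: check_machine <vcpus> <gpu_count>
  vcpus=$1; gpus=$2
  limit=$(max_vcpus_for_k80 "$gpus")
  if [ "$vcpus" -le "$limit" ]; then
    echo "ok: ${vcpus} vCPUs with ${gpus} K80(s)"
  else
    echo "invalid: ${vcpus} vCPUs needs at least $(( (vcpus + 7) / 8 )) K80(s)"
  fi
}

check_machine 8 1    # custom-8-* with one K80 -> ok
check_machine 16 1   # custom-16-* with one K80 -> invalid (the 400 error above)
check_machine 16 2   # custom-16-* with two K80s -> ok
```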

I also tried to use several GPU workers instead, but ran into major stability difficulties:

  • It takes a very long time (sometimes up to an hour) to set up the worker nodes (they switch between setting-up, launching, and waiting-for-ssh; the head node comes up smoothly)
  • Even two neural networks with 3 conv layers and 3 FC layers run very slowly (after >8h I checked via TensorBoard and had 4k and 12k timesteps with PPO, respectively; I had already paid >70€)
  • Syncing to a bucket takes a very long time (and in RL it is important to keep many checkpoints and also videos to control for learning)
  • If nodes are preemptible, a run sometimes just errors out and does not automatically restart with a new node
  • Setting up a new node again takes forever when another one was preempted

Altogether I have lost a lot of time and money so far, and I am a little reluctant to keep using Ray on clusters (I am also missing more guidelines regarding cloud computing and workload distribution).

I ran ray up -y --verbose <my.yaml> and this is the output I got:

Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
Cluster: gpu-docker

2022-04-25 10:18:58,456	INFO util.py:335 -- setting max workers for head node type to 0
Checking GCP environment settings
2022-04-25 10:19:02,792	INFO config.py:485 -- _configure_key_pair: Private key not specified in config, using/home/simon/.ssh/ray-autoscaler_gcp_us-west1_forex-328615_ubuntu_0.pem
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Acquiring an up-to-date head node
Traceback (most recent call last):
  File "/home/simon/git-projects/ray-experiments/.venv/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2269, in main
    return cli()
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/scripts/scripts.py", line 1158, in up
    create_or_update_cluster(
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 276, in create_or_update_cluster
    get_or_create_head_node(
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 714, in get_or_create_head_node
    provider.create_node(head_node_config, head_node_tags, 1)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 47, in method_with_retries
    return method(self, *args, **kwargs)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 174, in create_node
    resource.create_instances(base_config, labels, count)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 280, in create_instances
    operations = [
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 281, in <listcomp>
    self.create_instance(base_config, labels, wait_for_operation=False)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 511, in create_instance
    self.resource.instances()
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/simon/git-projects/ray-experiments/.venv/lib/python3.9/site-packages/googleapiclient/http.py", line 937, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/forex-328615/zones/us-west1-b/instances?alt=json returned "[1-8] vCpus can be used along with 1 accelerator cards of type 'nvidia-tesla-k80' in an instance.". Details: "[{'message': "[1-8] vCpus can be used along with 1 accelerator cards of type 'nvidia-tesla-k80' in an instance.", 'domain': 'global', 'reason': 'badRequest'}]">

Thank you for taking a look, Alex. I really appreciate this.