Logging in to GCP to pull a custom Docker image

Hey there! I’m trying to set up a custom cluster on GCP with a head node and a GPU worker node.

I have a custom container image hosted in GCP’s Artifact Registry that I’ve configured as the worker node image. Unfortunately, pulling and running it requires logging in and authenticating with GCP first.

So what I’ve tried is adding a docker login to the setup_commands, using the command outlined here.
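
For reference, the service-account login flow from the Artifact Registry docs looks roughly like this (with my region and key file name filled in):

cat ray_autoscaler.json | docker login -u _json_key --password-stdin https://us-central1-docker.pkg.dev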

Here’s my config:

# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-cluster

# The maximum number of worker nodes to launch in addition to the head node.
max_workers: 1

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# Cloud-provider specific configuration.
provider:
  type: gcp
  region: us-west1
  availability_zone: us-west1-b
  project_id: research-410912
  use_internal_ips: false

docker:
  head_image: "rayproject/ray:latest" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
  worker_image: us-central1-docker.pkg.dev/research-410912/benchmark-worker/benchmark-worker:latest
  # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
  container_name: "ray_container"
  # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
  # if no cached version is present.
  pull_before_run: True

idle_timeout_minutes: 5

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
  ray_head_default:
    # The resources provided by this node type.
    resources: { "CPU": 0, "GPU": 0 }
    # Provider-specific config for the head node, e.g. instance type. By default
    # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
    # For more documentation on available fields, see:
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
    node_config:
      machineType: n1-standard-2
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 50
            # See https://cloud.google.com/compute/docs/images for more images
            sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu

      # Additional options can be found in the compute docs at
      # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

      # If the network interface is specified as below in both head and worker
      # nodes, the manual network config is used.  Otherwise an existing subnet is
      # used.  To use a shared subnet, ask the subnet owner to grant permission
      # for 'compute.subnetworks.use' to the ray autoscaler account...
      # networkInterfaces:
      #   - kind: compute#networkInterface
      #     subnetwork: path/to/subnet
      #     aliasIpRanges: []
  ray_worker_small:
    # The minimum number of worker nodes of this type to launch.
    # This number should be >= 0.
    min_workers: 1
    # The maximum number of worker nodes of this type to launch.
    # This takes precedence over min_workers.
    max_workers: 2
    # The resources provided by this node type.
    resources: { "CPU": 24, "GPU": 2 }
    # Provider-specific config for worker nodes of this type, e.g. instance type. By default
    # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
    # For more documentation on available fields, see:
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
    node_config:
      machineType: a2-highgpu-2g
      metadata:
        items:
          - key: install-nvidia-driver
            value: "True"
      disks:
        - boot: true
          autoDelete: true
          type: PERSISTENT
          initializeParams:
            diskSizeGb: 250
            # See https://cloud.google.com/compute/docs/images for more images
            sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-gpu
      # Run workers on preemptible instances by default.
      # Comment this out to use on-demand.
      scheduling:
        - preemptible: true
      # Launch workers with the Service Account of the Head Node.
      serviceAccounts:
        - email: ray-autoscaler-sa-v1@research-410912.iam.gserviceaccount.com
          scopes:
            - https://www.googleapis.com/auth/cloud-platform

  # Additional options can be found in the compute docs at
  # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default

file_mounts: { "/key": "./" }

setup_commands:
  - cat /key/ray_autoscaler.json | sudo docker login -u _json_key --password-stdin https://us-central1-docker.pkg.dev
  # - docker login -u _json_key --password-stdin https://us-central1-docker.pkg.dev < /key/ray_autoscaler.json

head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0


Unfortunately, running that login command fails with bash: docker: command not found.

I’ve tried cat /key/ray_autoscaler.json and docker info as separate commands in the setup_commands list, and I can confirm that both work just fine; they output the expected info.
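
Concretely, a setup_commands snippet like this (just the diagnostics; a sketch, not the full config) runs without complaint:

setup_commands:
  - cat /key/ray_autoscaler.json
  - docker info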

However, it’s when running that third command, the docker login itself, that all hell breaks loose. I’ve tried redirecting the key file with < /key/ray_autoscaler.json instead, but then it can’t find the file…

Any help would be appreciated. I’ve been banging my head against this for a few hours now, and I’d really rather not host this container in a public repository.