Distributing a Docker image.tar to worker nodes as a Ray job

  • High: Need this to work to proceed further.

Hi Ray Community,
I am just getting started with Ray for a project and have run into the following requirement.

Only the head node has internet and network access; all the worker nodes are connected locally to the head node and have no internet access, so I cannot docker pull on the worker nodes. Instead, I want to distribute the Docker image.tar file to all the worker nodes and run the equivalent of docker load < image.tar locally on each worker node using the Docker Python SDK.
Copying the tar file to disk on the worker nodes first is not an option due to storage limitations on the workers.
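For reference, the Docker SDK equivalent of docker load < image.tar that I intend to run on each worker looks roughly like this (a minimal standalone sketch; image.tar stands in for the real tar path):

import docker

client = docker.from_env()
with open('image.tar', 'rb') as f:
    # images.load() takes the image tar as binary data and returns the loaded Image objects.
    images = client.images.load(f.read())
print([image.tags for image in images])
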
Head node: Ubuntu 20.04 (x86_64)
Worker nodes: Ubuntu 18.04 (aarch64)
My attempt with just one worker node:
I tried putting the image.tar file handle into the object store on the head node and passing the ObjectRef to the actor.
worker_prep_job.py

import ray
import docker

@ray.remote(num_cpus=1)
class Workerprep:
    def __init__(self):
        # Docker client for the local Docker daemon on the node where this actor runs.
        self.dockerClient = docker.from_env()

    def loadDockerImage(self, img):
        # Equivalent of 'docker load': images.load() expects the image tar as binary data.
        self.dockerClient.images.load(img)

    def startDockerContainer(self):
        # arm64v8/python matches the aarch64 worker nodes.
        self.dockerClient.containers.run(image='arm64v8/python:latest',
                                         auto_remove=True,
                                         detach=True,
                                         name='target',
                                         stdin_open=True,
                                         tty=True
                                         )

ray_master.py

import ray
import worker_prep_job as job1

def main():
    ray.init()
    print(ray.cluster_resources())
    worker_job = job1.Workerprep.remote()
    with open('/home/pca/clusterServer/python.tar', 'rb') as f:
        # Put the open file handle itself into the object store and hand the ObjectRef to the actor.
        image_ref = ray.put(f)
        ray.get(worker_job.loadDockerImage.remote(image_ref))
    ray.get(worker_job.startDockerContainer.remote())

if __name__ == "__main__":
    main()

console output:

root@pca2:~# /home/pca/raydev/bin/python /home/pca/clusterServer/ray_master.py
2023-08-15 18:25:23,664 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.0.2.15:6379...
2023-08-15 18:25:23,675 INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
{'CPU': 6.0, 'node:192.168.0.69': 1.0, 'memory': 6057658369.0, 'object_store_memory': 2849485209.0, 'node:10.0.2.15': 1.0, 'node:__internal_head__': 1.0}
Traceback (most recent call last):
  File "/home/pca/clusterServer/ray_master.py", line 13, in <module>
    main()
  File "/home/pca/clusterServer/ray_master.py", line 10, in main
    ray.get(worker_job.loadDockerImage.remote(image_ref))
  File "/home/pca/raydev/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/pca/raydev/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/pca/raydev/lib/python3.8/site-packages/ray/_private/worker.py", line 2493, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::Workerprep.loadDockerImage() (pid=7680, ip=192.168.0.69, actor_id=85ac855724338050fd3fe3240f000000, repr=<worker_prep_job.Workerprep object at 0x7f7c72aca0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0f00000001e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*0f000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.2.15) for more information about the Python worker failure.

Any help is appreciated. Thanks in advance.

What happens if you don't import Workerprep from job1, but instead define the class directly in ray_master.py? Are you still able to reproduce the issue?
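In other words, something like the sketch below, reusing the paths from your scripts and keeping everything else unchanged so the only difference is where the class is defined:

import ray
import docker

@ray.remote(num_cpus=1)
class Workerprep:
    # Same actor as before, but defined in ray_master.py itself instead of
    # being imported from worker_prep_job.
    def __init__(self):
        self.dockerClient = docker.from_env()

    def loadDockerImage(self, img):
        self.dockerClient.images.load(img)

def main():
    ray.init()
    worker_job = Workerprep.remote()
    with open('/home/pca/clusterServer/python.tar', 'rb') as f:
        image_ref = ray.put(f)  # unchanged from your script, to isolate the import question
        ray.get(worker_job.loadDockerImage.remote(image_ref))

if __name__ == "__main__":
    main()

If the OwnerDiedError goes away with this version, that would suggest the problem is related to how the worker_prep_job module is resolved on the cluster rather than to the object itself.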