I am trying to deploy Ray to Kubernetes using a custom Docker image. I cannot use any of the Ray base Docker images for my use case, so I need to install Ray in my own image. This is complicated by the fact that I am using a custom Python interpreter wrapped in a shell script. Inside my container, I can run ray using this pattern:
/path/to/python.sh /path/to/python/bin/ray start|stop|etc.
My question is how I can use this pattern with a RayJob. I found that I can set the command for my worker and head nodes directly:
containers:
  - name: head
    image: my-custom-image
    command:
      [
        "/path/to/python.sh",
        "/path/to/python/bin/ray",
        "start",
        "--head",
      ]
...
containers:
  - name: worker
    image: my-custom-image
    command:
      [
        "/path/to/python.sh",
        "/path/to/python/bin/ray",
        "start",
        "--address='$(RAY_HEAD_SERVICE_HOST):6379'",
      ]
    lifecycle:
      preStop:
        exec:
          # sh -c reads the whole command line from a single string;
          # separate list items after it would only become $0, $1, ...
          command:
            [
              "/bin/sh",
              "-c",
              "/path/to/python.sh /path/to/python/bin/ray stop",
            ]
With this I am able to start the cluster, but when I inspect the logs for my head node, I see this:
2024-07-09 13:34:39,326 INFO scripts.py:767 -- Local node IP: 10.0.16.61
2024-07-09 13:34:43,261 SUCC scripts.py:804 -- --------------------
2024-07-09 13:34:43,261 SUCC scripts.py:805 -- Ray runtime started.
2024-07-09 13:34:43,261 SUCC scripts.py:806 -- --------------------
2024-07-09 13:34:43,261 INFO scripts.py:808 -- Next steps
2024-07-09 13:34:43,261 INFO scripts.py:811 -- To add another node to this Ray cluster, run
2024-07-09 13:34:43,261 INFO scripts.py:814 -- ray start --address='10.0.16.61:6379'
2024-07-09 13:34:43,261 INFO scripts.py:823 -- To connect to this Ray cluster:
2024-07-09 13:34:43,262 INFO scripts.py:825 -- import ray
2024-07-09 13:34:43,262 INFO scripts.py:826 -- ray.init()
2024-07-09 13:34:43,262 INFO scripts.py:838 -- To submit a Ray job using the Ray Jobs CLI:
2024-07-09 13:34:43,262 INFO scripts.py:839 -- RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
2024-07-09 13:34:43,262 INFO scripts.py:848 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
2024-07-09 13:34:43,262 INFO scripts.py:852 -- for more information on submitting Ray jobs to the Ray cluster.
2024-07-09 13:34:43,262 INFO scripts.py:857 -- To terminate the Ray runtime, run
2024-07-09 13:34:43,262 INFO scripts.py:858 -- ray stop
2024-07-09 13:34:43,262 INFO scripts.py:861 -- To view the status of the cluster, use
2024-07-09 13:34:43,262 INFO scripts.py:862 -- ray status
2024-07-09 13:34:43,262 INFO scripts.py:866 -- To monitor and debug Ray, view the dashboard at
2024-07-09 13:34:43,262 INFO scripts.py:867 -- 127.0.0.1:8265
2024-07-09 13:34:43,262 INFO scripts.py:874 -- If connection to the dashboard fails, check your firewall settings and network configuration.
/bin/bash: line 1: ray: command not found
The last line causes the container to fail. My guess is that my start and stop commands are set correctly, but something else in the container also invokes ray by its bare name and cannot find it on the PATH. Is there a way I can set the PATH in the RayJob configuration? Or do I need to modify my base Docker image in some way?
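One workaround I am considering, assuming the failing call really does resolve ray through the PATH (which the bash error suggests but I have not confirmed), is to place a small shim named ray on the PATH that forwards to the wrapped interpreter. A minimal sketch follows; write_ray_shim is a hypothetical helper name of mine, and the two default paths are just the placeholders used above.

```shell
# Sketch: write an executable `ray` shim into a directory that is on PATH.
# The shim forwards all arguments to the custom interpreter wrapper.
# write_ray_shim and its defaults are illustrative, not part of Ray or KubeRay.
write_ray_shim() {
    shim_dir="$1"                                  # directory to hold the shim
    python_wrapper="${2:-/path/to/python.sh}"      # placeholder path from the question
    ray_script="${3:-/path/to/python/bin/ray}"     # placeholder path from the question

    mkdir -p "$shim_dir"
    # Unquoted heredoc: the two paths are expanded now, "$@" is kept for run time.
    cat > "$shim_dir/ray" <<EOF
#!/bin/sh
exec "$python_wrapper" "$ray_script" "\$@"
EOF
    chmod +x "$shim_dir/ray"
}
```

Running something like write_ray_shim /usr/local/bin while building the image (the target directory is an assumption; any directory on the container's PATH should do) would let a bare ray call reach /path/to/python.sh /path/to/python/bin/ray with the same arguments.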