My example-full.yaml file:
cluster_name: default
docker:
    image: "rayproject/ray-ml:latest-gpu"
    container_name: "ray_container"
    disable_shm_size_detection: True
    pull_before_run: True
    run_options: []
provider:
    type: local
    head_ip: <host>
    worker_ips: [<host>:<port>]
auth:
    ssh_user: mrityunjoysaha
    # ssh_private_key: ~/.ssh/id_rsa
min_workers: 0
max_workers: 0
upscaling_speed: 1.0
idle_timeout_minutes: 5
file_mounts: {
}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
setup_commands: [conda install -c anaconda python=3.8]
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
Then I run:
ray up example-full.yaml
The output:
Local node IP: <host>
2021-07-08 07:39:31,089 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
--------------------
Ray runtime started.
--------------------
Next steps
To connect to this Ray runtime from another node, run
ray start --address='<host>:6379' --redis-password='xxxx'
Alternatively, use the following Python code:
import ray
ray.init(address='auto', _redis_password='xxxx')
If connection fails, check your firewall settings and network configuration.
To terminate the Ray runtime, run
ray stop
Shared connection to <host> closed.
2021-07-08 20:09:32,239 INFO node_provider.py:101 -- ClusterState: Writing cluster state: ['<host>:<port>', '<host>']
New status: up-to-date
Useful commands
Monitor autoscaling with
ray exec /home/mrityunjoysaha/mrityunjoy/ray_cluster_testing/example-full.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Connect to a terminal on the cluster head:
ray attach /home/mrityunjoysaha/mrityunjoy/ray_cluster_testing/example-full.yaml
Get a remote shell to the cluster manually:
ssh -tt -o IdentitiesOnly=yes mrityunjoysaha@<host> docker exec -it ray_container /bin/bash
Then I ran a Python script, written with FastAPI and running in Docker, which starts like this:
@app.on_event("startup")
async def startup_event():
    ray.init(address='<host>:6379', _redis_password='xxxx')
    global client
    client = serve.start()
But it fails with the following error:
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 526, in lifespan
    async for item in self.lifespan_context(app):
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 467, in default_lifespan
    await self.startup()
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 502, in startup
    await handler()
  File "./master.py", line 84, in startup_event
    ray.init(address='<host>:6379' , _redis_password='xxxx')
  File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 759, in init
    _global_node = ray.node.Node(
  File "/usr/local/lib/python3.8/dist-packages/ray/node.py", line 176, in __init__
    ray._private.services.get_address_info_from_redis(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/services.py", line 287, in get_address_info_from_redis
    return get_address_info_from_redis_helper(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/services.py", line 246, in get_address_info_from_redis_helper
    client_table = global_state.node_table()
  File "/usr/local/lib/python3.8/dist-packages/ray/state.py", line 323, in node_table
    node_info["Resources"] = self.node_resource_table(
  File "/usr/local/lib/python3.8/dist-packages/ray/state.py", line 284, in node_resource_table
    node_id = ray.NodeID(hex_to_binary(node_id))
  File "python/ray/includes/unique_ids.pxi", line 207, in ray._raylet.NodeID.__init__
  File "python/ray/includes/unique_ids.pxi", line 33, in ray._raylet.check_id
ValueError: ID string needs to have length 20
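The traceback fails while parsing a node ID read back from the cluster, so one thing that may be worth ruling out is a mismatch between the Ray (or Python) version in the FastAPI container and the one inside rayproject/ray-ml:latest-gpu. That is only an assumption; nothing in the output above confirms it. A minimal sketch for collecting the versions on each side is below. The helper names environment_report and versions_match are hypothetical, not part of Ray.

```python
import platform


def environment_report():
    """Return the Python and Ray versions of the current environment."""
    try:
        import ray  # imported lazily so the script also runs where Ray is missing
        ray_version = ray.__version__
    except ImportError:
        ray_version = None
    return {"python": platform.python_version(), "ray": ray_version}


def versions_match(a, b):
    """Compare two reports: same Python major.minor and identical Ray version."""
    same_python = a["python"].split(".")[:2] == b["python"].split(".")[:2]
    return same_python and a["ray"] == b["ray"]


if __name__ == "__main__":
    # Run once in the FastAPI container and once inside ray_container,
    # then compare the two printed reports by hand or with versions_match.
    print(environment_report())
```

Running this in both containers and comparing the two dictionaries would show whether the client and the cluster agree on versions; if they do, the cause lies elsewhere.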
Can anyone please suggest what I might be doing wrong? Thanks in advance.