Error while trying to connect to Ray cluster from Docker

Mrityunjoy_Saha · July 10, 2021, 7:20pm

My example-full.yaml file:

cluster_name: default

docker:
    image: "rayproject/ray-ml:latest-gpu"
    container_name: "ray_container"
    disable_shm_size_detection: True
    pull_before_run: True
    run_options: []

provider:
    type: local
    head_ip: <host>
    worker_ips: [<host>:<port>]
   
auth:
    ssh_user: mrityunjoysaha
    # ssh_private_key: ~/.ssh/id_rsa

min_workers: 0

max_workers: 0
upscaling_speed: 1.0

idle_timeout_minutes: 5

file_mounts: {
}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: [conda install -c anaconda python=3.8]

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379

Then I run:

ray up example-full.yaml

My output:

Local node IP: <host>
2021-07-08 07:39:31,089	INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='<host>:6379' --redis-password='xxxx'
  
  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='xxxx')
  
  If connection fails, check your firewall settings and network configuration.
  
  To terminate the Ray runtime, run
    ray stop
Shared connection to <host> closed.
2021-07-08 20:09:32,239	INFO node_provider.py:101 -- ClusterState: Writing cluster state: ['<host>:<port>', '<host>']
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec /home/mrityunjoysaha/mrityunjoy/ray_cluster_testing/example-full.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach /home/mrityunjoysaha/mrityunjoy/ray_cluster_testing/example-full.yaml
  Get a remote shell to the cluster manually:
    ssh -tt -o IdentitiesOnly=yes mrityunjoysaha@<host> docker exec -it ray_container /bin/bash

Then I ran a Python script written with FastAPI, running in Docker, which starts like this:

@app.on_event("startup")
async def startup_event():
    ray.init(address='<host>:6379', _redis_password='xxxx')
    global client
    client = serve.start()

But it fails with the error below:

  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 526, in lifespan
    async for item in self.lifespan_context(app):
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 467, in default_lifespan
    await self.startup()
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 502, in startup
    await handler()
  File "./master.py", line 84, in startup_event
    ray.init(address='<host>:6379' , _redis_password='xxxx')
  File "/usr/local/lib/python3.8/dist-packages/ray/worker.py", line 759, in init
    _global_node = ray.node.Node(
  File "/usr/local/lib/python3.8/dist-packages/ray/node.py", line 176, in __init__
    ray._private.services.get_address_info_from_redis(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/services.py", line 287, in get_address_info_from_redis
    return get_address_info_from_redis_helper(
  File "/usr/local/lib/python3.8/dist-packages/ray/_private/services.py", line 246, in get_address_info_from_redis_helper
    client_table = global_state.node_table()
  File "/usr/local/lib/python3.8/dist-packages/ray/state.py", line 323, in node_table
    node_info["Resources"] = self.node_resource_table(
  File "/usr/local/lib/python3.8/dist-packages/ray/state.py", line 284, in node_resource_table
    node_id = ray.NodeID(hex_to_binary(node_id))
  File "python/ray/includes/unique_ids.pxi", line 207, in ray._raylet.NodeID.__init__
  File "python/ray/includes/unique_ids.pxi", line 33, in ray._raylet.check_id
ValueError: ID string needs to have length 20

Can anyone please suggest what I might be doing wrong? Thanks in advance.

Dmitri · July 15, 2021, 4:34am

Looks like some Ray internals are involved.
Does this make sense to you, @sangcho?

sangcho · July 15, 2021, 5:31pm

Hmm, this is probably some kind of version mismatch. In the latest versions the ID string needs to have length 28, not 20.
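
One way to check for this is to print the Ray and Python versions in both places: inside the container that runs the FastAPI script, and inside ray_container on the head node. A minimal sketch, assuming Ray was installed with pip or conda so that its package metadata is visible; `describe_environment` is a hypothetical helper name, not part of Ray:

```python
import importlib.metadata as md
import platform


def describe_environment(package="ray"):
    """Return the Python version and the installed version of `package`.

    The package version is None if the package is not installed in this
    environment. `describe_environment` is an illustrative helper, not a Ray API.
    """
    try:
        package_version = md.version(package)
    except md.PackageNotFoundError:
        package_version = None
    return {"python": platform.python_version(), package: package_version}


if __name__ == "__main__":
    # Run this once in the client container and once on the head node,
    # then compare the two printed dictionaries.
    print(describe_environment())
```

If the two outputs show different Ray versions, that would be consistent with the ID length error above.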

sangcho · July 15, 2021, 5:31pm

@Dmitri do you have any guess as to what could make this possible?

Dmitri · July 15, 2021, 7:00pm

I think some details on how the script was deployed would help us find where a version mismatch could have crept in.
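
For example, once the version details from the script's container and from the cluster are both known, they can be compared side by side. A rough sketch; `find_mismatches` is a hypothetical helper, and the version strings in the usage comment are placeholders, not values from this thread:

```python
def find_mismatches(client_env, cluster_env):
    """Return {key: (client_value, cluster_value)} for every key that differs.

    Both arguments are plain dictionaries such as {"python": "...", "ray": "..."}.
    A key missing on one side is reported with None for that side.
    """
    keys = sorted(set(client_env) | set(cluster_env))
    return {
        key: (client_env.get(key), cluster_env.get(key))
        for key in keys
        if client_env.get(key) != cluster_env.get(key)
    }


# Usage with placeholder values (not taken from this thread):
# find_mismatches({"python": "3.8.10", "ray": "A"}, {"python": "3.8.10", "ray": "B"})
```

An empty result would suggest the mismatch is somewhere else, for example in a stale Ray process left over on the host.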
