RayExecutor.start() hangs

Camazaqu · June 12, 2022, 8:19am

Hi, I'm using the image horovod/horovod-ray:0.24.3 on 3 machines, deployed with this:

version: "3.8"

services:
  ray-head:
    image: horovod/horovod-ray:0.24.3
    ports:
      - "6379:6379"
      - "8265:8265"
      - "10001:10001"
    command: bash -c "ray start --head --dashboard-port=8265 --port=6379 --dashboard-host=0.0.0.0 --redis-password=passwd --block"
    shm_size: 2g
    deploy:
      placement:
        constraints:
          - "node.role==manager"
      resources:
        limits:
          cpus: '1'
          memory: '2g'
    networks:
      - ray_net
  ray-worker:
    image: horovod/horovod-ray:0.24.3
    ports:
      - "9500:9500"
    depends_on:
      - ray-head
    command: bash -c "ray start --address=ray-head:6379 --redis-password=passwd --num-cpus=2 --block"
    shm_size: 2g
    deploy:
      replicas: 2
      placement:
        constraints:
          - "node.role==worker"
      resources:
        limits:
          cpus: '2'
          memory: '2g'
    networks:
      - ray_net

networks:
  ray_net:

It works properly until I call executor.start(), which then hangs.

Sometimes I get an error saying that the cluster has 5 CPUs but {CPU: 1, CPU: 1} was required.
Any idea? I had the same problem with the KubeRay example.

yic · June 13, 2022, 5:54pm

@amogkam could you take a look at this question? I’m not quite sure how horovod-ray works internally.

amogkam · June 16, 2022, 5:55pm

Hey @Camazaqu, would you be able to share the code that you are using, as well as the full error message?
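
Alongside the full error message, it may help to check whether each requested worker bundle fits on a single live node, since a cluster-wide total of 5 CPUs does not by itself mean that every bundle can be placed. Below is a minimal sketch of such a check. It uses a simple first-fit rule, which is an assumption for illustration and not Ray's actual scheduler; `place_bundles` and `live_node_cpus` are hypothetical helper names. `live_node_cpus` reads per-node totals from `ray.nodes()`, and the connection arguments needed for a password-protected cluster depend on the Ray version in the image.

```python
from typing import Dict, List, Optional


def place_bundles(
    node_cpus: Dict[str, float],
    bundles: List[Dict[str, float]],
) -> Optional[Dict[int, str]]:
    """Assign each CPU bundle to the first node with enough free CPUs.

    Returns a {bundle_index: node_name} mapping, or None if some bundle
    cannot be placed. This is a rough first-fit check, not Ray's scheduler.
    """
    free = dict(node_cpus)  # copy so the caller's dict is not modified
    placement: Dict[int, str] = {}
    for index, bundle in enumerate(bundles):
        needed = bundle.get("CPU", 0)
        for node, cpus in free.items():
            if cpus >= needed:
                free[node] = cpus - needed
                placement[index] = node
                break
        else:
            # No node had enough free CPUs for this bundle.
            return None
    return placement


def live_node_cpus() -> Dict[str, float]:
    """Read total CPUs per live node from a running Ray cluster.

    Assumes this runs on a machine that is part of the cluster; a
    password-protected cluster may need extra arguments to ray.init().
    """
    import ray  # imported here so place_bundles works without Ray installed

    ray.init(address="auto", ignore_reinit_error=True)
    return {
        node["NodeManagerAddress"]: node["Resources"].get("CPU", 0.0)
        for node in ray.nodes()
        if node["Alive"]
    }
```

With the limits from the deployment above (1 CPU on the head, 2 on each worker), `place_bundles({"head": 1, "w1": 2, "w2": 2}, [{"CPU": 1}, {"CPU": 1}])` finds a placement, so if the same check against `live_node_cpus()` fails, the nodes are probably not registering the CPUs the deployment suggests. Note that `ray.nodes()` reports totals, not what is currently free, so `ray.available_resources()` is worth comparing as well.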
