Ray 2.3.1 (with [default])
I’m trying to launch a simple cluster with 1 worker I can ssh to:
cluster_name: default
provider:
    type: local
    head_ip: box1
    worker_ips: [box2]
auth:
    ssh_user:
min_workers: 1
upscaling_speed: 1.0
idle_timeout_minutes: 1.0
file_mounts: {}
cluster_synced_files:
file_mounts_sync_continuously: False
initialization_commands:
setup_commands:
head_setup_commands:
worker_setup_commands:
head_start_ray_commands:
    - source venv/bin/activate && ray stop && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --include-dashboard 1 --dashboard-host 0.0.0.0
worker_start_ray_commands:
    - date > /home//RAY_DID_A_THING # just to check if anything ran on the worker
    - source venv/bin/activate && ray stop && ray start --address=$RAY_HEAD_IP:6379
Note that venv is accessible on both machines.
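Since the cluster launcher reaches the worker over SSH, one thing worth ruling out is whether box1 can reach box2 non-interactively. Below is a minimal sketch of such a check, assuming key-based SSH is the intended setup. `build_ssh_check` and `can_ssh` are hypothetical helper names (not part of Ray), and `myuser` is a placeholder for whatever `ssh_user` is set to.

```python
import subprocess
from typing import List


def build_ssh_check(user: str, host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that fails fast instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt; fail if key auth is unavailable
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on an unreachable host
        f"{user}@{host}",
        "true",                               # no-op remote command
    ]


def can_ssh(user: str, host: str, timeout_s: int = 5) -> bool:
    """Return True if the no-op command ran successfully on the remote host."""
    try:
        result = subprocess.run(
            build_ssh_check(user, host, timeout_s),
            capture_output=True,
            timeout=timeout_s + 5,  # hard upper bound so the check itself cannot hang
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # "myuser" is a placeholder; box2 is the worker from the config above.
    print("box2 reachable:", can_ssh("myuser", "box2"))
```

If this prints `False` when run on the head node, the problem is probably in the SSH setup rather than in the Ray config.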
I’m encountering the following issues, running on the head node:
- When I try to run `ray down -vvvvvvvv -y ray-config.yaml` without first running `ray stop`, it hangs forever with no information:
$ ray down -vvvvvvvv -y ray-config.yaml
Loaded cached provider configuration from /tmp/ray-config-f218cd32e988d445b2bd4b3e6b2682e034aec662
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
This is extra confusing because `--no-config-cache` is not an option for `ray down`!
- When I run `ray stop` and then try `ray up -vvvvvvvvv -y --no-config-cache ray-config.yaml` again, it does not connect to the worker. I can confirm this by checking the dashboard, by checking that RAY_DID_A_THING does not exist on the worker, or by trying to run a job. If I run a job, it hangs and ultimately fails with:
TimeoutError: Placement group creation timed out after 100 seconds. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {...box1...}
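To double-check from the head node which machines actually joined the cluster, something like the sketch below could be used. It assumes Ray is already running on the head and relies on `ray.nodes()` and `ray.cluster_resources()`. `alive_node_ips` is a hypothetical helper name, not a Ray function.

```python
from typing import Any, Dict, List


def alive_node_ips(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return the addresses of nodes reported as alive, sorted."""
    return sorted(n["NodeManagerAddress"] for n in nodes if n.get("Alive"))


def main() -> None:
    import ray  # imported here so the helper above is usable without Ray installed

    ray.init(address="auto")  # attach to the already running cluster
    print("alive nodes:", alive_node_ips(ray.nodes()))
    print("cluster resources:", ray.cluster_resources())


if __name__ == "__main__":
    main()
```

If only the head's address is listed, that matches the timeout error above: the worker never registered, so the placement group cannot be scheduled.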
=============================
Am I doing something wrong? Might this be a bug with Ray?