Hi, I am trying to run a Ray Tune example in a Jupyter notebook on a machine with 2 GPUs.
I am getting warnings. The status output starts with:
Tune Status

| Current time: | 2024-02-26 18:41:09 |
|---|---|
| Running for: | 00:00:05.14 |
| Memory: | 74.2/503.5 GiB |

System Info

Using AsyncHyperBand: num_stopped=0
Bracket: Iter 2.000: None | Iter 1.000: None
Logical resource usage: 4.0/64 CPUs, 2.0/2 GPUs (0.0/1.0 accelerator_type:RTX)

and then changes to
Tune Status

| Current time: | 2024-02-26 18:24:33 |
|---|---|
| Running for: | 00:00:36.27 |
| Memory: | 74.2/503.5 GiB |

System Info

Using AsyncHyperBand: num_stopped=0
Bracket: Iter 2.000: None | Iter 1.000: None
Logical resource usage: 4.0/0 CPUs, 2.0/0 GPUs

Trial Status

| Trial name | status | loc | batch_size | lr |
|---|---|---|---|---|
| train_cifar_1d7e4_00000 | PENDING | | 2 | 0.000481492 |
| train_cifar_1d7e4_00001 | PENDING | | 8 | 0.000209733 |

(raylet) [2024-02-26 18:23:57,740 E 2903766 2903841] (raylet) agent_manager.cc:84: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent/424238335.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here Configuring Logging - Ray 3.0.0.dev0.
(raylet) - The agent is killed by the OS (e.g., out of memory).
2024-02-26 18:24:07,726 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #2...
2024-02-26 18:24:08,228 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #3...
2024-02-26 18:24:08,730 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #4...
2024-02-26 18:24:09,232 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #5...
2024-02-26 18:24:09,734 WARNING resource_updater.py:275 -- Cluster resources cannot be detected or are 0. You can resume this experiment by passing in `resume=True` to `run`.
2024-02-26 18:24:09,735 WARNING util.py:202 -- The `on_step_begin` operation took 2.009 s, which may be a performance bottleneck.
2024-02-26 18:24:09,838 WARNING resource_updater.py:262 -- Cluster resources not detected or are 0. Attempt #2...
The "Cluster resources not detected" warning then repeats up to attempt #5.
These are the versions I have installed on my system:
grpcio==1.60.1
ray==2.9.3
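
In case it helps with reproducing this, here is a rough sketch of the two checks the raylet message suggests (the package versions and the agent logs), written so it can be run from the same notebook. The helper names `installed_version` and `tail` are my own and not part of Ray; the log directory and file names are taken from the error message above, and I am assuming the default `/tmp/ray` location.

```python
from importlib import metadata
from pathlib import Path


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def tail(path: Path, lines: int = 20) -> list:
    """Return the last few lines of a log file, or an empty list if it is missing."""
    if not path.is_file():
        return []
    return path.read_text(errors="replace").splitlines()[-lines:]


if __name__ == "__main__":
    # Same information as `pip freeze | grep grpcio`, plus the Ray version.
    for package in ("ray", "grpcio"):
        print(f"{package}=={installed_version(package)}")

    # Assumed default session directory, as named in the raylet error message.
    log_dir = Path("/tmp/ray/session_latest/logs")
    for name in ("dashboard_agent.log", "runtime_env_agent.log"):
        print(f"--- {name} ---")
        for line in tail(log_dir / name):
            print(line)
```

If the agent logs are missing or empty, that by itself may be a hint that the agent never started, but I have not confirmed that.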