Training keeps getting stuck

I am running on an M1 Pro Mac (32 GB RAM, 10 CPU cores), if that matters. The code uses Ray RLlib.

When I run experiments, they sometimes get stuck: inside a training loop, the print statement that should mark the end of an iteration never appears. I have tried new Python environments, restarting, different Gym environments, different algorithm configs, etc., but the issue persists. The code is very basic, so I don't understand what could be wrong. Here are two versions that have both shown this behavior.

    # Import statements shared by both versions
    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.algorithms.dqn import DQNConfig
    from ray.rllib.algorithms.a3c import A3CConfig
    import time
    import ray
    import tensorflow as tf
    import numpy as np
    import ray.tune as tune
    from ray import air

    env_name = "MountainCar-v0"
    ray.init()
    num_rollout_workers = 7

    config = (
        DQNConfig()
        .environment(env_name)
        .rollouts(num_envs_per_worker=1, num_rollout_workers=num_rollout_workers)
        .framework("tf2")
        .training(
            model={"fcnet_hiddens": [32, 32], "fcnet_activation": "relu"},
            train_batch_size=512,
            lr=0.0001,
            gamma=0.95,
        )
        .evaluation(
            evaluation_num_workers=1,
            evaluation_duration=300,
            evaluation_duration_unit="timesteps",
        )
    )

    run_tuner = tune.Tuner(
        "DQN",
        run_config=air.RunConfig(stop={"episode_reward_mean": -120, "training_iteration": 3}),
        param_space=config,
    )

    results = run_tuner.fit()

and

    ray.init()
    num_rollout_workers = 7
    num_iters = 50

    config = (  # 1. Configure the algorithm,
        DQNConfig()
        .environment(env_name)
        .rollouts(num_envs_per_worker=1, num_rollout_workers=num_rollout_workers)
        .framework("tf2")
        .training(
            model={"fcnet_hiddens": [32, 32], "fcnet_activation": "relu"},
            train_batch_size=512,
            lr=tune.grid_search([0.001, 0.0001, 0.00001]),
            gamma=0.95,
        )
        .evaluation(
            evaluation_num_workers=1,
            evaluation_duration=300,
            evaluation_duration_unit="timesteps",
        )
    )

    algo = config.build()  # 2. build the algorithm,
    for i in range(num_iters):
        algo.train()  # 3. train it,
        print(f"Step {i} done")
        if i % 50 == 0:
            print(algo.evaluate())

Is there a bug in my code? Are there logs I can check? Or is there a setting that lets me detect when a step is taking too long and, if so, skip that step, reset the environments, and continue training?

Here are the main details of my Python environment:

    ray==2.4.0
    python==3.8.10            # via conda-forge
    tensorflow-macos==2.12.0
    tensorflow-metal==0.8.0
    grpcio==1.49.1            # via conda-forge, as the install instructions recommend; tf requires a later version, but I have tried downgrading tf and grpcio as well
    gymnasium==0.26.3

When I check TensorBoard, it also shows that the system is stuck somewhere: no new results are being added. I have left the system running overnight without a single new iteration.

We'll try to reproduce this on our end. Could you try using fewer rollout workers, e.g. 4? Could you also try cloning the Ray repository from GitHub and running `rllib train file rllib/tuned_examples/dqn/cartpole-dqn.yaml` from the root of the repo?
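For the first suggestion, the change would look something like this (a minimal sketch reusing the imports and `env_name` from your post; 4 rollout workers is only the value to test, not a tuned setting):

    # Sketch of the reduced-worker test; everything except num_rollout_workers
    # matches the original config.
    config = (
        DQNConfig()
        .environment(env_name)
        .rollouts(num_envs_per_worker=1, num_rollout_workers=4)
        .framework("tf2")
        .training(
            model={"fcnet_hiddens": [32, 32], "fcnet_activation": "relu"},
            train_batch_size=512,
            lr=0.0001,
            gamma=0.95,
        )
    )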

Your example runs and completes fine. The issue seemed random in when it appeared during training: sometimes it would complete 3 iterations, sometimes 100, and then get stuck. I'm not sure whether it is just the MountainCar env that does this; I don't recall seeing it on others. However, I also haven't seen it happen in a while (I've been working on a different project, so I haven't used RLlib much in the past few weeks).

Ah, actually, a possible explanation is that you're running out of CPUs: there is 1 local worker for the Tune process, and for every trial with a specific set of hyperparameters it launches 1 RLlib learner process, 7 rollout workers, and 1 evaluation worker, so 10 processes overall. In cases like this, Ray ends up waiting for the system to provide more resources than it can, and the run appears to hang. Try decreasing `num_rollout_workers` to 6 and see if the issue reoccurs.
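To double-check that accounting on your machine, you could print Ray's view of the resources (just a sketch using `ray.cluster_resources()` and `ray.available_resources()`):

    import ray

    ray.init()

    # Logical resources Ray believes this machine provides, e.g. {'CPU': 10.0, ...}
    print(ray.cluster_resources())

    # Resources still unclaimed; if this drops to ~0 CPUs while a trial still
    # needs more workers, the trial sits pending and the run looks "stuck".
    print(ray.available_resources())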

Edit: for your second example, I don't believe you can use `tune.grid_search` with the raw `algo.train()` loop; it only works with `tuner = Tuner(...)` followed by `tuner.fit()`. I'm not sure what the behavior of `tune.grid_search` is outside of the `tune.Tuner` context.
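As a minimal sketch (assuming you want the same setup as your second example, with `num_rollout_workers` lowered to 6 as suggested above; the stop criterion below is just a placeholder), the grid search could be moved under Tune like this:

    from ray import air, tune
    from ray.rllib.algorithms.dqn import DQNConfig

    config = (
        DQNConfig()
        .environment("MountainCar-v0")
        .rollouts(num_envs_per_worker=1, num_rollout_workers=6)
        .framework("tf2")
        .training(
            model={"fcnet_hiddens": [32, 32], "fcnet_activation": "relu"},
            train_batch_size=512,
            lr=tune.grid_search([0.001, 0.0001, 0.00001]),  # Tune resolves this, one trial per lr value
            gamma=0.95,
        )
    )

    tuner = tune.Tuner(
        "DQN",
        param_space=config,
        run_config=air.RunConfig(stop={"training_iteration": 50}),  # placeholder stop criterion
    )
    results = tuner.fit()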