I'm trying to evaluate and compare training performance for an RL model with Ray Tune on two AWS instances: one with 1 GPU and one with 4 GPUs. However, training appears to get stuck on the first sample and never finishes. The console shows no warnings or errors.
The same code works fine when using only CPUs.
== Status ==
Current time: 2023-01-23 00:44:50 (running for 01:25:53.07)
Memory usage on this node: 16.1/59.9 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 1.000: None
Resources requested: 4.0/8 CPUs, 1.0/1 GPUs, 0.0/34.66 GiB heap, 0.0/17.33 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /home/ubuntu/ray_results/AIRDQN_2023-01-22_23-18-57
Number of trials: 2/40 (1 PENDING, 1 RUNNING)
+-----------------+----------+---------------------+-------------------+-------------+------------------------+-----------------------+----------------------+--------------------+
| Trial name | status | loc | batch_mode | lr | model/fcnet_activati | model/fcnet_hiddens | observation_filter | train_batch_size |
| | | | | | on | | | |
|-----------------+----------+---------------------+-------------------+-------------+------------------------+-----------------------+----------------------+--------------------|
| AIRDQN_2574d0b0 | RUNNING | 10.101.11.168:15888 | complete_episodes | 2.93357e-06 | elu | (128, 8) | MeanStdFilter | 5000 |
| AIRDQN_305150c6 | PENDING | | truncate_episodes | 5.45804e-05 | relu | (64, 16) | NoFilter | 5000 |
+-----------------+----------+---------------------+-------------------+-------------+------------------------+-----------------------+----------------------+--------------------+
Settings:
RL algorithm: ray.rllib.algorithms.dqn.DQN
Input data type: offline data
Search algorithm: ray.tune.search.optuna.OptunaSearch
Scheduler: ray.tune.schedulers.ASHAScheduler
Off-policy evaluation method: ray.rllib.offline.estimators.DoublyRobust
The code below shows how the RLTrainer is created and configured, along with the other artifacts used in model training and off-policy evaluation. Do I need to configure the ScalingConfig object differently for this to work with one and with multiple GPUs, beyond simply setting use_gpu=True? (A sketch of the kind of change I have in mind follows after the code.)
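For completeness, these methods live inside my training class; the relevant imports look roughly like this (Ray 2.x paths):

# imports used by the snippets below (paths as of Ray 2.x)
import ray
from ray import tune
from ray.tune import Tuner
from ray.tune.search.optuna import OptunaSearch
from ray.tune.schedulers import ASHAScheduler
from ray.air.config import RunConfig, ScalingConfig
from ray.train.rl import RLTrainer
from ray.rllib.offline.estimators import DoublyRobust
from ray.rllib.offline.estimators.fqe_torch_model import FQETorchModel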
def train(self):
    # load train data
    train_dataset = ray.data.read_json(self.offline_data_info['train_dir'])
    # create trainer
    trainer = self.create_trainer(train_dataset=train_dataset)
    # search algorithm
    search_algo = OptunaSearch(
        metric='evaluation/off_policy_estimator/doubly_robust_fitted_q_eval/v_target',
        mode='max'
    )
    # scheduler
    scheduler = ASHAScheduler(
        metric='evaluation/off_policy_estimator/doubly_robust_fitted_q_eval/v_target',
        mode='max',
        time_attr='training_iteration',
        max_t=5,
        grace_period=1
    )
    # create tuner
    tuner = Tuner(
        # trainer
        trainer,
        # tune configuration
        tune_config=self.create_tune_config(
            search_algo=search_algo,
            scheduler=scheduler
        ),
        # hyper-parameters
        param_space=self.create_param_space(),
        # checkpointing - RunConfig checkpoint settings don't work here in Ray AIR,
        # so use _tuner_kwargs to request a checkpoint at the end
        _tuner_kwargs=dict(checkpoint_at_end=True),
    )
    # train models
    result_grid = tuner.fit()
    # convert the result grid content to a pandas dataframe
    df_result = self.create_results_dataframe(result_grid=result_grid)
    return df_result

def create_trainer(self, train_dataset):
    return RLTrainer(
        # run config
        run_config=RunConfig(
            stop=dict(training_iteration=5),
            verbose=3
        ),
        # scaling config
        scaling_config=ScalingConfig(
            use_gpu=True
        ),
        # train dataset
        datasets=dict(train=train_dataset),
        # algorithm
        algorithm='DQN',
        # algorithm config
        config=dict(
            action_space=self.action_space,
            observation_space=self.observation_space,
            framework='torch',
            evaluation_interval=1,
            evaluation_duration=10000,
            evaluation_duration_unit='episodes',
            evaluation_parallel_to_training=False,
            evaluation_num_workers=1,
            evaluation_config=dict(input=self.offline_data_info['test_dir']),
            # off-policy estimation
            off_policy_estimation_methods=dict(
                # doubly robust method
                doubly_robust_fitted_q_eval=dict(
                    type=DoublyRobust,
                    q_model_config=dict(
                        type=FQETorchModel,
                        model=[64]
                    )
                )
            )
        )
    )

def create_tune_config(self, search_algo, scheduler):
    return tune.TuneConfig(
        num_samples=40,
        search_alg=search_algo,
        scheduler=scheduler
    )

def create_param_space(self):
    return dict(
        lr=tune.loguniform(1e-6, 1e-3),
        observation_filter=tune.choice(['NoFilter', 'MeanStdFilter']),
        batch_mode=tune.choice(['truncate_episodes', 'complete_episodes']),
        train_batch_size=tune.choice([5000]),
        model=dict(
            fcnet_activation=tune.choice(['relu', 'elu']),
            fcnet_hiddens=tune.choice(self.network_configurations)
        )
    )
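For what it's worth, the kind of ScalingConfig change I've been wondering about looks roughly like the sketch below. The worker counts and resource values are placeholders I have not verified; whether any of this is actually needed is exactly what I'm unsure about:

# Hypothetical alternative for the 4-GPU instance - is something like this required,
# or should use_gpu=True alone be enough?
scaling_config=ScalingConfig(
    use_gpu=True,
    num_workers=4,                      # one worker per GPU? (placeholder value)
    resources_per_worker={"GPU": 1},    # pin one GPU per worker? (placeholder value)
)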
Thanks,
Stefan