Ray Tune not logging episode metrics with SampleBatch input

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am having an issue using Ray Tune with the Ray Client/Server setup. I do not have a simulator environment; instead, I gather SampleBatches from a live, real-world system and intend to use them for model training. My issue is that Ray Tune does not seem to log episode metrics as part of training, so I cannot determine which trial/checkpoint is the best one to deploy.

This issue seems to be related to something on the GitHub board.

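import gym
from ray import tune
from ray.rllib.agents.dqn import DQNTrainer

# sample_batch_path and DQN_TRAIN_CONFIG are defined elsewhere.
# There is no simulator, so 'env' is None and the spaces are given explicitly.
ENV_CONFIG = {
    'env': None,
    'observation_space': gym.spaces.Box(0, 1, shape=(5,)),
    'action_space': gym.spaces.Discrete(6),
}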
analysis = tune.run(DQNTrainer,
                    name='dqn-ivr-agent',
                    config={"framework": "torch",
                            "num_workers": 1,
                            "num_gpus": 0,
                            'batch_mode': 'complete_episodes',
                            "input": sample_batch_path,
                            **DQN_TRAIN_CONFIG,
                            **ENV_CONFIG,
                            },
                    keep_checkpoints_num=1,
                    checkpoint_score_attr='episode_reward_mean',
                    stop={"training_iteration": 1},
                    num_samples=1)
best_trial = analysis.get_best_trial(metric='total_loss', mode='min', scope='all')
print(best_trial)
best = analysis.get_best_checkpoint(best_trial, metric='total_loss', mode='min')
print(best)
return best  # the snippet is an excerpt from a function body

I receive NaNs for all the metrics.

The trainer also reports episodes_total = 0, which is not right: my SampleBatches contain full episodes with Done=True on the last step.

Is there a way I can get Ray Tune working with SampleBatch-style input?
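
One way to double-check the claim about complete episodes is to recompute the episode metrics by hand from the recorded data. The following is only a minimal sketch: it uses plain Python lists as stand-ins for the rewards and dones columns of a SampleBatch, and both the helper name episode_returns and the sample values are hypothetical.

```python
from typing import List, Sequence


def episode_returns(rewards: Sequence[float], dones: Sequence[bool]) -> List[float]:
    """Sum rewards per episode; an episode ends at each step where done is True."""
    returns = []
    running = 0.0
    for reward, done in zip(rewards, dones):
        running += reward
        if done:
            returns.append(running)
            running = 0.0
    # Steps after the last done=True belong to an unfinished episode and are ignored.
    return returns


# Hypothetical data: two complete episodes followed by one unfinished step.
rewards = [1.0, 0.0, 2.0, 0.5, 0.5, 1.0]
dones = [False, False, True, False, True, False]

returns = episode_returns(rewards, dones)
episodes_total = len(returns)
episode_reward_mean = sum(returns) / episodes_total if returns else float("nan")
print(episodes_total, episode_reward_mean)
```

If a count computed this way is non-zero while the trainer still reports zero episodes, the data itself is probably fine and the gap is in how metrics are collected for offline input.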

@Yard1 @kai Is this something the Tune experts could help with?

I believe the issue is in RLlib's SampleBatch handling. Tune just forwards whatever metrics it receives, so it's likely that RLlib doesn't provide the correct metrics for this input.
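
To illustrate why a forwarded NaN metric is a problem for picking a best trial or checkpoint: every ordering comparison against NaN is False in Python, so ranking by such a metric is meaningless. The sketch below is not Tune's implementation. The helper best_by_metric and the result dictionaries are hypothetical, and the metric keys are only borrowed from the snippet above.

```python
import math
from typing import Dict, List, Optional


def best_by_metric(
    results: List[Dict[str, float]], metric: str, mode: str = "max"
) -> Optional[Dict[str, float]]:
    """Return the best result for `metric`, skipping missing or NaN values."""
    valid = [r for r in results if metric in r and not math.isnan(r[metric])]
    if not valid:
        return None  # nothing usable to rank by
    pick = max if mode == "max" else min
    return pick(valid, key=lambda r: r[metric])


# Hypothetical per-iteration results: the episode metric is NaN, the loss is not.
results = [
    {"training_iteration": 1, "episode_reward_mean": float("nan"), "total_loss": 0.9},
    {"training_iteration": 2, "episode_reward_mean": float("nan"), "total_loss": 0.4},
]

best_reward = best_by_metric(results, "episode_reward_mean", "max")
best_loss = best_by_metric(results, "total_loss", "min")
print(best_reward, best_loss)
```

Until the episode metrics are populated, ranking by a metric that is actually reported (such as a loss value) at least gives a well-defined ordering.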

cc @arturn as RLlib on-call


Hey @Jason_Weinberg ,

After this gets merged, try checking out master and let me know if it works for you or if you are still missing anything.

Awesome, I will give it a shot! Thank you for jumping in on this so fast.


Has this update made it into the nightly build?
I tried the commands below but am still getting NaNs in the output:

pip install -U ray
ray install-nightly
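
Before concluding that the fix is missing, it may help to confirm which build is actually installed in the active environment. This is a minimal sketch using only the standard library. The helper name installed_version is hypothetical, and the expectation that a nightly wheel reports a development-style version string is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


ray_version = installed_version("ray")
if ray_version is None:
    print("ray is not installed in this environment")
else:
    # Assumption: a nightly wheel would show a development-style version here.
    print(f"ray {ray_version}")
```

If the printed version still matches the latest stable release, the nightly install did not take effect in this environment.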