I’m trying to run RL tuning code using AxSearch + AsyncHyperBandScheduler (based on the example) on a local Ray cluster, but Tune doesn’t use all of the cluster’s resources, as shown in the figure below.
In detail, it starts with 9 trials running and then drops to 3 after some of them finish.
I tried to debug it a bit, and it looks like execution blocks waiting on the ready-wait call in the trial executor (line 707).
I’ve attached the resource status below the running code.
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.ax import AxSearch

# Connect to the existing (local) cluster.
ray.init(address="10.0.1.185:6379")

algo = AxSearch(
    max_concurrent=100,
)
scheduler = AsyncHyperBandScheduler()

analysis = tune.run(
    tune.durable(run),
    name=experiment.name,
    metric=metric,
    mode=mode,
    search_alg=algo,
    scheduler=scheduler,
    num_samples=500,
    config={
        "rollout_len": tune.qrandint(3, 200, q=10),
        "time_interval": tune.choice([3, 10, 30]),
        "lstm_size": tune.qrandint(32, 512, 32),
        "len_order": tune.randint(1, 6),
        "enable_market_order": tune.choice([True, False]),
        "lr": tune.uniform(1e-5, 3e-3),
        "coef_order_ratio": tune.uniform(0.0, 0.3),
        "use_execution_penalty": tune.choice([True, False]),
        "action_unit_multiplier": tune.randint(2, 10),
    },
    verbose=3,
    # Each trial requests 8 CPUs and a quarter of a GPU.
    resources_per_trial={"cpu": 8, "gpu": 0.25},
    sync_config=tune.SyncConfig(
        upload_dir=f"s3://ray-durable-trial-bucket/{experiment.name}",
        sync_to_driver=False,
    ),
)
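
In case it helps with diagnosis, here is a minimal sketch (assuming the same cluster address and the 8 CPU / 0.25 GPU per-trial request above) of how I estimate the number of trials that should be able to run concurrently from the cluster's resource totals:

import ray

ray.init(address="10.0.1.185:6379")

# Total and currently free resources as Ray sees them.
total = ray.cluster_resources()
free = ray.available_resources()
print("total:", total)
print("free:", free)

# With resources_per_trial={"cpu": 8, "gpu": 0.25}, concurrency is bounded
# by whichever resource runs out first.
max_by_cpu = int(total.get("CPU", 0) // 8)
max_by_gpu = int(total.get("GPU", 0) // 0.25)
print("expected concurrent trials:", min(max_by_cpu, max_by_gpu))

The number printed here is higher than the 3 trials Tune ends up running, which is why I suspect the blocking above rather than a genuine resource shortage.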