I’m trying to run RL tuning code using AxSearch + AsyncHyperBandScheduler (based on the example) on a local Ray cluster, but Tune doesn’t use all of the cluster’s resources, as shown in the figure below.
In detail, it starts with 9 trials running and then drops to 3 after some of them finish.
I tried to debug it a bit, and it looks like execution blocks waiting on the ready-wait call in the trial executor (line 707).
I’ve attached the resource status below the running code.
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.ax import AxSearch

# Connect to the existing (local) cluster.
ray.init(address="10.0.1.185:6379")

algo = AxSearch(
    max_concurrent=100,
)
scheduler = AsyncHyperBandScheduler()

analysis = tune.run(
    tune.durable(run),
    name=experiment.name,
    metric=metric,
    mode=mode,
    search_alg=algo,
    scheduler=scheduler,
    num_samples=500,
    config={
        "rollout_len": tune.qrandint(3, 200, q=10),
        "time_interval": tune.choice([3, 10, 30]),
        "lstm_size": tune.qrandint(32, 512, 32),
        "len_order": tune.randint(1, 6),
        "enable_market_order": tune.choice([True, False]),
        "lr": tune.uniform(1e-5, 3e-3),
        "coef_order_ratio": tune.uniform(0.0, 0.3),
        "use_execution_penalty": tune.choice([True, False]),
        "action_unit_multiplier": tune.randint(2, 10),
    },
    verbose=3,
    # Each trial requests 8 CPUs and a quarter of a GPU.
    resources_per_trial={"cpu": 8, "gpu": 0.25},
    sync_config=tune.SyncConfig(
        upload_dir=f"s3://ray-durable-trial-bucket/{experiment.name}",
        sync_to_driver=False,
    ),
)
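
In case it helps with diagnosis, here is a minimal sketch (assuming the same cluster address and the 8 CPU / 0.25 GPU per-trial request above) of how I estimate the number of trials that should be able to run concurrently from the cluster's resource totals:

import ray

ray.init(address="10.0.1.185:6379")

# Total and currently free resources as Ray sees them.
total = ray.cluster_resources()
free = ray.available_resources()
print("total:", total)
print("free:", free)

# With resources_per_trial={"cpu": 8, "gpu": 0.25}, concurrency is bounded
# by whichever resource runs out first.
max_by_cpu = int(total.get("CPU", 0) // 8)
max_by_gpu = int(total.get("GPU", 0) // 0.25)
print("expected concurrent trials:", min(max_by_cpu, max_by_gpu))

The number printed here is higher than the 3 trials Tune ends up running, which is why I suspect the blocking above rather than a genuine resource shortage.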