Training trials in parallel on a multi-GPU machine

I have a cluster with 1 GPU node (which has 4 GPUs) and a bunch of other CPU-only nodes. How do I configure Tune to:
(A) use 1 GPU per trial and run 4 concurrent trials across all 4 GPUs, or
(B) use 0.5 GPU per trial so it can run 8 concurrent trials across all 4 GPUs?

So far I’ve tried setting
num_gpus: 0.25
in the config dict provided to tune.run(), but that doesn't help; it only uses a single GPU to run a single experiment at a time.

I am using Ray 1.3.0

Hey @roireshef, ah the correct field is “gpu” not “num_gpus”. Can you try setting "gpu": 1 or "gpu": 0.5 and see if that works for you?
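
For a plain Tune trainable that would look roughly like the sketch below (my_trainable is just a stand-in for your own training function):

from ray import tune

# Each trial reserves half a GPU, so up to 8 trials can share 4 GPUs.
# Use "gpu": 1 to get one trial per GPU instead.
tune.run(
    my_trainable,                                # stand-in for your training function
    resources_per_trial={"cpu": 1, "gpu": 0.5},
    num_samples=8,
)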

@amogkam this might depend on the Ray version in use. In v1.3.0 it is num_gpus, right?
I’ve tried setting that parameter to 0.5 and this is the iteration summary I get:
== Status ==
Memory usage on this node: 33.1/251.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 321.0/1336 CPUs, 0.5/4 GPUs, 0.0/5245.26 GiB heap, 0.0/2460.0 GiB objects
Result logdir: /…/ray_results/debug_lane_merge_gpu_fractions
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+------------+---------+--------+---------------------------------------+------+----------------+---------+----------+
| Trial name | status  | loc    | env_config/NOMINAL_INFLOW_PROBABILITY | iter | total time (s) | ts      | reward   |
|------------+---------+--------+---------------------------------------+------+----------------+---------+----------|
| …_00000    | RUNNING | … :208 | 0.2                                   | 10   | 812.04         | 6856800 | 0.987751 |
| …_00001    | PENDING |        | 0.3                                   |      |                |         |          |
| …_00002    | PENDING |        | 0.4                                   |      |                |         |          |
| …_00003    | PENDING |        | 0.5                                   |      |                |         |          |
+------------+---------+--------+---------------------------------------+------+----------------+---------+----------+

Also note that I’m using PyTorch, not TF. I’m not sure if that is relevant.

Hey @roireshef, even in 1.3 all the docs point to using "gpu" as the field, not "num_gpus": User Guide & Configuring Tune — Ray v1.3.0 🙂

But it seems like it is properly allocating 0.5 GPUs for you, so I think Tune might be handling this automatically internally.

The question I have for you is how many CPUs are you requesting per trial, and how many CPUs does that 1 node with 4 GPUs have?

A trial has to be run on a single node; it cannot be split across multiple nodes. So to run all 4 trials in parallel with a GPU each, all of them have to run on the 1 node that has the GPUs, and that node must have enough CPUs to support them. That would mean your CPU-only nodes would not actually be running any trials.
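
You can double-check what each node actually has with the standard Ray calls; a quick sketch:

import ray

ray.init(address="auto")        # connect to the running cluster
print(ray.cluster_resources())  # total CPUs/GPUs across the whole cluster
print(ray.nodes())              # per-node view, to see what the 4-GPU node offers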

Actually I’m surprised num_gpus is not giving you an error. Do you mind sharing what your full resources_per_trial dict is?

@amogkam I’ve managed to put my finger on the problem, which is the PlacementGroup logic. My setup is as follows:
(Node A): 4x GPU + 32 cores
(Node B): 32 cores
…
(Node Z): 32 cores

Now, I imagined that if I tell Tune to run 4 RLlib trials, each consuming 1x GPU and, say, 320 cores (which amounts to 10 nodes, B-Z), then Tune would figure out that to maximize utilization it should schedule at least the 4 driver processes on Node A and distribute the rest (the workers) across B-Z. What happens in reality is that with placement_strategy=PACK (the default), the first trial to run hogs all the CPU cores on Node A, so it is the only one running: the other 3 trials each need a GPU, but the GPUs are on Node A, which is already at full CPU capacity.

This leaves Node A's GPUs only 25% utilized, and the CPUs on Nodes B-Z underutilized as well.

I’ve switched to placement_strategy=SPREAD. While it does a better job of utilizing more of my cluster's resources, I could only get it to run 3 of the 4 experiments concurrently. And it doesn't make much sense to distribute the 320 workers of every experiment across 320 different machines…
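
Concretely, the switch is just one RLlib config key; a sketch of what I changed (import path is for APPO on Ray 1.3, adjust to your trainer):

from ray.rllib.agents.ppo.appo import DEFAULT_CONFIG

config = DEFAULT_CONFIG.copy()
config.update({
    "placement_strategy": "SPREAD",  # RLlib's default is "PACK"
})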

I think there should be a more sophisticated implementation that doesn't just schedule trial 1, then move on to trial 2, and so on, but rather tries to maximize utilization given that multiple trials need to run concurrently. Or, alternatively, one that lets us specify that we don't want to schedule workers on the driver node, only on remote nodes.
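
One way I could imagine expressing that today is a hand-built placement group that pins only the driver bundle to the GPU node via a custom resource. This is purely an untested sketch of the idea: the "gpu_node" resource name is made up, and I haven't checked whether RLlib 1.3 actually places its rollout workers into externally supplied bundles.

# Assumes the GPU node was started with: ray start --resources='{"gpu_node": 1}'
from ray.tune.utils.placement_groups import PlacementGroupFactory

pgf = PlacementGroupFactory(
    # First bundle: the trial driver/learner, pinned to the GPU node via the
    # (hypothetical) custom resource.
    [{"CPU": 1, "GPU": 1, "gpu_node": 0.01}]
    # Remaining bundles: one per rollout worker, free to land on CPU-only nodes.
    + [{"CPU": 1}] * 320,
    strategy="SPREAD",
)
# Then something like: tune.run(APPOTrainer, config=config, resources_per_trial=pgf, ...)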

@sven1977 have you guys encountered this use case before? Is there any chance there's already a solution for it?

Ohhh you’re using RLlib, that’s why num_gpus is working.

Thanks for the detailed explanation here. Just to get the full context, do you mind sharing your code or at least some small example that’s still able to showcase the issue?

Also cc @kai

@amogkam I can't share reproducible code yet, but I can share my config. That should do, since I don't believe the issue has to do with the underlying environment. Here it is:

# Imports assumed by this snippet (APPO paths as of Ray 1.3); `args` comes from an
# argparse block that is omitted here.
import ray
from ray import tune
from ray.rllib.agents.ppo.appo import APPOTrainer, DEFAULT_CONFIG

if __name__ == "__main__":
    print("Running Ray %s" % ray.__version__)

    # Initialize Ray and connect to the existing cluster
    ray.init(address=args.redis_address, _redis_password=args.redis_password)

    config = DEFAULT_CONFIG.copy()
    config.update({
        "env": <your favorite environment>,
        "framework": "torch",

        "vtrace": True,
        "sample_async": True,
        "num_gpus": 1,
        "num_workers": 320,
        "learner_queue_size": 16,
        "train_batch_size": 5000,

        "gamma": 0.99,
        "lr": 2e-4,
        "lr_schedule": [(0, 2e-04), (200e6, 1e-04), (400e6, 5e-05)],
        "entropy_coeff": 0.01,
        "entropy_coeff_schedule": [(0, 0.01), (400e6, 0)],
        "grad_clip": 40.0,

        "model": {
            "custom_model": <your favorite model>,
            "max_seq_len": 1
        },

        ## ROLLOUT WORKER ##
        "batch_mode": "truncate_episodes",
        "rollout_fragment_length": 100,
        "num_envs_per_worker": 1,

        # report results every x steps
        # those values should be kept high enough to not create a bottleneck in large clusters
        # see: https://github.com/ray-project/ray/issues/12352#issuecomment-735033292
        "timesteps_per_iteration": 30000,
        "min_iter_time_s": 300  # minimal iteration time in seconds
    })

    tune.run(
        APPOTrainer,
        name=<some experiment name>,
        stop={
            "timesteps_total": 300e6,  # Million steps
        },
        num_samples=4,
        config=config,
        checkpoint_freq=50
    )

The num_samples=4 argument should create 4 different trials.
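
As a quick sanity check (assuming RLlib's defaults of 1 CPU for the driver and 1 CPU per worker, which the config above doesn't override), the per-trial request implied by this config matches the "321.0/1336 CPUs" line in the status output earlier in the thread:

num_workers = 320           # from the config above
num_cpus_for_driver = 1     # RLlib default (assumed, not set explicitly above)
num_cpus_per_worker = 1     # RLlib default (assumed, not set explicitly above)

cpus_per_trial = num_cpus_for_driver + num_workers * num_cpus_per_worker
print(cpus_per_trial)       # 321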

@amogkam I've created a very simple environment using the FLOW package (a wrapper for the SUMO simulator) that you can test with the setup above. It's here: flow_rllib_public/simple at main · roireshef/flow_rllib_public · GitHub, and it requires setting up FLOW (see instructions here: Local Installation of Flow — Flow 0.3.0 documentation).

If you need anything else please LMK. And thanks for your help.
