Resources not being used

Ray version: 1.0.1

I’m using Ray Tune to train a custom trainable function. I have 12 CPUs and 1 GPU on my machine, but I’m initializing Ray with only 5 CPUs, as the following code shows:

ray.init(ignore_reinit_error=True, num_cpus=5)

sbedqn_config = {
    ...

    # == Parallelism & Resources ==
    "num_workers": 4,
    "num_envs_per_worker": 1,
    "num_cpus_per_worker": 1,
    "num_gpus_per_worker": 0,
    "num_cpus_for_driver": 1,
    "num_gpus": 0,
    
    ...

tune.run(
    run_or_experiment=train_sbedqn,
    name=f"SBEDQN-{sbedqn_config['env']}_{now_date}-{now_time}",
    config=exp_config,
    num_samples=1,
    stop={"training_iteration": exp_config["max_iterations"]},
    local_dir=get_save_dir(),
    checkpoint_freq=exp_config["checkpoint_freq"],
    checkpoint_at_end=True
)

But the log shows that only one CPU is being used.

Can someone help me figure out what the issue might be?

I initially tried using version 1.6.0, but that got stuck in PENDING forever with the following log.

Required resources for this actor or task: {CPU_group_c9c02268f9e7a6f6b2c2c91eeb57308d: 1.000000} 
Available resources on this node: {4.000000/5.000000 CPU, 2.365269 GiB/2.365269 GiB memory, 1.000000/1.000000 GPU, 1.182635 GiB/1.182635 GiB object_store_memory, 1000.000000/1000.000000 bundle_group_0_c9c02268f9e7a6f6b2c2c91eeb57308d, 0.000000/1.000000 CPU_group_c9c02268f9e7a6f6b2c2c91eeb57308d, 1.000000/1.000000 node:192.168.1.5, 0.000000/1.000000 CPU_group_0_c9c02268f9e7a6f6b2c2c91eeb57308d, 1000.000000/1000.000000 bundle_group_c9c02268f9e7a6f6b2c2c91eeb57308d} 
In total there are 0 pending tasks and 4 pending actors on this node.

Apparently some people were able to solve this issue in version 0.8.x by adding a time.sleep() call between ray.init(...) and tune.run(...). However, no matter how long I set the sleep to, it never ran for me.
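
For reference, the workaround looked roughly like this (just a sketch of what I tried; the sleep length is arbitrary):

import time
import ray
from ray import tune

ray.init(ignore_reinit_error=True, num_cpus=5)

# The suggested 0.8.x workaround: pause so the cluster's resources are
# fully registered before Tune starts scheduling trials.
time.sleep(30)

tune.run(
    run_or_experiment=train_sbedqn,  # my custom trainable, defined elsewhere
    config=exp_config,               # same config dict as above
    num_samples=1
)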

So I just downgraded to version 1.0.1. Now it at least runs, but it’s definitely not using the resources that are available/requested.

Hi @Kai_Yun,
Thanks for reporting the issue.
Can you provide me with a repro script for the pending issue? I would like to try it out on 1.6.0.
BTW, have you tried specifying resources per trial through tune.run?
By default, only 1 CPU and 0 GPUs are allocated per trial.
If you want to run multiple trials, you can specify num_samples > 1.
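
For example, a minimal sketch with your trainable (the numbers are just placeholders to show the shape of the arguments):

tune.run(
    run_or_experiment=train_sbedqn,
    config=exp_config,
    resources_per_trial={"cpu": 5, "gpu": 0},  # reserve 5 CPUs for the single trial
    num_samples=1                              # set > 1 to launch multiple trials
)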

Hope this helps!

Sorry for replying so late. Apparently all the requested CPUs were actually used, but it seems like Tune just didn’t log it properly. I could tell from the warning log shown below that it uses 1 CPU for the trainer and 4 CPUs for the 4 workers:

(pid=22040) WARNING:tensorflow:From C:\Users\kaiyu\anaconda3\envs\ray_env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=22040) Instructions for updating:
(pid=22040) non-resource variables are not supported in the long term
(pid=22040) 2021-09-16 21:56:03,499	ERROR syncer.py:63 -- Log sync requires rsync to be installed.
(pid=22040) 2021-09-16 21:56:03,517	INFO trainer.py:592 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=22040) 2021-09-16 21:56:03,517	INFO trainer.py:617 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=31888) WARNING:tensorflow:From C:\Users\kaiyu\anaconda3\envs\ray_env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=31888) Instructions for updating:
(pid=31888) non-resource variables are not supported in the long term
(pid=31260) WARNING:tensorflow:From C:\Users\kaiyu\anaconda3\envs\ray_env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=31260) Instructions for updating:
(pid=31260) non-resource variables are not supported in the long term
(pid=15172) WARNING:tensorflow:From C:\Users\kaiyu\anaconda3\envs\ray_env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=15172) Instructions for updating:
(pid=15172) non-resource variables are not supported in the long term
(pid=11776) WARNING:tensorflow:From C:\Users\kaiyu\anaconda3\envs\ray_env\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=11776) Instructions for updating:
(pid=11776) non-resource variables are not supported in the long term

In my experience, each of these warnings corresponds to one process being instantiated. From the looks of it, pid=22040 is the trainer and the other four are workers. So I think Ray actually does use all 5 CPUs as requested; it just doesn’t log it properly.
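
One way to double-check this, rather than relying on the warnings, would be to query Ray’s resource accounting while the trial is running. This is just a sketch, not something I actually ran at the time:

import ray

# Run from a process connected to the same Ray instance.
# cluster_resources() shows the totals registered with the cluster;
# available_resources() shows what is not currently held by tasks/actors.
# With 1 driver CPU + 4 worker CPUs in use, available CPU should drop to 0 of 5.
print(ray.cluster_resources())    # e.g. {'CPU': 5.0, 'GPU': 1.0, ...}
print(ray.available_resources())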


I did what you recommended: I set resources_per_trial and commented out the resource assignments in the config dictionary:

tune.run(
    run_or_experiment=train_sbedqn,
    name=f"SBEDQN-{sbedqn_config['env']}_{date}-{time}",
    config=exp_config,
    num_samples=1,
    resources_per_trial={
        "cpu": 1,
        "extra_cpu": 4
    },
    stop={"training_iteration": exp_config["max_iterations"]},
    local_dir=get_save_dir(),
    checkpoint_freq=exp_config["checkpoint_freq"],
    checkpoint_at_end=True
)

And it worked like a charm (or at least I think it did)!

I just have two questions:

  • Does “cpu” mean the CPU for the trainer and “extra_cpu” the CPUs for the workers? By setting the resources as shown above, would I get four workers with one CPU each, or one worker with four CPUs?
  • Can I specify both num_workers in the config dictionary and resources in resources_per_trial?

Thanks a lot!

Hi Kai,
Glad that it worked out.
Try specifying both resources_per_trial and num_workers.
I believe extra_cpu is the total number of CPUs divided up among the num_workers that you specify.
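
Roughly something like this, based on your snippets (a sketch, I haven’t run this exact combination):

sbedqn_config = {
    # ...
    "num_workers": 4,  # worker count stays in the trainable's config
}

tune.run(
    run_or_experiment=train_sbedqn,
    config=exp_config,
    resources_per_trial={
        "cpu": 1,       # for the trainer/driver process of the trial
        "extra_cpu": 4  # pool shared by the workers, so this should give 1 CPU each
    }
)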

One more comment: if you were using a trainer supplied by RLlib itself, you wouldn’t need to specify resources_per_trial, as RLlib has a default resource spec for each algorithm it provides. But in this case, since you are using your own trainable, you have to specify it yourself.
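
For example, with one of the built-in algorithms you can pass the registered trainer name and let Tune pick up RLlib’s default resource request from the config alone (a sketch, not your setup):

tune.run(
    "DQN",                  # built-in RLlib trainer; no resources_per_trial needed
    config={
        "env": "CartPole-v0",
        "num_workers": 4,   # RLlib derives its CPU/GPU request from these settings
        "num_gpus": 0
    },
    stop={"training_iteration": 10}
)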