Ray version: 1.0.1
I’m using Ray Tune to train a custom trainable function. My machine has 12 CPUs and 1 GPU, but I’m initializing Ray with only 5 CPUs, as the following code shows:
```python
ray.init(ignore_reinit_error=True, num_cpus=5)

sbedqn_config = {
    ...
    # == Parallelism & Resources ==
    "num_workers": 4,
    "num_envs_per_worker": 1,
    "num_cpus_per_worker": 1,
    "num_gpus_per_worker": 0,
    "num_cpus_for_driver": 1,
    "num_gpus": 0,
    ...
}
```
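As a side note, one quick way to confirm that the `num_cpus` override took effect is to inspect the cluster resources right after `ray.init` (a minimal sketch; the exact values printed are my assumption):

```python
ray.init(ignore_reinit_error=True, num_cpus=5)
# ray.cluster_resources() reports what the local node registered with Ray;
# "CPU" should read 5.0 here, not the machine's 12 physical cores.
print(ray.cluster_resources().get("CPU"))
# available_resources() shows what is currently unreserved.
print(ray.available_resources().get("CPU"))
```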
```python
tune.run(
    run_or_experiment=train_sbedqn,
    name=f"SBEDQN-{sbedqn_config['env']}_{now_date}-{now_time}",
    config=exp_config,
    num_samples=1,
    stop={"training_iteration": exp_config["max_iterations"]},
    local_dir=get_save_dir(),
    checkpoint_freq=exp_config["checkpoint_freq"],
    checkpoint_at_end=True
)
```
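For context on what Tune actually reserves: with a function trainable, the resource keys inside the config dict are just passed through to the function, and Tune itself only schedules what `resources_per_trial` asks for, which defaults to a single CPU. A hedged sketch of requesting more (untested against this setup; using `extra_cpu` to cover worker actors spawned inside the trainable is my assumption):

```python
tune.run(
    run_or_experiment=train_sbedqn,
    config=exp_config,
    # Reserve 1 CPU for the trainable itself plus 4 for the worker
    # actors it spawns; without this, Tune defaults to 1 CPU per trial.
    resources_per_trial={"cpu": 1, "gpu": 0, "extra_cpu": 4},
    ...
)
```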
But the log shows that only one CPU is being used.
Can someone figure out what the issue might be?
I initially tried using version 1.6.0, but that got stuck in PENDING forever with the following log:

```
Required resources for this actor or task: {CPU_group_c9c02268f9e7a6f6b2c2c91eeb57308d: 1.000000}
Available resources on this node: {4.000000/5.000000 CPU, 2.365269 GiB/2.365269 GiB memory, 1.000000/1.000000 GPU, 1.182635 GiB/1.182635 GiB object_store_memory, 1000.000000/1000.000000 bundle_group_0_c9c02268f9e7a6f6b2c2c91eeb57308d, 0.000000/1.000000 CPU_group_c9c02268f9e7a6f6b2c2c91eeb57308d, 1.000000/1.000000 node:192.168.1.5, 0.000000/1.000000 CPU_group_0_c9c02268f9e7a6f6b2c2c91eeb57308d, 1000.000000/1000.000000 bundle_group_c9c02268f9e7a6f6b2c2c91eeb57308d}
In total there are 0 pending tasks and 4 pending actors on this node.
```
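Reading that log, the trial’s placement group appears to bundle only 1 CPU (`CPU_group_...` shows 0/1 free) while 4 actors wait for CPUs inside that group. A sketch of bundling CPUs for the workers explicitly, assuming the `PlacementGroupFactory` API that `resources_per_trial` accepts in newer Ray releases:

```python
from ray.tune.utils.placement_groups import PlacementGroupFactory

# First bundle is for the trainable itself; the next four are for the
# worker actors, so they can all be scheduled inside the trial's group.
pgf = PlacementGroupFactory([{"CPU": 1}] + [{"CPU": 1}] * 4)

tune.run(train_sbedqn, config=exp_config, resources_per_trial=pgf)
```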
Apparently some people were able to solve this issue in version 0.8.x by adding a `time.sleep()` call between `ray.init(...)` and `tune.run(...)`. However, no matter how long I set the sleep for, it never ran for me.
So I just downgraded to version 1.0.1. Now it at least runs, but it is definitely not using the resources available/requested.