Ray doesn't use all CPUs

I have a simple hyperparameter tuning script set up as a MWE, but of the 512 CPUs I provide to Ray across 4 nodes, at most around 35 are ever in use. For example, the output shows something like this:
Trial status: 385 TERMINATED | 34 RUNNING | 1 PENDING
Current time: 2024-03-10 11:16:21. Total running time: 4min 30s
Logical resource usage: 38.0/512 CPUs, 0/0 GPUs
Current best trial: 690de172 with score=36019.68664143786 and params={'a': 6.001627870999901, 'b': 0.00014951686738453853}
╭────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name            status       a             b             iter   total time (s)   score   │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│ objective_8cd2b423    RUNNING      6.18522       0.00139442                                     │
│ objective_3c66a3e4    RUNNING      7.63639       0.00173502                                     │
│ objective_3bd8e3c1    RUNNING      6.93693       0.000985779                                    │
│ objective_d6564f29    RUNNING      6.94584       0.000694006                                    │
│ objective_2eb3470b    RUNNING      6.9791        0.000238741                                    │
│ objective_14ef2a26    TERMINATED   6.6198        0.00403778    1      19.9409          43825.8  │
│ objective_98bf559e    TERMINATED   6.01495       0.00152969    1      20.3881          36181.1  │
│ objective_3a4ca1e2    TERMINATED   6.00699       0.00113849    1      19.4114          36085    │
│ objective_76923cb9    TERMINATED   7.93839       0.000251659   1      32.122           63018.3  │
│ objective_ad01fc69    TERMINATED   6.00963       0.000950192   1      17.4559          36116.6  │
│ objective_7859beca    PENDING      6.40818       0.00333868                                     │
╰────────────────────────────────────────────────────────────────────────────────────────────────╯

The total running time also ends up around 5 min 30 s, even though with full parallelism it should be roughly the length of a single function evaluation, i.e. about 20-30 seconds. I don't understand why Ray doesn't use all CPUs: the workload is embarrassingly parallel, and at the moment most of the allocation is wasted.
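
For completeness, the cluster itself does seem to register all 512 CPUs (the status line above says 38.0/512). A minimal way to double-check this from the driver, separate from the MWE below, would be something like:

import ray

ray.init(address='auto')  # attach to the running cluster

# Total resources registered across head + workers; should report 512 CPUs here.
print(ray.cluster_resources())
# Resources not currently claimed by tasks, actors, or placement groups.
print(ray.available_resources())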

Versions / Dependencies

Python: 3.10.13
Ray: 2.9.3

Reproduction script

from ray import train, tune
import ray
from ray.air.integrations.wandb import WandbLoggerCallback
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch
from ray.air import RunConfig
from time import sleep
import psutil
import optuna

ray.init(address='auto')  # Connect to the existing Ray cluster


def objective(config):
    score = 0
    # CPU-bound busy loop (roughly 20-30 s per trial) standing in for a real training step.
    for i in range(100000000):
        score += 1/1e5*config["a"] ** 2 + 1/1e5*config["b"]
    train.report({"score": score})

search_space = {
    "a": tune.uniform(6, 8),
    "b": tune.loguniform(1e-4, 1e-2),
}

optuna_search = OptunaSearch(
    metric="score",
    mode="min",
    sampler=optuna.samplers.RandomSampler()
)


trainable = tune.with_resources(
    objective,
    resources=tune.PlacementGroupFactory([
        {'GPU': 0, 'CPU': 1},
        # {'CPU': 128} for _ in range(total_num_nodes)  # 4 nodes with 128 CPUs each
        # {'CPU': 1} for _ in range(total_num_cpus)
    ]),
)

tuner = tune.Tuner(
    trainable,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        num_samples=512,
        search_alg=optuna_search,
        metric="score",
        mode="min",
        #max_concurrent_trials=512,  # Run one trial per node concurrently
    ),
    run_config=RunConfig(name="trainable")
)

results = tuner.fit()
print(results.get_best_result(metric="score", mode="min").config)
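
For reference, I believe the PlacementGroupFactory above with its single {'CPU': 1} bundle is equivalent to simply requesting one CPU per trial via a plain resource dict; a sketch of that alternative (not something I have found to behave differently) would be:

# Assumed-equivalent resource request: 1 CPU per trial, which Tune wraps
# in a placement group internally.
trainable = tune.with_resources(objective, resources={"cpu": 1})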

And the SLURM batch script:


#!/bin/bash
# shellcheck disable=SC2206
#SBATCH --job-name=test
#SBATCH --nodes=5
#SBATCH --ntasks=5
# nodelist=$(awk '{print $1}' nodes.txt | paste -sd, -)
# #SBATCH --nodelist=$nodelist
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --mem-per-cpu=500MB
#SBATCH --time=00:30:00
source ~/miniconda3/etc/profile.d/conda.sh
conda activate IVF_conda

set -x

# __doc_head_address_start__

# Print the value of the SLURM_JOB_NODELIST variable
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")

# Print the value of the nodes variable
echo "nodes: $nodes"
nodes_array=($nodes)

head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
  head_node_ip=${ADDR[1]}
else
  head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
# __doc_head_address_end__

# __doc_head_ray_start__
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

ray_executable_path='/cluster/home/malmansto/miniconda3/envs/IVF_conda/bin/ray'
python_executable_path='/cluster/home/malmansto/miniconda3/envs/IVF_conda/bin/python' 

echo "Starting HEAD at ${head_node}"
srun --nodes=1 --ntasks=1 -w "${head_node}" \
    "$ray_executable_path" start --head --node-ip-address="${head_node_ip}" --port=$port \
    --num-cpus 0 --num-gpus 0 --block &
# __doc_head_ray_end__

# __doc_worker_ray_start__
# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10

# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        "$ray_executable_path" start --address "$ip_head" \
        --num-cpus "$SLURM_CPUS_PER_TASK" --num-gpus 0 --block & 
    sleep 5
done
# __doc_worker_ray_end__

# __doc_script_start__
# ray/doc/source/cluster/doc_code/simple-trainer.py
"$python_executable_path" -u test5.py "$SLURM_CPUS_PER_TASK"
#"$python_executable_path" test5.py