Resources not available with Ray's multiprocessing

Hello, I would like to use Ray's multiprocessing Pool to run a few distributed jobs. However, I seem to be having an issue with the allocated resources.

I am using ray==0.8.7.
My code is a bit complex, but I was able to reproduce the issue with a simple example inspired by the documentation:

import time
import ray
from ray import tune
from ray.util.multiprocessing import Pool

def evaluation_fn(step, width, height):
    return (0.1 + width * step / 100)**(-1) + height * 0.1

def easy_objective(config):
    width, height = config["width"], config["height"]

    for step in range(config["steps"]):
        intermediate_score = evaluation_fn(step, width, height)
        tune.report(iterations=step, mean_loss=intermediate_score)

def run_example(num_samples):
    print(f"cluster resources {ray.cluster_resources()}")
    print(f"cluster available resources {ray.available_resources()}")
    _ = tune.run(
        easy_objective,
        num_samples=num_samples,
        config={
            "steps": 5,
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
            "activation": tune.grid_search(["relu", "tanh"]),
        })

def main():
    start = time.time()
    pool = Pool()
    for result in pool.map(run_example, [5, 6]):
        pass
    end = time.time()
    delta = end - start
    print(f'Took {delta:.3f} seconds')

if __name__ == '__main__':
    main()
It seems that Ray recognizes the 8 cores I have, but they all get claimed and the run never starts.

Here is a sample of the output:

(pid=3675) cluster resources  {'object_store_memory': 32.0, 'memory': 93.0, 'node:': 1.0, 'CPU': 8.0}
(pid=3675) cluster available resources {'node:': 1.0, 'object_store_memory': 32.0, 'memory': 93.0}
2021-03-09 17:58:21,740	WARNING -- The actor or task with ID X is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {node: 1.000000}, {memory: 4.541016 GiB}, {object_store_memory: 1.562500 GiB}. In total there are 0 pending tasks and 16 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

It worked fine when I used Python's multiprocessing Pool, so I was wondering whether you had any idea why this happens with Ray?
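For comparison, here is a minimal sketch of the plain-multiprocessing version that works for me. The job body is just a placeholder here standing in for the real tune.run call, so this snippet only illustrates the Pool usage, not the actual workload:

```python
# Minimal sketch of the stdlib multiprocessing version.
# run_example is a placeholder job here, not the real tune.run(...) call.
from multiprocessing import Pool

def run_example(num_samples):
    # Stand-in for launching a full Tune run with num_samples trials.
    return num_samples * 2

if __name__ == "__main__":
    with Pool() as pool:
        # map blocks until all jobs finish and preserves input order.
        results = pool.map(run_example, [5, 6])
    print(results)  # [10, 12]
```

The key difference is that stdlib workers are plain OS processes, so nothing inside them competes with the parent for a shared CPU budget.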


Hmm, it’s unlikely that Ray Tune will work with multi-processing here. Is there a reason why you need to do this?

Thanks for the answer Richard.

I am basically interested in launching many Tune runs in parallel, each with a completely different config. Python's multiprocessing seems to do the job well, but since I tried Ray's version first, I was curious whether I did something wrong in the code or whether you simply think Tune won't work this way.

Hmm, in the newest Tune (Ray 1.2) you should be able to specify a list of pre-defined configurations via the BasicVariantGenerator. Would that work for you?

Interesting indeed. That could be a good option once we upgrade our Ray version. In the meantime, I will look for another workaround.