Hello, I would like to use Ray's multiprocessing Pool to run a few distributed jobs. However, I seem to be running into an issue with how resources are allocated.
I am using ray==0.8.7
My actual code is a bit more complex, but I was able to reproduce the issue with a simple example inspired by the documentation:
import time

import ray
from ray import tune
from ray.util.multiprocessing import Pool


def evaluation_fn(step, width, height):
    time.sleep(0.1)
    return (0.1 + width * step / 100) ** (-1) + height * 0.1


def easy_objective(config):
    width, height = config["width"], config["height"]
    for step in range(config["steps"]):
        intermediate_score = evaluation_fn(step, width, height)
        tune.report(iterations=step, mean_loss=intermediate_score)


def run_example(num_samples):
    print(f"cluster resources {ray.cluster_resources()}")
    print(f"cluster available resources {ray.available_resources()}")
    _ = tune.run(
        easy_objective,
        num_samples=num_samples,
        config={
            "steps": 5,
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
            "activation": tune.grid_search(["relu", "tanh"])
        })


def main():
    start = time.time()
    pool = Pool()
    for result in pool.map(run_example, [5, 6]):
        print(result)
    end = time.time()
    delta = end - start
    print(f'Took {delta:.3f} seconds')


if __name__ == '__main__':
    main()
It seems that Ray correctly detects the 8 cores I have, but they all get claimed and the run never actually starts.
Here is a sample of the output:
(pid=3675) cluster resources {'object_store_memory': 32.0, 'memory': 93.0, 'node:192.168.0.81': 1.0, 'CPU': 8.0}
(pid=3675) cluster available resources {'node:192.168.0.81': 1.0, 'object_store_memory': 32.0, 'memory': 93.0}
....
2021-03-09 17:58:21,740 WARNING worker.py:1134 -- The actor or task with ID X is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {node:192.168.0.81: 1.000000}, {memory: 4.541016 GiB}, {object_store_memory: 1.562500 GiB}. In total there are 0 pending tasks and 16 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
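If I read that warning correctly, the Pool's worker actors claim all 8 CPUs up front, so the Tune trials have nothing left to be scheduled on. For illustration, this is the kind of change I would expect to leave some CPUs free for the trials (just a sketch; I am assuming the processes argument of Ray's Pool behaves like the standard library's):

    # Hypothetical mitigation: cap the number of Pool actors so the Tune
    # trials started inside run_example still have CPUs available.
    pool = Pool(processes=2)
    for result in pool.map(run_example, [5, 6]):
        print(result)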
The same code worked fine when I used Python's built-in multiprocessing Pool, so I was wondering if you had any idea why this happens with Ray?
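For reference, the variant that worked only swapped the Pool import; everything else stayed as in the script above (sketch below):

    # Variant using the standard library pool instead of Ray's;
    # run_example and the rest of the script above are unchanged.
    from multiprocessing import Pool

    def main():
        pool = Pool()
        for result in pool.map(run_example, [5, 6]):
            print(result)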
Thanks!