Avoiding "too many workers" when calling tuner.fit() from remote functions

In my latest experiment, the raylet eventually crashed after around 20 minutes with a system error: “Resource temporarily unavailable”.
The raylet.out log is full of lines like “Starting too many worker processes for a single type of task. Worker process startup is being throttled.”

So I believe the issue is related to the GitHub issue “Ray starts too many workers (and may crash) when using nested remote functions” (ray-project/ray#3644).

I was starting ~400 remote functions in a loop and then calling ray.get() on all of them. Each remote function in turn starts a tuner.fit() run with max_concurrent_trials=1. The Trainables themselves only need 1 CPU and don’t spawn any new threads, processes, or remote functions, so each of the ~400 tasks should require no more than 1 CPU at a time, because the .fit() call only runs one trial at a time. If Ray doesn’t schedule more of those tasks than there are available CPUs, I don’t see why this setup shouldn’t work, but it’s hard to tell because I’m not sure what tuner.fit() does internally.

Any ideas why this setup is creating apparently way too many workers?

Hey @Adrian-LA, can you share your code so that we can better diagnose the issue? Also, have you tried to create 400 trials in a Tuner instead of creating 400 Tuners?

For example:

space = {"x": tune.uniform(0, 1)}
tuner = tune.Tuner(
    my_trainable,
    param_space=space,
    tune_config=tune.TuneConfig(num_samples=400),
)
results = tuner.fit()
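
Here my_trainable is just a placeholder for whatever you are training; a minimal function trainable, purely for illustration, could be as simple as:

def my_trainable(config):
    # Pretend objective; the returned dict is reported as the trial's final result.
    return {"score": (config["x"] - 0.5) ** 2}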

Also, have you tried to create 400 trials in a Tuner instead of creating 400 Tuners?

I have, but this is generally worse for parallelism, requires the same data to be loaded/held in every worker and wastes (some of) the benefits of sequential optimization with certain search algorithms.

I have since found out that running multiple Tune fits at the same time isn’t officially supported by Ray, which is probably why this use case doesn’t work as is. I suspect that between the sequential trials within a Tuner fit, the CPU briefly becomes available and Ray may schedule another of the parent remote functions, so eventually all ~400 of them may be active, when in reality no more of these tasks should run than there are CPUs in total.
I’ll try to work around this with custom resources to prevent too many Tuner fits from being active at once.

If Ray had a priority system, cases like this could probably be solved elegantly.

The Ray usage basically looks like this:

import ray
from ray import air, tune

@ray.remote
def perform_fit(...):
    # Each call runs its own Tuner with one trial at a time (1 CPU per trial).
    run_config = air.RunConfig(stop={'training_iteration': 1})
    tune_config = tune.TuneConfig(metric='tpl', mode='max',
                                  num_samples=50, max_concurrent_trials=1)
    tuner = tune.Tuner(Trainable, tune_config=tune_config,
                       param_space=params, run_config=run_config)
    results = tuner.fit()


# Dispatch ~400 of these tasks and wait for all of them.
dispatched_tasks = []
for task in optimization_tasks:
    dispatched_tasks.append(perform_fit.remote(...))
ray.get(dispatched_tasks)

In case anyone with similar goals finds this: limiting the number of concurrent Tuner fits with custom resources seems like a good solution so far. The whole job ran through this time without crashes, and Ray no longer complains about too many workers.

It requires little modification:

  1. Define the custom resource when starting the nodes, e.g. --resources='{"limit_tuners": 16}'
  2. Tell Ray that the remote function which launches the Tuner fit needs this resource:
    @ray.remote(resources={"limit_tuners": 1})

This ensures that at most 16 of those tasks are active at once on this node.
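
For completeness, here is a rough sketch of how the pieces fit together (assuming the same Trainable and params as in the earlier snippet; the resource name and the value 16 are arbitrary, and I've collapsed the elided arguments into a single params argument just for illustration):

# Node started with:  ray start --head --resources='{"limit_tuners": 16}'
import ray
from ray import air, tune

@ray.remote(resources={"limit_tuners": 1})
def perform_fit(params):
    # At most 16 of these tasks run at once per node, regardless of free CPUs,
    # because each one holds a "limit_tuners" unit for its whole lifetime.
    tuner = tune.Tuner(
        Trainable,
        tune_config=tune.TuneConfig(metric='tpl', mode='max',
                                    num_samples=50, max_concurrent_trials=1),
        param_space=params,
        run_config=air.RunConfig(stop={'training_iteration': 1}),
    )
    tuner.fit()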
