Simple hello_world example crashes badly

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m just getting started with Ray and am comparing it with Optuna. I’m trying to run a trivial example: minimizing the Rastrigin function.

import numpy as np
from ray import train, tune
from ray.tune.search.bayesopt import BayesOptSearch
import os

search_space = {
    "x": tune.uniform(-5.12, 5.12),
    "y": tune.uniform(-5.12, 5.12),
}

def rastrigin(config):
    # the Rastrigin function
    score = (
        config['x'] ** 2
        - 10 * np.cos(2 * np.pi * config['x'])
        + config['y'] ** 2
        - 10 * np.cos(2 * np.pi * config['y'])
        + 20
    )
    return {"score": score}

bayes_search = BayesOptSearch(metric='score', mode='min')
# tune_config = tune.TuneConfig(search_alg=bayes_search, num_samples=-1, max_concurrent_trials=os.cpu_count(), time_budget_s=10)
tune_config = tune.TuneConfig(mode='min', metric='score', num_samples=-1, max_concurrent_trials=os.cpu_count(), time_budget_s=10)

run_config = train.RunConfig(verbose=0)
tuner = tune.Tuner(rastrigin, param_space=search_space, tune_config=tune_config, run_config=run_config)

results = tuner.fit()

Running this simple code generates a huge error:

Enabling Bayesian search fixes it temporarily, but if I increase the time budget to 60 seconds, it crashes the same way.
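
To be explicit, “enabling Bayesian search” means using the commented-out TuneConfig so the BayesOptSearch instance is actually used, roughly like this:

tune_config = tune.TuneConfig(
    search_alg=bayes_search,
    num_samples=-1,
    max_concurrent_trials=os.cpu_count(),
    time_budget_s=10,
)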

Ubuntu 22.04, Python 3.11.7, and the latest versions of all relevant Python modules, including Ray. I’ve tried installing Ray a couple of different ways, but it makes no difference:

pip install --user ray

pip install --user "ray[all]"

I’m not trying to do any rocket science here, just following the simplest examples from the website, but it doesn’t run properly. :frowning_face:

It would be nice if this project had a solid hello-world example that actually works, in a prominent place on the website. That would help people like me who are completely new to the project.

I played a little bit with this code and was able to reproduce the error. After just changing num_samples from -1 to 1, the code seems to run fine.
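
Concretely, the only line I changed was this one (everything else stayed the same):

tune_config = tune.TuneConfig(mode='min', metric='score', num_samples=1, max_concurrent_trials=os.cpu_count(), time_budget_s=10)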

I don’t know the root cause of this error (probably a bug), but I wanted to share this in case it helps get you unblocked.


I’ve tried that, but then only one trial is performed. What I want is to run as many trials as possible within a given time budget (10 seconds).


Yeah, that makes sense. Just in case, using 10K or 100K for num_samples seems to work as well, but I see your point. I’m also just getting started with Ray, so let’s see what other folks say.
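
In case it’s useful as a stopgap, this is the kind of config I meant: a large but finite num_samples, with the time budget still doing the actual stopping.

tune_config = tune.TuneConfig(
    mode='min',
    metric='score',
    num_samples=10_000,              # large but finite, instead of -1
    max_concurrent_trials=os.cpu_count(),
    time_budget_s=10,                # still stops the run after ~10 s
)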

I’m also getting the error if I try to do a more substantial search using OptunaSearch.

import numpy as np
from ray import train, tune
from ray.tune.search.optuna import OptunaSearch

def rastrigin(config):
    score = (
        config['x'] ** 2
        - 10 * np.cos(2 * np.pi * config['x'])
        + config['y'] ** 2
        - 10 * np.cos(2 * np.pi * config['y'])
        + 20
    )
    return {"score": score}

search_space = {
    "x": tune.uniform(-5.12, 5.12),
    "y": tune.uniform(-5.12, 5.12),
}

optuna_search = OptunaSearch(metric='score', mode='min')
tune_config = tune.TuneConfig(search_alg=optuna_search, num_samples=-1, time_budget_s=20)
run_config = train.RunConfig(name='rastrigin', verbose=0)

tuner = tune.Tuner(rastrigin, param_space=search_space, tune_config=tune_config, run_config=run_config)
results = tuner.fit()

Using a large integer value for num_samples does not fix it.

The error:

2023-12-28 16:54:42,007	ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
(bundle_reservation_check_func pid=664883) Traceback (most recent call last):
(bundle_reservation_check_func pid=664883)   File "python/ray/_raylet.pyx", line 1788, in ray._raylet.execute_task
(bundle_reservation_check_func pid=664883)   File "python/ray/_raylet.pyx", line 1790, in ray._raylet.execute_task
(bundle_reservation_check_func pid=664883)   File "/home/florin/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 790, in deserialize_objects
(bundle_reservation_check_func pid=664883)     context = self.get_serialization_context()
(bundle_reservation_check_func pid=664883)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(bundle_reservation_check_func pid=664883)   File "/home/florin/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 678, in get_serialization_context
(bundle_reservation_check_func pid=664883)     context_map[job_id] = serialization.SerializationContext(self)
(bundle_reservation_check_func pid=664883)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(bundle_reservation_check_func pid=664883)   File "/home/florin/.local/lib/python3.11/site-packages/ray/_private/serialization.py", line 153, in __init__
(bundle_reservation_check_func pid=664883)     serialization_addons.apply(self)
(bundle_reservation_check_func pid=664883)   File "/home/florin/.local/lib/python3.11/site-packages/ray/util/serialization_addons.py", line 29, in apply
(bundle_reservation_check_func pid=664883)     from ray._private.pydantic_compat import register_pydantic_serializers
(bundle_reservation_check_func pid=664883)   File "/home/florin/.local/lib/python3.11/site-packages/ray/_private/pydantic_compat.py", line 2, in <module>
(bundle_reservation_check_func pid=664883)     from pkg_resources import packaging
(bundle_reservation_check_func pid=664883)   File "/home/florin/.local/lib/python3.11/site-packages/pkg_resources/__init__.py", line 31, in <module>
(bundle_reservation_check_func pid=664883)     import pkgutil
(bundle_reservation_check_func pid=664883)   File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
(bundle_reservation_check_func pid=664883)   File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
(bundle_reservation_check_func pid=664883)   File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
(bundle_reservation_check_func pid=664883)   File "<frozen importlib._bootstrap_external>", line 936, in exec_module
(bundle_reservation_check_func pid=664883)   File "<frozen importlib._bootstrap_external>", line 1069, in get_code
(bundle_reservation_check_func pid=664883)   File "<frozen importlib._bootstrap_external>", line 729, in _compile_bytecode
(bundle_reservation_check_func pid=664883)   File "/home/florin/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 841, in sigterm_handler
(bundle_reservation_check_func pid=664883)     raise_sys_exit_with_custom_error_message(
(bundle_reservation_check_func pid=664883)   File "python/ray/_raylet.pyx", line 846, in ray._raylet.raise_sys_exit_with_custom_error_message
(bundle_reservation_check_func pid=664883) SystemExit: 1

Nothing concrete yet, just a few observations/hints:

  • Looking at the log /tmp/ray/session_latest/logs/raylet.out,
    I noticed “main.cc:372: Raylet received SIGTERM”, so I get the impression the OS is killing the Ray workers, probably due to memory/resource exhaustion.

  • There are other logs in the same folder, but I could not find any specific errors, which makes sense if the processes are being killed.

  • I’m planning to try the suggestions on this page in more detail.

  • It seems that high parallelism can trigger OOM errors; maybe a large num_samples for HPO is exceeding the capacity of our local machines, or maybe it just needs proper configuration (see the sketch after this list).

  • I’m using macOS, so the memory monitor is not running, as it is only supported on Linux. Since you are on Linux, I wonder if you can use it and get additional information.

  • Running the code with a low num_samples seems to work; the error only starts to show after increasing it, which makes sense if the issue is resource exhaustion.
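
If resource exhaustion really is the culprit, one thing that might be worth trying (untested on my side, just a sketch) is capping the number of concurrent trials well below the core count while keeping num_samples finite:

tune_config = tune.TuneConfig(
    search_alg=optuna_search,
    num_samples=10_000,        # finite instead of -1
    max_concurrent_trials=4,   # deliberately well below os.cpu_count()
    time_budget_s=20,
)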

Good to know!

At least in my case, it’s not exhausting RAM. I have 64 GB of RAM on this machine, and 75% of it is unused while the code is running; I keep htop open the whole time. I’m pretty sure htop is available on macOS via Homebrew.

I need to be able to generate 10k trials in total, using all CPUs, as fast as possible; lowering that number is not an option for me.

Using multiple instances of Optuna (one per CPU core) under joblib is actually faster for me. For a different problem (not the Rastrigin function, but an actual model optimization), I get 10k trials in 5 minutes with Optuna using the MySQL storage backend; with Ray, I only get about 8k in the same time. I was hoping Ray would be faster.
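
For comparison, this is roughly the Optuna-plus-joblib setup I’m benchmarking against (a simplified sketch: the real code optimizes a model rather than Rastrigin, and the MySQL connection string is a placeholder):

import joblib
import numpy as np
import optuna

STORAGE = 'mysql://user:password@localhost/optuna'  # placeholder connection string
STUDY_NAME = 'rastrigin'

def objective(trial):
    x = trial.suggest_float('x', -5.12, 5.12)
    y = trial.suggest_float('y', -5.12, 5.12)
    # the Rastrigin function, same as in the Ray Tune version
    return (
        x ** 2 - 10 * np.cos(2 * np.pi * x)
        + y ** 2 - 10 * np.cos(2 * np.pi * y)
        + 20
    )

def run_worker(n_trials):
    # every worker attaches to the same study through the shared MySQL storage
    study = optuna.create_study(
        study_name=STUDY_NAME,
        storage=STORAGE,
        direction='minimize',
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=n_trials)

# one worker per CPU core, splitting the 10k-trial budget between them
n_jobs = joblib.cpu_count()
joblib.Parallel(n_jobs=n_jobs)(
    joblib.delayed(run_worker)(10_000 // n_jobs) for _ in range(n_jobs)
)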

I’ll also try to investigate some more tomorrow.