Ray is using so much memory that I cannot even start tuning

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi! I’m new to Ray Tune and I’m trying to use it to tune some deep learning models. I’m running into memory problems and cannot locate their source; I cannot even start training my models. Hopefully someone can help me or point me in the right direction.

Somehow, Ray ends up using a lot of memory (around 9 GB) even though the dataset and my training class are far smaller. Also, I believe Ray is only creating a single trial with this config, since I am not using num_samples yet. Therefore, it shouldn’t be an issue of the dataset being instantiated or copied multiple times, or of many torch models being created.

I’ll try to sum up the main parts of my code:

I have some custom PyTorch model classes and one class (ExperimentRunner) that handles the whole training and testing process (dataloaders, epochs, etc.). I call ray.tune.report after each epoch from this class. The class receives the dataset and the data samplers as parameters and builds the train and validation dataloaders in its __init__ method. I also tried using ref = ray.put(dataset) and retrieving it inside ExperimentRunner with ray.get(ref). With this, Ray stopped complaining about the function size, but the memory problems continued anyway.
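
For context, the ray.put variant of the constructor looks roughly like this (a simplified sketch: the batch size, attribute names, and extra kwargs are placeholders, and the real class does quite a bit more):

import ray
from torch.utils.data import DataLoader

class ExperimentRunner:
    def __init__(self, dataset_ref, train_sampler, val_sampler, callback, **model_kwargs):
        # Resolve the dataset from the Ray object store instead of having it
        # pickled into the trainable function itself.
        dataset = ray.get(dataset_ref)
        self.train_loader = DataLoader(dataset, sampler=train_sampler, batch_size=32)
        self.val_loader = DataLoader(dataset, sampler=val_sampler, batch_size=32)
        self.callback = callback  # ray.tune.report, called once per epoch
        self.model_kwargs = model_kwargs  # model_class, epochs, model_config, ...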

I use Ray Tune as follows (it’s organized across different files and functions, but it boils down to this):

import functools as ft

import numpy as np
import ray
import ray.tune.schedulers
from ray import tune as rt

# retrieve data
dataset = ...
train_sampler, val_sampler = ...
dataset_ref = ray.put(dataset)  # store the dataset once in the Ray object store

# set fixed and search space config for trials
fixed_config = { "model_class": Model3, "epochs": 30}
search_space = {
    "model_config": {
        "nhead": rt.quniform(2, 64, 2),
        "embedding_size": rt.sample_from(lambda spec: spec.config.model_config.nhead * int(np.random.uniform(10))),
        "dim_transformer_feedforward": rt.quniform(64,2048,2),
        "num_layers": rt.quniform(1, 10, 1.0),
        "positional_encoding_dropout": rt.uniform(0.0, 0.5),
        "transformer_encoder_dropout": rt.uniform(0.0, 0.5),
        "classifier_dropout": rt.uniform(0.0, 0.5),
    }
}

# prepare the experiment runner
experiment_runner_constructor = ft.partial(ExperimentRunner,
                                           dataset_ref=dataset_ref,
                                           train_sampler=train_sampler,
                                           val_sampler=val_sampler,
                                           callback=rt.report)

# train function that'll be passed to ray tune, it merges the fixed and tunable config, 
# calls the ExperimentRunner constructor and runs the experiment
def merge_params_and_run(config, constructor, fixed_config):
    # dict_merger: helper that deep-merges the fixed and sampled config dicts
    experiment_runner_config = dict_merger.merge(fixed_config, config)
    experiment = constructor(**experiment_runner_config)
    experiment.run_experiment()

# train function with params
train_fun = ft.partial(merge_params_and_run, constructor=experiment_runner_constructor, fixed_config=fixed_config)

# setup ray tuner
tuner = ray.tune.Tuner(
    trainable=train_fun,
    param_space=search_space,
    tune_config=ray.tune.TuneConfig(
        scheduler=ray.tune.schedulers.ASHAScheduler(metric="val_loss", mode="min"),
    ),
)

# tune
tuner.fit()

I read in another post that you can use pickle to check object sizes, so I checked some variables:

print("experiment_runner_constructor", len(cloudpickle.dumps(experiment_runner_constructor)) // (1024 * 1024))
print("dataset", len(cloudpickle.dumps(dataset)) // (1024 * 1024))
print("train_sampler", len(cloudpickle.dumps(train_sampler)) // (1024 * 1024))
print("train_fun", len(cloudpickle.dumps(train_fun)) // (1024 * 1024))
print("search_space", len(cloudpickle.dumps(search_space)) // (1024 * 1024))

After using ray.put(dataset), everything reports 0 MiB except the dataset itself. I originally had problems with a 31 MB dataset, so I reduced it, but I still get the same problems with a very small (2 MB) dataset.

The logs and the error I get are below:

2023-04-09 16:34:11,623	INFO worker.py:1553 -- Started a local Ray instance.
== Status ==
Current time: 2023-04-09 16:34:22 (running for 00:00:07.13)
Memory usage on this node: 7.3/15.5 GiB 
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 RUNNING)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name                       | status   | loc                  |   model_config/classif |   model_config/dim_tra |   model_config/embeddi |   model_config/nhead |   model_config/num_lay |   model_config/positio |   model_config/transfo |
|                                  |          |                      |            ier_dropout |   nsformer_feedforward |                ng_size |                      |                    ers |   nal_encoding_dropout |   rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | RUNNING  | 192.168.1.101:122098 |               0.432366 |                   1272 |                    276 |                   46 |                5.24044 |              0.0990108 |              0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+


(func pid=122098) 2023-04-09 16:34:22,464 - 2023-04-09 16:34:22_Testing Model 3 - INFO - Epoch 0/30
  0%|          | 0/9 [00:00<?, ?it/s]
(func pid=122098) experiment_runner size 22
== Status ==
Current time: 2023-04-09 16:34:27 (running for 00:00:12.14)
Memory usage on this node: 9.5/15.5 GiB 
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 RUNNING)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name                       | status   | loc                  |   model_config/classif |   model_config/dim_tra |   model_config/embeddi |   model_config/nhead |   model_config/num_lay |   model_config/positio |   model_config/transfo |
|                                  |          |                      |            ier_dropout |   nsformer_feedforward |                ng_size |                      |                    ers |   nal_encoding_dropout |   rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | RUNNING  | 192.168.1.101:122098 |               0.432366 |                   1272 |                    276 |                   46 |                5.24044 |              0.0990108 |              0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+


== Status ==
Current time: 2023-04-09 16:34:32 (running for 00:00:17.14)
Memory usage on this node: 12.3/15.5 GiB 
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 RUNNING)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name                       | status   | loc                  |   model_config/classif |   model_config/dim_tra |   model_config/embeddi |   model_config/nhead |   model_config/num_lay |   model_config/positio |   model_config/transfo |
|                                  |          |                      |            ier_dropout |   nsformer_feedforward |                ng_size |                      |                    ers |   nal_encoding_dropout |   rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | RUNNING  | 192.168.1.101:122098 |               0.432366 |                   1272 |                    276 |                   46 |                5.24044 |              0.0990108 |              0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+


== Status ==
Current time: 2023-04-09 16:34:37 (running for 00:00:22.16)
Memory usage on this node: 15.0/15.5 GiB : ***LOW MEMORY*** less than 10% of the memory on this node is available for use. This can cause unexpected crashes. Consider reducing the memory used by your application or reducing the Ray object store size by setting `object_store_memory` when calling `ray.init`.
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 RUNNING)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name                       | status   | loc                  |   model_config/classif |   model_config/dim_tra |   model_config/embeddi |   model_config/nhead |   model_config/num_lay |   model_config/positio |   model_config/transfo |
|                                  |          |                      |            ier_dropout |   nsformer_feedforward |                ng_size |                      |                    ers |   nal_encoding_dropout |   rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | RUNNING  | 192.168.1.101:122098 |               0.432366 |                   1272 |                    276 |                   46 |                5.24044 |              0.0990108 |              0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+


Result for merge_params_and_run_99f58_00000:
  date: 2023-04-09_16-34-22
  experiment_id: 9180987839a844f881e7cd1b39dde698
  hostname: vant-N2x0WU
  node_ip: 192.168.1.101
  pid: 122098
  timestamp: 1681050862
  trial_id: 99f58_00000
  
2023-04-09 16:34:37,539	ERROR trial_runner.py:1062 -- Trial merge_params_and_run_99f58_00000: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "<path>/venv/lib/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "<path>/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File <path>/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.1.101, ID: cd5b328d1d3ff2839d9b596f203a7c7ed02cbf8092f5ecd0cf7f9373) where the task (actor ID: 42adcfbd460cff245398d59601000000, name=ImplicitFunc.__init__, pid=122098, memory used=7.86GB) was running was 15.06GB / 15.52GB (0.970681), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 2bfe01efb0b0bfc4f49317999680bd930769a54fb53fd0ee8783092d) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.1.101`. To see the logs of the worker, use `ray logs worker-2bfe01efb0b0bfc4f49317999680bd930769a54fb53fd0ee8783092d*out -ip 192.168.1.101. Top 10 memory users:
PID	MEM(GB)	COMMAND
122098	7.86	ray::ImplicitFunc.train
9853	2.58	/home/raquel/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/223.8617.48/jbr/bin/java -classpath ...
121376	0.33	<path>/venv/bin/python<path>...
3506	0.22	./jetbrains-toolbox --minimize
12732	0.21	/opt/google/chrome/chrome
4016	0.18	/usr/bin/gnome-software --gapplication-service
3107	0.16	/usr/bin/gnome-shell
110480	0.14	/opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
46629	0.13	/opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
12850	0.12	/opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

== Status ==
Current time: 2023-04-09 16:34:37 (running for 00:00:22.61)
Memory usage on this node: 15.0/15.5 GiB : ***LOW MEMORY*** less than 10% of the memory on this node is available for use. This can cause unexpected crashes. Consider reducing the memory used by your application or reducing the Ray object store size by setting `object_store_memory` when calling `ray.init`.
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 ERROR)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name                       | status   | loc                  |   model_config/classif |   model_config/dim_tra |   model_config/embeddi |   model_config/nhead |   model_config/num_lay |   model_config/positio |   model_config/transfo |
|                                  |          |                      |            ier_dropout |   nsformer_feedforward |                ng_size |                      |                    ers |   nal_encoding_dropout |   rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | ERROR    | 192.168.1.101:122098 |               0.432366 |                   1272 |                    276 |                   46 |                5.24044 |              0.0990108 |              0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
Number of errored trials: 1
+----------------------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                       |   # failures | error file                                                                                                                                                                                                                         |
|----------------------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| merge_params_and_run_99f58_00000 |            1 | /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14/merge_params_and_run_99f58_00000_0_classifier_dropout=0.4324,dim_transformer_feedforward=1272.0000,embedding_size=276.0000,nhead=4_2023-04-09_16-34-14/error.txt |
+----------------------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

2023-04-09 16:34:37,553	ERROR ray_trial_executor.py:930 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
  File "<path>/venv/lib/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 921, in _resolve_stop_event
    ray.get(future, timeout=timeout)
  File "<path>/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "<path>/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.1.101, ID: cd5b328d1d3ff2839d9b596f203a7c7ed02cbf8092f5ecd0cf7f9373) where the task (actor ID: 42adcfbd460cff245398d59601000000, name=ImplicitFunc.__init__, pid=122098, memory used=7.86GB) was running was 15.06GB / 15.52GB (0.970681), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 2bfe01efb0b0bfc4f49317999680bd930769a54fb53fd0ee8783092d) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.1.101`. To see the logs of the worker, use `ray logs worker-2bfe01efb0b0bfc4f49317999680bd930769a54fb53fd0ee8783092d*out -ip 192.168.1.101. Top 10 memory users:
PID	MEM(GB)	COMMAND
122098	7.86	ray::ImplicitFunc.train
9853	2.58	/home/raquel/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/223.8617.48/jbr/bin/java -classpath ...
121376	0.33	<path>/venv/bin/python <path>...
3506	0.22	./jetbrains-toolbox --minimize
12732	0.21	/opt/google/chrome/chrome
4016	0.18	/usr/bin/gnome-software --gapplication-service
3107	0.16	/usr/bin/gnome-shell
110480	0.14	/opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
46629	0.13	/opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
12850	0.12	/opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

2023-04-09 16:34:37,557	ERROR tune.py:794 -- Trials did not complete: [merge_params_and_run_99f58_00000]
2023-04-09 16:34:37,557	INFO tune.py:798 -- Total run time: 22.84 seconds (22.60 seconds for the tuning loop).

Process finished with exit code 0

I can also see on the system monitor that this process takes up 8 GB of memory (screenshot not included here).

Thanks in advance!

Could you maybe try a simple trial with just:

def func(config):
    dataset = ...
    train_sampler, val_sampler = ...
    runner = ExperimentRunner(dataset, train_sampler=train_sampler, val_sampler=val_sampler, config=config)

and see what the memory consumption is in that case?

I tried, and it does take up a fair amount of memory, although not as much (5 GB at most), and it does not grow as training advances. I am not sure what takes up so much memory; I really should look into that. Still, is it usual for Ray Tune to increase the required memory by an extra 3 GB?

In fact, when running with Ray, I think there is not only the 8 GB process but also a separate one using around 2 GB more (that one also runs when I run the experiment on its own, and I included it in the 5 GB total).

In the attached screenshot (Selection_864), the ray idle process keeps growing in memory.

As far as I understand from reading through this thread, I have more or less the same issue with my 16 GB of RAM. Even though the CPU load remains rather low when starting actors for parallel PPO trials, the RAM consumption soon reaches its limit. A short-term remedy is to set max_concurrent_trials in the TuneConfig. Nevertheless, I dug deeper with the help of memory-profiler: it turns out that a single instance of the gymnasium environment consumes up to 80 MiB of RAM in one actor.
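
For reference, capping concurrency looks roughly like this (a minimal sketch assuming a trainable and search space like the ones above; the value 2 is just an example):

from ray import tune
from ray.tune.schedulers import ASHAScheduler

tuner = tune.Tuner(
    train_fun,  # the trainable from the original post
    param_space=search_space,
    tune_config=tune.TuneConfig(
        scheduler=ASHAScheduler(metric="val_loss", mode="min"),
        max_concurrent_trials=2,  # limit parallel trials/actors to bound peak RAM
    ),
)
tuner.fit()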

Hi @raquelhortab,

it’s hard to investigate this without knowing what the external classes look like. Generally, Ray mostly does orchestration, i.e. the memory usage mostly comes from the functions you actually run, not from Ray itself.

That said, there are a number of pitfalls that can lead to higher memory consumption, e.g. when data is copied multiple times.

In your example, you can try using tune.with_parameters instead of functools.partial, which avoids some possible memory duplication. Another option is, as @xwjiang2010 mentioned, to instantiate the data loaders in the training function itself rather than outside of it.
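
For illustration, a minimal sketch of the tune.with_parameters variant, assuming the ExperimentRunner, fixed_config, and search_space from your post, and that ExperimentRunner can take the dataset object directly (train_fn is just a placeholder name):

from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config, dataset=None, train_sampler=None, val_sampler=None):
    # Heavy objects arrive as keyword arguments injected by tune.with_parameters.
    runner = ExperimentRunner(
        dataset=dataset,
        train_sampler=train_sampler,
        val_sampler=val_sampler,
        callback=tune.report,
        **fixed_config,  # model_class, epochs
        **config,        # sampled model_config
    )
    runner.run_experiment()

tuner = tune.Tuner(
    tune.with_parameters(
        train_fn,
        dataset=dataset,
        train_sampler=train_sampler,
        val_sampler=val_sampler,
    ),
    param_space=search_space,
    tune_config=tune.TuneConfig(
        scheduler=ASHAScheduler(metric="val_loss", mode="min"),
    ),
)
tuner.fit()

tune.with_parameters puts the keyword arguments into the object store once and injects them into each trial, so the trainable itself stays small regardless of how large the dataset is.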