How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi! I’m new to Ray Tune and I’m trying to use it to tune some deep learning models. I’m running into memory problems and I can’t locate the source; I can’t even start training my models. Let’s see if someone can help me or point me toward the problem.
Somehow, Ray ends up using a lot of memory (around 9 GB), even though the dataset and my training class are much smaller. Also, I believe Ray is creating only a single trial with this config, since I’m not using num_samples yet, so it shouldn’t be an issue of the dataset being instantiated/copied multiple times or of many torch models being created.
I’ll try to sum up the main parts of my code:
I have some custom PyTorch model classes and one class (ExperimentRunner) that handles the whole training and testing process (dataloaders, epochs, etc.). I call ray.tune.report after each epoch from this class. The class receives the dataset and the data samplers as parameters and builds the train and validation dataloaders in its __init__ method. I also tried using ref = ray.put(dataset) and retrieving it inside the ExperimentRunner with ray.get(ref). With that, Ray stopped complaining about the function size, but the memory problems continued anyway.
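For context, this is roughly what the ExperimentRunner looks like (a simplified sketch, not the real class; the actual implementation also handles testing, logging, etc.):

import ray
from torch.utils.data import DataLoader

class ExperimentRunner:
    # simplified sketch of my runner class
    def __init__(self, model_class, model_config, epochs,
                 dataset_ref, train_sampler, val_sampler, callback):
        dataset = ray.get(dataset_ref)  # fetch the dataset stored with ray.put()
        self.train_loader = DataLoader(dataset, sampler=train_sampler)
        self.val_loader = DataLoader(dataset, sampler=val_sampler)
        self.model = model_class(**model_config)
        self.epochs = epochs
        self.callback = callback  # this is rt.report, passed in via functools.partial

    def run_experiment(self):
        for epoch in range(self.epochs):
            # ... train one epoch on self.train_loader, evaluate on self.val_loader ...
            val_loss = ...  # placeholder for the real validation loss
            self.callback(val_loss=val_loss)  # report metrics to Tune after each epoch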
I use Ray Tune as follows (the real code is organized across different files and functions, but it boils down to this):
import functools as ft
import numpy as np
import ray
from ray import tune as rt

# retrieve data
dataset = ...
train_sampler, val_sampler = ...
dataset_ref = ray.put(dataset)

# set fixed and search space config for trials
fixed_config = {"model_class": Model3, "epochs": 30}
search_space = {
    "model_config": {
        "nhead": rt.quniform(2, 64, 2),
        "embedding_size": rt.sample_from(lambda spec: spec.config.model_config.nhead * int(np.random.uniform(10))),
        "dim_transformer_feedforward": rt.quniform(64, 2048, 2),
        "num_layers": rt.quniform(1, 10, 1.0),
        "positional_encoding_dropout": rt.uniform(0.0, 0.5),
        "transformer_encoder_dropout": rt.uniform(0.0, 0.5),
        "classifier_dropout": rt.uniform(0.0, 0.5),
    }
}

# prepare the experiment runner (the dataset is passed as an object store reference)
experiment_runner_constructor = ft.partial(ExperimentRunner,
                                           dataset_ref=dataset_ref,
                                           train_sampler=train_sampler,
                                           val_sampler=val_sampler,
                                           callback=rt.report)

# train function that'll be passed to Ray Tune: it merges the fixed and tunable config,
# calls the ExperimentRunner constructor and runs the experiment
def merge_params_and_run(config, constructor, fixed_config):
    experiment_runner_config = dict_merger.merge(fixed_config, config)
    experiment = constructor(**experiment_runner_config)
    experiment.run_experiment()

# train function with params
train_fun = ft.partial(merge_params_and_run, constructor=experiment_runner_constructor, fixed_config=fixed_config)

# setup ray tuner
tuner = ray.tune.Tuner(
    trainable=train_fun,
    param_space=search_space,
    tune_config=ray.tune.TuneConfig(
        scheduler=ray.tune.schedulers.ASHAScheduler(metric="val_loss", mode="min"),
    ),
)

# tune
tuner.fit()
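To make the merge step concrete, for the single trial shown in the logs below, the dict that merge_params_and_run passes to the ExperimentRunner constructor looks roughly like this (values taken from the trial table in the logs; dataset_ref, the samplers, and the report callback are already bound via functools.partial; dict_merger is just a deep-merge helper of mine):

experiment_runner_config = {
    "model_class": Model3,   # from fixed_config
    "epochs": 30,            # from fixed_config
    "model_config": {        # sampled by Tune from search_space
        "nhead": 46,
        "embedding_size": 276,
        "dim_transformer_feedforward": 1272,
        "num_layers": 5.24044,
        "positional_encoding_dropout": 0.0990108,
        "transformer_encoder_dropout": 0.0293209,
        "classifier_dropout": 0.432366,
    },
}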
I read in another post that you can use pickle to check object sizes, so I checked some of the variables (sizes in MiB):
print("experiment_runner_constructor", len(cloudpickle.dumps(experiment_runner_constructor)) // (1024 * 1024))
print("dataset", len(cloudpickle.dumps(dataset)) // (1024 * 1024))
print("train_sampler", len(cloudpickle.dumps(train_sampler)) // (1024 * 1024))
print("train_fun", len(cloudpickle.dumps(train_fun)) // (1024 * 1024))
print("search_space", len(cloudpickle.dumps(search_space)) // (1024 * 1024))
After using ray.put(dataset), everything prints 0 except for the dataset itself. I first ran into these problems with a 31 MB dataset, so I reduced it, but I still get the same problem with a very small (2 MB) dataset.
The logs and the error I get are below:
2023-04-09 16:34:11,623 INFO worker.py:1553 -- Started a local Ray instance.
== Status ==
Current time: 2023-04-09 16:34:22 (running for 00:00:07.13)
Memory usage on this node: 7.3/15.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 RUNNING)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name | status | loc | model_config/classif | model_config/dim_tra | model_config/embeddi | model_config/nhead | model_config/num_lay | model_config/positio | model_config/transfo |
| | | | ier_dropout | nsformer_feedforward | ng_size | | ers | nal_encoding_dropout | rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | RUNNING | 192.168.1.101:122098 | 0.432366 | 1272 | 276 | 46 | 5.24044 | 0.0990108 | 0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
(func pid=122098) 2023-04-09 16:34:22,464 - 2023-04-09 16:34:22_Testing Model 3 - INFO - Epoch 0/30
0%| | 0/9 [00:00<?, ?it/s]
(func pid=122098) experiment_runner size 22
== Status ==
Current time: 2023-04-09 16:34:27 (running for 00:00:12.14)
Memory usage on this node: 9.5/15.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 RUNNING)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name | status | loc | model_config/classif | model_config/dim_tra | model_config/embeddi | model_config/nhead | model_config/num_lay | model_config/positio | model_config/transfo |
| | | | ier_dropout | nsformer_feedforward | ng_size | | ers | nal_encoding_dropout | rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | RUNNING | 192.168.1.101:122098 | 0.432366 | 1272 | 276 | 46 | 5.24044 | 0.0990108 | 0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
== Status ==
Current time: 2023-04-09 16:34:32 (running for 00:00:17.14)
Memory usage on this node: 12.3/15.5 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 RUNNING)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name | status | loc | model_config/classif | model_config/dim_tra | model_config/embeddi | model_config/nhead | model_config/num_lay | model_config/positio | model_config/transfo |
| | | | ier_dropout | nsformer_feedforward | ng_size | | ers | nal_encoding_dropout | rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | RUNNING | 192.168.1.101:122098 | 0.432366 | 1272 | 276 | 46 | 5.24044 | 0.0990108 | 0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
== Status ==
Current time: 2023-04-09 16:34:37 (running for 00:00:22.16)
Memory usage on this node: 15.0/15.5 GiB : ***LOW MEMORY*** less than 10% of the memory on this node is available for use. This can cause unexpected crashes. Consider reducing the memory used by your application or reducing the Ray object store size by setting `object_store_memory` when calling `ray.init`.
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 1.0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 RUNNING)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name | status | loc | model_config/classif | model_config/dim_tra | model_config/embeddi | model_config/nhead | model_config/num_lay | model_config/positio | model_config/transfo |
| | | | ier_dropout | nsformer_feedforward | ng_size | | ers | nal_encoding_dropout | rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | RUNNING | 192.168.1.101:122098 | 0.432366 | 1272 | 276 | 46 | 5.24044 | 0.0990108 | 0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
Result for merge_params_and_run_99f58_00000:
date: 2023-04-09_16-34-22
experiment_id: 9180987839a844f881e7cd1b39dde698
hostname: vant-N2x0WU
node_ip: 192.168.1.101
pid: 122098
timestamp: 1681050862
trial_id: 99f58_00000
2023-04-09 16:34:37,539 ERROR trial_runner.py:1062 -- Trial merge_params_and_run_99f58_00000: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
File "<path>/venv/lib/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
future_result = ray.get(ready_future)
File "<path>/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File <path>/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.1.101, ID: cd5b328d1d3ff2839d9b596f203a7c7ed02cbf8092f5ecd0cf7f9373) where the task (actor ID: 42adcfbd460cff245398d59601000000, name=ImplicitFunc.__init__, pid=122098, memory used=7.86GB) was running was 15.06GB / 15.52GB (0.970681), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 2bfe01efb0b0bfc4f49317999680bd930769a54fb53fd0ee8783092d) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.1.101`. To see the logs of the worker, use `ray logs worker-2bfe01efb0b0bfc4f49317999680bd930769a54fb53fd0ee8783092d*out -ip 192.168.1.101. Top 10 memory users:
PID MEM(GB) COMMAND
122098 7.86 ray::ImplicitFunc.train
9853 2.58 /home/raquel/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/223.8617.48/jbr/bin/java -classpath ...
121376 0.33 <path>/venv/bin/python<path>...
3506 0.22 ./jetbrains-toolbox --minimize
12732 0.21 /opt/google/chrome/chrome
4016 0.18 /usr/bin/gnome-software --gapplication-service
3107 0.16 /usr/bin/gnome-shell
110480 0.14 /opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
46629 0.13 /opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
12850 0.12 /opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
== Status ==
Current time: 2023-04-09 16:34:37 (running for 00:00:22.61)
Memory usage on this node: 15.0/15.5 GiB : ***LOW MEMORY*** less than 10% of the memory on this node is available for use. This can cause unexpected crashes. Consider reducing the memory used by your application or reducing the Ray object store size by setting `object_store_memory` when calling `ray.init`.
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/5.55 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14
Number of trials: 1/1 (1 ERROR)
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
| Trial name | status | loc | model_config/classif | model_config/dim_tra | model_config/embeddi | model_config/nhead | model_config/num_lay | model_config/positio | model_config/transfo |
| | | | ier_dropout | nsformer_feedforward | ng_size | | ers | nal_encoding_dropout | rmer_encoder_dropout |
|----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------|
| merge_params_and_run_99f58_00000 | ERROR | 192.168.1.101:122098 | 0.432366 | 1272 | 276 | 46 | 5.24044 | 0.0990108 | 0.0293209 |
+----------------------------------+----------+----------------------+------------------------+------------------------+------------------------+----------------------+------------------------+------------------------+------------------------+
Number of errored trials: 1
+----------------------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|----------------------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| merge_params_and_run_99f58_00000 | 1 | /home/raquel/ray_results/merge_params_and_run_2023-04-09_16-34-14/merge_params_and_run_99f58_00000_0_classifier_dropout=0.4324,dim_transformer_feedforward=1272.0000,embedding_size=276.0000,nhead=4_2023-04-09_16-34-14/error.txt |
+----------------------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2023-04-09 16:34:37,553 ERROR ray_trial_executor.py:930 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
File "<path>/venv/lib/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 921, in _resolve_stop_event
ray.get(future, timeout=timeout)
File "<path>/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "<path>/venv/lib/python3.8/site-packages/ray/_private/worker.py", line 2382, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.1.101, ID: cd5b328d1d3ff2839d9b596f203a7c7ed02cbf8092f5ecd0cf7f9373) where the task (actor ID: 42adcfbd460cff245398d59601000000, name=ImplicitFunc.__init__, pid=122098, memory used=7.86GB) was running was 15.06GB / 15.52GB (0.970681), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 2bfe01efb0b0bfc4f49317999680bd930769a54fb53fd0ee8783092d) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.1.101`. To see the logs of the worker, use `ray logs worker-2bfe01efb0b0bfc4f49317999680bd930769a54fb53fd0ee8783092d*out -ip 192.168.1.101. Top 10 memory users:
PID MEM(GB) COMMAND
122098 7.86 ray::ImplicitFunc.train
9853 2.58 /home/raquel/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/223.8617.48/jbr/bin/java -classpath ...
121376 0.33 <path>/venv/bin/python <path>...
3506 0.22 ./jetbrains-toolbox --minimize
12732 0.21 /opt/google/chrome/chrome
4016 0.18 /usr/bin/gnome-software --gapplication-service
3107 0.16 /usr/bin/gnome-shell
110480 0.14 /opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
46629 0.13 /opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
12850 0.12 /opt/google/chrome/chrome --type=renderer --crashpad-handler-pid=12741 --enable-crash-reporter=d5cd3...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
2023-04-09 16:34:37,557 ERROR tune.py:794 -- Trials did not complete: [merge_params_and_run_99f58_00000]
2023-04-09 16:34:37,557 INFO tune.py:798 -- Total run time: 22.84 seconds (22.60 seconds for the tuning loop).
Process finished with exit code 0
I can also see in the system monitor that this process is taking up around 8 GB of memory.
Thanks in advance!