RayOutOfMemoryError

Hello, I’m using Ray RLlib. It works perfectly well on all of the computers in my office; however, the one with a bit more power is giving us problems. When we launch runs on it, memory usage keeps rising until it reaches the maximum (the memory never seems to be released), and then the active processes are killed one after another.

Error:

ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node zeta-z1 is used (59.52 / 62.49 GB). The top 10 memory consumers are:

PID MEM COMMAND
553077 1.76GiB ray::ImplicitFunc.train_buffered()
553084 1.73GiB ray::ImplicitFunc.train_buffered()
553083 1.73GiB ray::ImplicitFunc.train_buffered()
553074 1.71GiB ray::ImplicitFunc.train_buffered()
553086 1.52GiB ray::ImplicitFunc.train_buffered()
553085 1.5GiB ray::ImplicitFunc.train_buffered()
553131 1.49GiB ray::ImplicitFunc.train_buffered()
553082 1.48GiB ray::ImplicitFunc.train_buffered()
553087 1.46GiB ray::ImplicitFunc.train_buffered()
553071 1.46GiB ray::ImplicitFunc.train_buffered()
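
For what it’s worth, here is roughly how we watch the growth from outside Ray while a run is going (a minimal psutil sketch, not our exact script; it simply sums the RSS of every process whose command line contains "ray::"):

import time
import psutil

def ray_worker_memory_gib():
    # Sum resident memory of all Ray worker processes ("ray::..." in the cmdline).
    total = 0
    for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if "ray::" in cmdline:
                total += proc.info["memory_info"].rss
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return total / 1024 ** 3

if __name__ == "__main__":
    while True:
        print(f"ray workers: {ray_worker_memory_gib():.2f} GiB")
        time.sleep(30)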

Tensorboard: [screenshot of the training curves attached]

Thanks in advance!

Hey @Clement_Collgon, can you give us more information on which Ray version, algorithm, and RLlib config you are running with? It’s hard to diagnose from this distance without more information. Thanks.

Dear Sven,

The Ray version is 1.3.0, and the RLlib config is:
"rllib": {
  "grid_search": {
    "_fake_gpus": false,
    "actor_hidden_activation": "relu",
    "actor_hiddens": [
      512,
      512
    ],
    "actor_lr": 0.0001,
    "batch_mode": "truncate_episodes",
    "buffer_size": 300000,
    "callbacks": "<class 'ray.rllib.agents.callbacks.DefaultCallbacks'>",
    "clip_actions": true,
    "clip_rewards": null,
    "collect_metrics_timeout": 180,
    "compress_observations": false,
    "create_env_on_driver": false,
    "critic_hidden_activation": "relu",
    "critic_hiddens": [
      512,
      512
    ],
    "critic_lr": 0.001,
    "custom_eval_function": null,
    "custom_resources_per_worker": {},
    "eager_tracing": false,
    "env": null,
    "env_config": {},
    "evaluation_config": {
      "env_config": {},
      "explore": false
    },
    "evaluation_interval": null,
    "evaluation_num_episodes": 10,
    "evaluation_num_workers": 0,
    "exploration_config": {
      "final_scale": 1.0,
      "initial_scale": 1.0,
      "stddev": 0.12,
      "type": "GaussianNoise"
    },
    "explore": true,
    "extra_python_environs_for_driver": {},
    "extra_python_environs_for_worker": {},
    "fake_sampler": false,
    "final_prioritized_replay_beta": 0.4,
    "framework": "tf",
    "gamma": 0.9,
    "grad_clip": null,
    "horizon": null,
    "huber_threshold": 1.0,
    "ignore_worker_failures": false,
    "in_evaluation": false,
    "input": "sampler",
    "input_evaluation": [
      "is",
      "wis"
    ],
    "l2_reg": 0,
    "learning_starts": 1500,
    "local_tf_session_args": {
      "inter_op_parallelism_threads": 8,
      "intra_op_parallelism_threads": 8
    },
    "log_level": "WARN",
    "log_sys_usage": true,
    "logger_config": null,
    "lr": 0.0001,
    "metrics_smoothing_episodes": 100,
    "min_iter_time_s": 1,
    "model": {
      "_time_major": false,
      "conv_activation": "relu",
      "conv_filters": null,
      "custom_action_dist": null,
      "custom_model": null,
      "custom_model_config": {},
      "custom_preprocessor": null,
      "dim": 84,
      "fcnet_activation": "tanh",
      "fcnet_hiddens": [
        256,
        256
      ],
      "framestack": true,
      "free_log_std": false,
      "grayscale": false,
      "lstm_cell_size": 256,
      "lstm_use_prev_action": false,
      "lstm_use_prev_action_reward": -1,
      "lstm_use_prev_reward": false,
      "max_seq_len": 20,
      "no_final_linear": false,
      "use_lstm": false,
      "vf_share_layers": true,
      "zero_mean": true
    },
    "monitor": -1,
    "multiagent": {
      "count_steps_by": "env_steps",
      "observation_fn": null,
      "policies": {},
      "policies_to_train": null,
      "policy_mapping_fn": null,
      "replay_mode": "independent"
    },
    "n_step": 1,
    "no_done_at_end": false,
    "normalize_actions": false,
    "num_cpus_for_driver": 1,
    "num_cpus_per_worker": 0,
    "num_envs_per_worker": 1,
    "num_gpus": 0,
    "num_gpus_per_worker": 0,
    "num_workers": 10,
    "observation_filter": "NoFilter",
    "optimizer": {},
    "output": null,
    "output_compress_columns": [
      "obs",
      "new_obs"
    ],
    "output_max_file_size": 67108864,
    "placement_strategy": "PACK",
    "policy_delay": 2,
    "postprocess_inputs": false,
    "preprocessor_pref": "deepmind",
    "prioritized_replay": false,
    "prioritized_replay_alpha": 0.6,
    "prioritized_replay_beta": 0.4,
    "prioritized_replay_beta_annealing_timesteps": 20000,
    "prioritized_replay_eps": 1e-06,
    "record_env": false,
    "remote_env_batch_wait_ms": 0,
    "remote_worker_envs": false,
    "render_env": false,
    "rollout_fragment_length": 1,
    "sample_async": false,
    "sample_collector": "<class 'ray.rllib.evaluation.collectors.simple_list_collector.SimpleListCollector'>",
    "seed": null,
    "shuffle_buffer_size": 0,
    "simple_optimizer": -1,
    "smooth_target_policy": false,
    "soft_horizon": false,
    "synchronize_filters": true,
    "target_network_update_freq": 0,
    "target_noise": 0.12,
    "target_noise_clip": 1.0,
    "tau": 0.001,
    "tf_session_args": {
      "allow_soft_placement": true,
      "device_count": {
        "CPU": 1
      },
      "gpu_options": {
        "allow_growth": true
      },
      "inter_op_parallelism_threads": 2,
      "intra_op_parallelism_threads": 2,
      "log_device_placement": false
    },
    "timesteps_per_iteration": 1000,
    "train_batch_size": 512,
    "training_intensity": null,
    "twin_q": true,
    "use_huber": false,
    "use_state_preprocessor": false,
    "worker_side_prioritization": false
  },
  "num_samples": 1,
  "policies_config": {},
  "ressources_GS": {
    "cpu": 2,
    "gpu": 0
  },
  "trainer": "DDPG"
},
and we are using the DDPG algorithm.
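
For context, our launcher ends up doing something roughly equivalent to the sketch below (simplified, not the exact code: "MyEnv" is a placeholder for our real environment, only a few of the keys from the dump above are repeated, and resources_per_trial mirrors the "ressources_GS" block):

import ray
from ray import tune

ray.init()

config = {
    "env": "MyEnv",          # placeholder; the real env is registered by our wrapper
    "framework": "tf",
    "num_workers": 10,
    "buffer_size": 300000,
    "train_batch_size": 512,
    # ... the remaining keys from the config dump above ...
}

tune.run(
    "DDPG",                  # the registered RLlib DDPG trainer
    config=config,
    num_samples=1,
    resources_per_trial={"cpu": 2, "gpu": 0},
)
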
The processor model name: Intel(R) Core™ i9-10920X CPU @ 3.50GHz.
The memory available: 64 GB.
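
In case it helps, this is the back-of-envelope estimate we use to sanity-check how much a single full replay buffer should take at buffer_size = 300000 (OBS_DIM and ACT_DIM are placeholders, since the real sizes depend on our environment; it assumes plain float32 storage of (obs, action, reward, next_obs, done)):

# Rough size of one full DDPG replay buffer, with placeholder dimensions.
OBS_DIM = 64       # placeholder observation size (number of float32 values)
ACT_DIM = 8        # placeholder action size (number of float32 values)
BUFFER_SIZE = 300000

# Each transition stores obs, next_obs, action, reward and a done flag as float32.
bytes_per_transition = (2 * OBS_DIM + ACT_DIM + 2) * 4
buffer_gib = BUFFER_SIZE * bytes_per_transition / 1024 ** 3
print(f"~{buffer_gib:.2f} GiB per full buffer")
# With several concurrent trials on the same 64 GB node, the buffers come on top
# of TensorFlow, the sampled batches and the Ray object store.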

Thanks a lot