Resume=True fails without useful error message

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m very close to finishing my training (Ray 2.0.0) and I apparently ran out of disk space, so I cleared some out and tried to resume my training run with resume=True set for tune.run(), but:

2022-09-11 05:35:01,473	INFO worker.py:1515 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
2022-09-11 05:35:02,695	INFO trial_runner.py:727 -- Using following checkpoint to resume: /.../tune-results/ma-pheromone/experiment_state-2022-09-10_12-31-05.json
2022-09-11 05:35:06,702	WARNING trial_runner.py:738 -- Attempting to resume experiment from /.../tune-results/ma-pheromone. This will ignore any new changes to the specification.
2022-09-11 05:35:15,328	INFO tune.py:652 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
Traceback (most recent call last):
  File "train.py", line 323, in <module>
    _main()
  File "train.py", line 317, in _main
    resume=True,
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 752, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5)])

How can I get more information? Please advise.

Also, on a side note (maybe related): keep_checkpoints_num=2 did not seem to care that I had 3 checkpoints already in my folder.

result = tune.run(
    'PPO',
    name='ma-pheromone',
    config=my_ppo_config.to_dict(),
    metric='episode_len_mean',
    mode='min',
    trial_name_creator=trial_str_creator,
    keep_checkpoints_num=2,
    checkpoint_score_attr='episode_len_mean',
    stop=stop,
    verbose=0,
    log_to_file=('train.log', 'error.log'),
    checkpoint_freq=5,
    checkpoint_at_end=True,
    local_dir='~/.../tune-results',
    resume=True,
)

Try increasing the verbosity! For example with verbose=2.
You are indeed getting very little output.

I updated the verbosity to 2 and I get this information:

2022-09-12 03:23:12,823	INFO worker.py:1515 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
2022-09-12 03:23:14,211	INFO trial_runner.py:727 -- Using following checkpoint to resume: /.../tune-results/ma-pheromone/experiment_state-2022-09-10_12-31-05.json
2022-09-12 03:23:18,640	WARNING trial_runner.py:738 -- Attempting to resume experiment from /.../tune-results/ma-pheromone. This will ignore any new changes to the specification.
2022-09-12 03:23:27,751	INFO tune.py:652 -- TrialRunner resumed, ignoring new add_experiment but updating trial resources.
== Status ==
Current time: 2022-09-12 03:24:28 (running for 00:01:01.02)
Memory usage on this node: 10.2/15.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/1 GPUs, 0.0/8.57 GiB heap, 0.0/4.29 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 70704_00000 with episode_len_mean=2564.6875 and parameters={...}
Result logdir: /.../tune-results/ma-pheromone
Number of trials: 1/1 (1 ERROR)
+------------------------------------------------------------------------------------------------+----------+-------------------+--------+------------------+-------+----------+------------------------+----------------------+----------------------+
| Trial name                                                                                     | status   | loc               |   iter |   total time (s) |    ts |   reward |   num_recreated_wor... |   episode_reward_max |   episode_reward_min |
|------------------------------------------------------------------------------------------------+----------+-------------------+--------+------------------+-------+----------+------------------------+----------------------+----------------------|
| PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5) | ERROR    | 192.168.0.42:2071 |     16 |          57488.9 | 82070 |  -809356 |                      0 |              -766859 |              -863800 |
+------------------------------------------------------------------------------------------------+----------+-------------------+--------+------------------+-------+----------+------------------------+----------------------+----------------------+
Number of errored trials: 1
+------------------------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                                                                                     |   # failures | error file                                                                                                                                                                                                           |
|------------------------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5) |            1 | /.../tune-results/ma-pheromone/PPO_nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5__0_2022-09-10_12-31-05/error.txt |
+------------------------------------------------------------------------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "train.py", line 323, in <module>
    _main()
  File "train.py", line 317, in _main
    resume=True,
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 752, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5)])

It says there is an error file called error.txt, but I don’t see any error.txt file in the specified folder, which is very perplexing. I have truncated the parameters that were printed because the output is too large, and I don’t know if they are relevant.

Before retrying the resume on this run, I decided to run it with a “restore” checkpoint instead, and that worked for the time being. That restored run also stopped after a while when my machine ran out of space, but when I resumed the trials for the restored run, it worked just fine. I am very confused about what changed between the two runs.
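For reference, this is roughly what the call looked like with restore instead of resume; the checkpoint path below is only a placeholder for one of the checkpoints Tune had already written for the trial:

result = tune.run(
    'PPO',
    name='ma-pheromone',
    config=my_ppo_config.to_dict(),
    metric='episode_len_mean',
    mode='min',
    stop=stop,
    checkpoint_freq=5,
    checkpoint_at_end=True,
    local_dir='~/.../tune-results',
    # Placeholder path: points at a specific saved checkpoint of the single trial.
    restore='/path/to/checkpoint_000015/checkpoint-15',
)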

Hi again @hridayns ,

Can you let the ray dashboard run and watch how memory consumption builds up there?
What version of Ray are you using?
What env are you using?

Cheers

I am using Ray 2.0.0, and my environment is a custom MultiAgentEnvironment.
When I start the trial with resume=True, I go to http://127.0.0.1:8265 for the dashboard and open the memory consumption view for the node.

It does not seem to have any problems here as far as I can see. Where are the logs from the Ray dashboard stored, if anywhere, on Ubuntu? I can maybe fetch those logs for more information?

You’ll have to keep an eye on the dashboard while it runs.
Or you keep the dashboard alive so you can look at it afterwards.
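One way to do that (a rough sketch, assuming everything runs in the same container): start Ray yourself before launching the training script, so the dashboard is not tied to the driver process and survives a crash of the script.

# In the container, start Ray (and its dashboard) once, e.g. with `ray start --head`.
# Then attach to that running instance from the training script instead of letting
# tune.run() spin up and later tear down its own local Ray instance.
import ray

ray.init(address='auto')  # connect to the Ray instance started via `ray start --head`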

Thank you! I have managed to keep the dashboard alive, and I can look at the logs even after the crash happens now. Is there anything specific I should be looking at? There are multiple logs I could post here:

dashboard.log
dashboard_agent.log
debug_state.txt
debug_state_gcs.txt
events/
gcs_server.err
gcs_server.out
log_monitor.log
monitor.err
monitor.log
monitor.out
old/
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_3866.log
python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_3700.log
ray_client_server.err
ray_client_server.out
raylet.err
raylet.out

I’m not familiar with the particularities of the files you are listing (not a Ray core developer :frowning: ), but have you had a look at the dashboard memory instructions?

I have tried rerunning my training entirely from scratch, and this time I get a different stack trace when it first stops (it was “no space left on device” before). The new stack trace:

2022-09-13 02:52:40,608	WARNING worker.py:1829 -- Raylet is terminated: ip=192.168.0.42, id=bcb0721f05526f380c63a9bf12b961f1e708a269defc9e2e22ac800e. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [2022-09-13 02:52:34,265 C 118 174] (raylet) node_manager.cc:171: This node has beem marked as dead.
    *** StackTrace Information ***
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x47c3ea) [0x55e37debf3ea] ray::operator<<()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x47dee2) [0x55e37dec0ee2] ray::SpdLogMessage::Flush()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x47e1f7) [0x55e37dec11f7] ray::RayLog::~RayLog()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x236834) [0x55e37dc79834] std::_Function_handler<>::_M_invoke()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x3623f4) [0x55e37dda53f4] std::_Function_handler<>::_M_invoke()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x3b7400) [0x55e37ddfa400] ray::rpc::GcsRpcClient::ReportHeartbeat()::{lambda()#2}::operator()()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x360262) [0x55e37dda3262] ray::rpc::ClientCallImpl<>::OnReplyReceived()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x21de05) [0x55e37dc60e05] std::_Function_handler<>::_M_invoke()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x460816) [0x55e37dea3816] EventTracker::RecordExecution()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x409f6e) [0x55e37de4cf6e] std::_Function_handler<>::_M_invoke()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x40a3e6) [0x55e37de4d3e6] boost::asio::detail::completion_handler<>::do_complete()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x98ae2b) [0x55e37e3cde2b] boost::asio::detail::scheduler::do_run_one()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x98c5f1) [0x55e37e3cf5f1] boost::asio::detail::scheduler::run()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x98c820) [0x55e37e3cf820] boost::asio::io_context::run()
    /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x9db6e0) [0x55e37e41e6e0] execute_native_thread_routine
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f712b6436db] start_thread
    /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f712a82561f] __clone
    

(raylet) [2022-09-13 02:52:34,265 C 118 174] (raylet) node_manager.cc:171: This node has beem marked as dead.
(raylet) *** StackTrace Information ***
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x47c3ea) [0x55e37debf3ea] ray::operator<<()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x47dee2) [0x55e37dec0ee2] ray::SpdLogMessage::Flush()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x47e1f7) [0x55e37dec11f7] ray::RayLog::~RayLog()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x236834) [0x55e37dc79834] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x3623f4) [0x55e37dda53f4] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x3b7400) [0x55e37ddfa400] ray::rpc::GcsRpcClient::ReportHeartbeat()::{lambda()#2}::operator()()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x360262) [0x55e37dda3262] ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x21de05) [0x55e37dc60e05] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x460816) [0x55e37dea3816] EventTracker::RecordExecution()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x409f6e) [0x55e37de4cf6e] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x40a3e6) [0x55e37de4d3e6] boost::asio::detail::completion_handler<>::do_complete()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x98ae2b) [0x55e37e3cde2b] boost::asio::detail::scheduler::do_run_one()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x98c5f1) [0x55e37e3cf5f1] boost::asio::detail::scheduler::run()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x98c820) [0x55e37e3cf820] boost::asio::io_context::run()
(raylet) /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/raylet/raylet(+0x9db6e0) [0x55e37e41e6e0] execute_native_thread_routine
(raylet) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f712b6436db] start_thread
(raylet) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f712a82561f] __clone
(raylet) 
2022-09-13 02:53:12,457	WARNING resource_updater.py:52 -- Cluster resources not detected or are 0. Attempt #2...
2022-09-13 02:53:12,960	WARNING resource_updater.py:52 -- Cluster resources not detected or are 0. Attempt #3...
2022-09-13 02:53:13,462	WARNING resource_updater.py:52 -- Cluster resources not detected or are 0. Attempt #4...
2022-09-13 02:53:13,964	WARNING resource_updater.py:52 -- Cluster resources not detected or are 0. Attempt #5...
2022-09-13 02:53:14,466	WARNING resource_updater.py:65 -- Cluster resources cannot be detected or are 0. You can resume this experiment by passing in `resume=True` to `run`.
2022-09-13 02:53:14,466	WARNING util.py:220 -- The `on_step_begin` operation took 111.586 s, which may be a performance bottleneck.
2022-09-13 02:53:14,473	ERROR trial_runner.py:980 -- Trial PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5): Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/execution/ray_trial_executor.py", line 989, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

2022-09-13 02:54:02,264	ERROR ray_trial_executor.py:104 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/execution/ray_trial_executor.py", line 94, in _post_stop_cleanup
    ray.get(future, timeout=0)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Traceback (most recent call last):
  File "train.py", line 326, in <module>
    _main()
  File "train.py", line 318, in _main
    local_dir='~/devel/rllibsumoutils/pheromone-RL/pheromone-PPO/tune-results',
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 752, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5)])

And when I tried resuming this run:

2022-09-13 04:16:06,357	INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 192.168.0.42:6379...
[2022-09-13 04:16:06,365 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2022-09-13 04:16:07,366 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2022-09-13 04:16:08,366 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2022-09-13 04:16:09,367 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2022-09-13 04:16:10,368 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2022-09-13 04:16:11,369 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2022-09-13 04:16:12,369 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2022-09-13 04:16:13,370 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
^C[2022-09-13 04:16:14,371 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2022-09-13 04:16:15,372 W 29933 29933] global_state_accessor.cc:390: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?

So I restarted the Docker container and now I am stuck in the same situation as at the start of this question. I should also note that I have an additional 20 GB of free space since I resized my partition. Any pointers? Looking at the memory logs didn’t seem to help me, and I can’t tell whether these stack traces point at anything specific.

I have a hunch that my GPU usage reaches 100% and that’s when the error pops up. What would I need to change to bring this down?

That is entirely possible. We still have not seen the actual error message so it’s hard to tell. You can watch your GPU memory with watch -n 0.1 nvidia-smi.
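If you want a record you can inspect after a crash, here is a rough sketch that logs GPU memory to a file once per second (it just shells out to nvidia-smi, so it assumes that binary is available inside your container):

import subprocess
import time

# Append a timestamped GPU memory reading to gpu_mem.log every second.
with open('gpu_mem.log', 'a') as f:
    while True:
        used_total = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=memory.used,memory.total', '--format=csv,noheader']
        ).decode().strip()
        f.write('{} {}\n'.format(time.strftime('%Y-%m-%d %H:%M:%S'), used_total))
        f.flush()
        time.sleep(1.0)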

@hridayns, as you are using Ray 2.0.0, the following should be part of it now: to resume errored-out trials you need to tell the Tune runner that the errored-out trials should be rerun (or restarted). This can be done by setting resume="LOCAL+ERRORED" (or, for a restart where no former checkpoint is used, resume="LOCAL+RESTART_ERRORED").

If you are running with a remote checkpoint folder (for example GCS or S3), you use resume="REMOTE+ERRORED" (for this the upload_dir should point to the remote folder). Or, if you want to restart only the errored trials from scratch, use resume="REMOTE+RESTART_ERRORED_ONLY".

See the function documentation for more info.
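A minimal sketch of what that would look like for your call (name and local_dir must match the original run so Tune can find the experiment state):

result = tune.run(
    'PPO',
    name='ma-pheromone',
    config=my_ppo_config.to_dict(),
    local_dir='~/.../tune-results',
    # Resume the local experiment and also rerun the errored trial
    # from its latest checkpoint.
    resume='LOCAL+ERRORED',
)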

Thank you @Lars_Simon_Zehnder ! Will definitely try this since I have another run going on at the moment. When it fails, I will try to resume it with that information in mind!

Meanwhile, @arturn: there seems to be something hogging my system RAM; it’s steadily growing. Over time, I think I will run into a problem similar to one posted by someone else on SO: memory management - RAM Usage Keeps Going Up While Training an RL Network Using RLLib and TensorFlow - Stack Overflow

On a side note, does resume restore the random state (if I am, for example, using NumPy’s default_rng() with a seed)?

@hridayns Restoring will not restore the state of any RNG. It will only restore seeds set by the config you input. We are aware of and working on such reproducibility topics as they are important to the research community.
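If you need the NumPy generator state to survive a crash, you currently have to persist it yourself. A rough sketch with plain NumPy and pickle, independent of Ray:

import pickle
import numpy as np

rng = np.random.default_rng(seed=42)

# Save the generator's internal state alongside your own checkpointing logic.
with open('rng_state.pkl', 'wb') as f:
    pickle.dump(rng.bit_generator.state, f)

# After a restart, restore the state before sampling again.
rng = np.random.default_rng()
with open('rng_state.pkl', 'rb') as f:
    rng.bit_generator.state = pickle.load(f)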

Can you find out what’s hogging your RAM? Does the same thing happen when you run on nightlies?

So I was going through the Ray dashboard memory debugging instructions from the link you posted earlier (which seem to be instructions for the legacy dashboard):

Debugging ObjectStoreFullError and Memory Leaks

You can view information for Ray objects in the memory tab. It is useful to debug memory leaks, especially ObjectStoreFullError.

One common cause of these memory errors is that there are objects which never go out of scope. In order to find these, you can go to the Memory View, then select to “Group By Stack Trace.” This groups memory entries by their stack traces up to three frames deep. If you see a group which is growing without bound, you might want to examine that line of code to see if you intend to keep that reference around.

The memory view is just stuck on a constant “Loading” for me.

Note that this is the same information as displayed in the ray memory command. For details about the information contained in the table, please see the ray memory documentation.

How do I run the ray memory command in my Docker container while I am already running my training script?

Inspect Memory Usage

You can detect local memory anomalies through the Logical View tab. If NumObjectRefsInScope, NumLocalObjects, or UsedLocalObjectMemory keeps growing without bound, it can lead to out of memory errors or eviction of objectIDs that your program still wants to use.

  • NumObjectRefsInScope does not change much from 4 for the PPO actor, stays at 1 for the 2 rollout workers.
  • NumLocalObjects and UsedLocalObjectMemory seem to always remain at 0 - not sure what this means exactly?

This view does not show information about heap memory usage. That means if your actor or task is allocating too much memory locally (not in the Ray object store, a.k.a. Plasma), this view will not help you find it.

Seeing as I believe that my RAM usage might max out at some point (it is at 50% right now), would this count as the ‘heap memory usage’? I couldn’t find much information on how to debug that through Ray, except for maybe the experimental profiling button (which I did not see on the Logical View of my Ray dashboard).
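In the meantime I am thinking of just tracking the heap of the process running PPO myself. A rough sketch of what I have in mind (assumes psutil is installed; MemoryUsageCallbacks is just a hypothetical name, I would merge this into my existing callbacks class):

import os

import psutil
from ray.rllib.algorithms.callbacks import DefaultCallbacks


class MemoryUsageCallbacks(DefaultCallbacks):
    def on_train_result(self, *, result, **kwargs):
        # Report the resident set size of the process running the algorithm
        # (in MiB) as an extra metric in each training result.
        result['process_rss_mb'] = psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)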

I will try running this on Ray’s nightly version if it fails again with this run (however, that might introduce another set of challenges). I hope to know by tomorrow morning. Sorry again for the trouble, and thank you for the continued support.

Hello,

The run stopped in the morning.

The original error:

2022-09-13 21:31:06,324	WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffc00821a3d8c8e614fe5fde3601000000 Worker ID: 4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847 Node ID: a561b0101bc5614e82e2d5b4f90dec765174a37a62e847e9de0dc96f Worker IP address: 192.168.0.42 Worker port: 10005 Worker PID: 374 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2022-09-13 21:31:12,929	ERROR trial_runner.py:980 -- Trial PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5): Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/execution/ray_trial_executor.py", line 989, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: PPO
	actor_id: c00821a3d8c8e614fe5fde3601000000
	pid: 374
	namespace: 2a3ac498-b1a4-4122-a924-87602cd4a194
	ip: 192.168.0.42
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

2022-09-13 21:32:09,939	ERROR ray_trial_executor.py:104 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/execution/ray_trial_executor.py", line 94, in _post_stop_cleanup
    ray.get(future, timeout=0)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: PPO
	actor_id: c00821a3d8c8e614fe5fde3601000000
	pid: 374
	namespace: 2a3ac498-b1a4-4122-a924-87602cd4a194
	ip: 192.168.0.42
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Traceback (most recent call last):
  File "train.py", line 326, in <module>
    _main()
  File "train.py", line 318, in _main
    local_dir='~/devel/rllibsumoutils/pheromone-RL/pheromone-PPO/tune-results',
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 752, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5)])

There are two python-core-driver logs for some reason. I don’t understand why, since both of them started at the same time.

python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_279.log - PID: 279
python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_58.log - PID: 58

Some error messages were found in the first one, though they are timestamped 2022-09-13 21:31:03, which was yesterday evening, when everything seemed to be running smoothly:


[2022-09-13 21:31:03,394 I 279 301] task_manager.cc:446: Task failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: : Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.rllib.algorithms.ppo.ppo, class_name=PPO, function_name=train, function_hash=}, task_id=dc746dc61b2c1923c00821a3d8c8e614fe5fde3601000000, task_name=PPO.train(), job_id=01000000, num_args=0, num_returns=2, depth=0, actor_task_spec={actor_id=c00821a3d8c8e614fe5fde3601000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff01000000, actor_counter=19}
[2022-09-13 21:31:03,396 I 279 301] direct_actor_task_submitter.cc:539: PushActorTask failed because of network error, this task will be stashed away and waiting for Death info from GCS, task_id=dc746dc61b2c1923c00821a3d8c8e614fe5fde3601000000, wait_queue_size=133799
[2022-09-13 21:31:03,407 I 279 301] actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: c00821a3d8c8e614fe5fde3601000000, ip address: 192.168.0.42, port: 10005, worker_id: 4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847, raylet_id: a561b0101bc5614e82e2d5b4f90dec765174a37a62e847e9de0dc96f, num_restarts: 0, death context type=ActorDiedErrorContext
[2022-09-13 21:31:03,407 I 279 301] direct_actor_task_submitter.cc:278: Failing pending tasks for actor c00821a3d8c8e614fe5fde3601000000 because the actor is already dead.
[2022-09-13 21:31:12,998 I 279 279] task_manager.cc:446: Task failed: IOError: cancelling task of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.rllib.algorithms.ppo.ppo, class_name=PPO, function_name=stop, function_hash=}, task_id=c76a79b2875a7251c00821a3d8c8e614fe5fde3601000000, task_name=PPO.stop(), job_id=01000000, num_args=0, num_returns=2, depth=0, actor_task_spec={actor_id=c00821a3d8c8e614fe5fde3601000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff01000000, actor_counter=20}
[2022-09-13 21:31:49,659 I 279 301] core_worker.cc:476: Event stats:

Queueing time: mean = 151.141 us, max = 4.874 s, min = -0.094 s, total = 123.346 s
Execution time:  mean = 82.065 us, total = 66.973 s

Event stats:
UNKNOWN - 641306 total (4 active, 1 running), CPU time: mean = 18.916 us, total = 12.131 s
CoreWorker.deadline_timer.flush_profiling_events - 58242 total (1 active), CPU time: mean = 4.910 us, total = 285.959 ms
NodeManagerService.grpc_client.ReportWorkerBacklog - 58241 total (0 active), CPU time: mean = 8.039 us, total = 468.190 ms
CoreWorkerService.grpc_server.GetCoreWorkerStats - 57882 total (0 active), CPU time: mean = 51.602 us, total = 2.987 s
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 120 total (1 active), CPU time: mean = 295.524 us, total = 35.463 ms
CoreWorkerService.grpc_server.LocalGC - 97 total (0 active), CPU time: mean = 526.200 ms, total = 51.041 s
CoreWorkerService.grpc_server.PubsubCommandBatch - 55 total (0 active), CPU time: mean = 52.427 us, total = 2.883 ms
CoreWorkerService.grpc_server.PubsubLongPolling - 45 total (0 active), CPU time: mean = 7.810 us, total = 351.431 us
CoreWorkerService.grpc_server.UpdateObjectLocationBatch - 27 total (0 active), CPU time: mean = 66.997 us, total = 1.809 ms
CoreWorkerService.grpc_client.PushTask - 21 total (0 active), CPU time: mean = 494.822 us, total = 10.391 ms
CoreWorkerDirectActorTaskSubmitter::SubmitTask - 20 total (0 active), CPU time: mean = 179.994 us, total = 3.600 ms
StatsGcsService.grpc_client.AddProfileData - 20 total (0 active), CPU time: mean = 13.241 us, total = 264.825 us
PeriodicalRunner.RunFnPeriodically - 5 total (0 active), CPU time: mean = 44.210 us, total = 221.050 us
Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 4 total (0 active), CPU time: mean = 630.415 us, total = 2.522 ms
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 100.660 us, total = 201.320 us
CoreWorkerService.grpc_server.WaitForActorOutOfScope - 1 total (0 active), CPU time: mean = 12.157 us, total = 12.157 us
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 36.506 us, total = 36.506 us
ActorInfoGcsService.grpc_client.CreateActor - 1 total (0 active), CPU time: mean = 32.697 us, total = 32.697 us
NodeManagerService.grpc_client.RequestWorkerLease - 1 total (0 active), CPU time: mean = 260.194 us, total = 260.194 us
PlacementGroupInfoGcsService.grpc_client.CreatePlacementGroup - 1 total (0 active), CPU time: mean = 66.894 us, total = 66.894 us
ActorCreator.AsyncRegisterActor - 1 total (0 active), CPU time: mean = 126.684 us, total = 126.684 us
CoreWorker.SubmitTask - 1 total (0 active), CPU time: mean = 713.898 us, total = 713.898 us
ActorInfoGcsService.grpc_client.RegisterActor - 1 total (0 active), CPU time: mean = 94.697 us, total = 94.697 us
NodeManagerService.grpc_client.ReturnWorker - 1 total (0 active), CPU time: mean = 5.393 us, total = 5.393 us
ActorInfoGcsService.grpc_client.GetActorInfo - 1 total (0 active), CPU time: mean = 469.095 us, total = 469.095 us
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 11.321 us, total = 11.321 us


[2022-09-13 21:32:11,078 I 279 279] core_worker.cc:593: Disconnecting to the raylet.
[2022-09-13 21:32:11,078 I 279 279] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-09-13 21:32:11,078 I 279 279] core_worker.cc:540: Shutting down a core worker.
[2022-09-13 21:32:11,078 I 279 279] core_worker.cc:564: Disconnecting a GCS client.
[2022-09-13 21:32:11,078 I 279 279] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-09-13 21:32:11,078 I 279 301] core_worker.cc:691: Core worker main io service stopped.
[2022-09-13 21:32:11,079 I 279 279] core_worker.cc:577: Core worker ready to be deallocated.
[2022-09-13 21:32:11,079 I 279 279] core_worker.cc:531: Core worker is destructed
[2022-09-13 21:32:11,226 I 279 279] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 279
[2022-09-13 21:32:11,226 I 279 279] io_service_pool.cc:47: IOServicePool is stopped.

Comparison of two points in time (earlier on and at the end) in the log file for worker ID 4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847 (as noted in the original error), found in python-core-worker-4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847_374.log:

[2022-09-13 08:04:50,628 I 374 394] core_worker.cc:476: Event stats:
Global stats: 147632 total (7 active)
Queueing time: mean = 270.607 us, max = 210.064 ms, min = -0.011 s, total = 39.950 s
Execution time: mean = 42.538 us, total = 6.280 s
Event stats:
UNKNOWN - 117965 total (5 active, 1 running), CPU time: mean = 20.876 us, total = 2.463 s
CoreWorker.deadline_timer.flush_profiling_events - 9837 total (1 active), CPU time: mean = 8.733 us, total = 85.909 ms
NodeManagerService.grpc_client.ReportWorkerBacklog - 9836 total (0 active), CPU time: mean = 10.073 us, total = 99.076 ms
CoreWorkerService.grpc_server.GetCoreWorkerStats - 9787 total (0 active), CPU time: mean = 103.127 us, total = 1.009 s
CoreWorkerService.grpc_server.PubsubCommandBatch - 39 total (0 active), CPU time: mean = 40.725 us, total = 1.588 ms
CoreWorkerService.grpc_server.PubsubLongPolling - 29 total (0 active), CPU time: mean = 9.669 us, total = 280.390 us
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 24 total (1 active), CPU time: mean = 178.458 us, total = 4.283 ms
CoreWorkerService.grpc_server.UpdateObjectLocationBatch - 21 total (0 active), CPU time: mean = 64.287 us, total = 1.350 ms
CoreWorkerService.grpc_server.LocalGC - 16 total (0 active), CPU time: mean = 163.032 ms, total = 2.609 s
CoreWorkerService.grpc_client.PushTask - 15 total (0 active), CPU time: mean = 104.711 us, total = 1.571 ms
CoreWorkerDirectActorTaskSubmitter::SubmitTask - 15 total (0 active), CPU time: mean = 90.475 us, total = 1.357 ms
StatsGcsService.grpc_client.AddProfileData - 11 total (0 active), CPU time: mean = 26.087 us, total = 286.960 us
Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 6 total (0 active), CPU time: mean = 98.578 us, total = 591.468 us
PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 44.730 us, total = 268.378 us
NodeManagerService.grpc_client.PinObjectIDs - 5 total (0 active), CPU time: mean = 20.112 us, total = 100.562 us
CoreWorkerService.grpc_server.PushTask - 5 total (0 active), CPU time: mean = 60.971 us, total = 304.856 us
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 3 total (0 active), CPU time: mean = 72.790 us, total = 218.371 us
ActorInfoGcsService.grpc_client.GetActorInfo - 2 total (0 active), CPU time: mean = 385.013 us, total = 770.026 us
ActorInfoGcsService.grpc_client.RegisterActor - 2 total (0 active), CPU time: mean = 331.548 us, total = 663.095 us
ActorInfoGcsService.grpc_client.CreateActor - 2 total (0 active), CPU time: mean = 86.492 us, total = 172.983 us
ActorCreator.AsyncRegisterActor - 2 total (0 active), CPU time: mean = 274.019 us, total = 548.038 us
CoreWorkerService.grpc_server.WaitForActorOutOfScope - 2 total (0 active), CPU time: mean = 14.777 us, total = 29.555 us
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 41.465 us, total = 41.465 us
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 11.189 us, total = 11.189 us

.
.
.


Global stats: 872170 total (8 active)
Queueing time: mean = 416.588 us, max = 1.372 s, min = -0.027 s, total = 363.336 s
Execution time:  mean = 63.779 us, total = 55.626 s
Event stats:
	UNKNOWN - 696904 total (5 active, 1 running), CPU time: mean = 23.622 us, total = 16.462 s
	CoreWorker.deadline_timer.flush_profiling_events - 58166 total (1 active), CPU time: mean = 7.395 us, total = 430.154 ms
	NodeManagerService.grpc_client.ReportWorkerBacklog - 58165 total (0 active), CPU time: mean = 11.117 us, total = 646.627 ms
	CoreWorkerService.grpc_server.GetCoreWorkerStats - 57825 total (0 active), CPU time: mean = 106.332 us, total = 6.149 s
	CoreWorkerService.grpc_server.PubsubCommandBatch - 264 total (0 active), CPU time: mean = 56.212 us, total = 14.840 ms
	CoreWorkerService.grpc_server.PubsubLongPolling - 194 total (0 active), CPU time: mean = 16.010 us, total = 3.106 ms
	CoreWorkerService.grpc_server.UpdateObjectLocationBatch - 143 total (0 active), CPU time: mean = 139.546 us, total = 19.955 ms
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 106 total (1 active), CPU time: mean = 317.772 us, total = 33.684 ms
	CoreWorkerService.grpc_server.LocalGC - 97 total (0 active), CPU time: mean = 328.146 ms, total = 31.830 s
	CoreWorkerService.grpc_client.PushTask - 78 total (1 active), CPU time: mean = 222.635 us, total = 17.366 ms
	CoreWorkerDirectActorTaskSubmitter::SubmitTask - 78 total (0 active), CPU time: mean = 147.249 us, total = 11.485 ms
	StatsGcsService.grpc_client.AddProfileData - 70 total (0 active), CPU time: mean = 21.250 us, total = 1.487 ms
	NodeManagerService.grpc_client.PinObjectIDs - 32 total (0 active), CPU time: mean = 32.621 us, total = 1.044 ms
	CoreWorkerService.grpc_server.PushTask - 21 total (0 active), CPU time: mean = 85.503 us, total = 1.796 ms
	Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 6 total (0 active), CPU time: mean = 98.578 us, total = 591.468 us
	PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 44.730 us, total = 268.378 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 3 total (0 active), CPU time: mean = 72.790 us, total = 218.371 us
	ActorInfoGcsService.grpc_client.GetActorInfo - 2 total (0 active), CPU time: mean = 385.013 us, total = 770.026 us
	ActorInfoGcsService.grpc_client.RegisterActor - 2 total (0 active), CPU time: mean = 331.548 us, total = 663.095 us
	ActorInfoGcsService.grpc_client.CreateActor - 2 total (0 active), CPU time: mean = 86.492 us, total = 172.983 us
	ActorCreator.AsyncRegisterActor - 2 total (0 active), CPU time: mean = 274.019 us, total = 548.038 us
	CoreWorkerService.grpc_server.WaitForActorOutOfScope - 2 total (0 active), CPU time: mean = 14.777 us, total = 29.555 us
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 41.465 us, total = 41.465 us
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 11.189 us, total = 11.189 us

I am not able to figure out the actual nature of the problem, though the hint is that it has something to do with memory. I have the logs with me; any suggestions on what I should be looking at from here? Thank you.

The resume seems to be working with ‘LOCAL+ERRORED’! Not sure if this will affect performance negatively, since I don’t know what went wrong in the first place! Will let you know! Thanks

This error looks like the effect of the OOM you mention.
You are not the only one reporting OOMs on PPO btw.
Can you please post your config?
My best guess right now is that this is a malfunction of RLlib.

Cheers

My config looks like this:


    agent_ids = []
    for i in range(NUM_AGENTS):
        agent_ids.append('agent_{}'.format(i+1))

    ## MARL Environment Init
    env_config = {
        'agent_ids' : agent_ids,
        'scenario_config': scenario_config,
    }
    marl_env = marlenvironment.SUMOTestMultiAgentEnv(env_config)

    # Model
    model = deepcopy(MODEL_DEFAULTS)
    model['fcnet_hiddens'] = [512, 512] # 256, 256
    model['fcnet_activation'] = 'tanh'
    model['conv_filters'] = None
    model['conv_activation'] = 'relu'
    model['post_fcnet_hiddens'] = []
    model['post_fcnet_activation'] = 'relu'
    model['free_log_std'] = False
    model['no_final_linear'] = False
    model['vf_share_layers'] = False
    model['use_lstm'] = False
    model['max_seq_len'] = 20
    model['lstm_cell_size'] = 256
    model['lstm_use_prev_action'] = False
    model['lstm_use_prev_reward'] = False
    model['_time_major'] = False


    # Config for PPO
    policies = {}
    policies['my_ppo'] = (PPOTFPolicy,
                           marl_env.get_obs_space('agent_1'),
                           marl_env.get_action_space('agent_1'),
                           {})


    # Algorithm.
    my_ppo_config = PPOConfig()\
        .python_environment()\
        .resources(
            num_gpus=1,
            num_cpus_per_worker=1,
            num_gpus_per_worker=0,
        )\
        .framework(
            framework='tf',
            eager_tracing=False,
        )\
        .environment(
            env='pheromone_env',
            env_config=env_config,
            observation_space=None,
            action_space=None,
            clip_rewards=None,
            normalize_actions=False, # default is True
            clip_actions=False,
            disable_env_checking=True,
        )\
        .rollouts(
            num_rollout_workers = 1,
            num_envs_per_worker = 1,
            # sample_collector = SimpleListCollector,
            # create_env_on_local_worker = False,
            # sample_async = False,
            rollout_fragment_length = 400,
            batch_mode = 'complete_episodes',
            # horizon = None,
            # soft_horizon = False,
            # no_done_at_end = False,
            observation_filter = 'NoFilter',
        )\
        .training(
            gamma=0.99,
            lr=5e-05,
            train_batch_size=4000,
            model=model,
            lr_schedule=None,
            use_critic=True,
            use_gae=True,
            lambda_=1.0,
            kl_coeff=0.2,
            sgd_minibatch_size=128,
            num_sgd_iter=30,
            shuffle_sequences=True,
            vf_loss_coeff=1.0,
            entropy_coeff=0.0,
            entropy_coeff_schedule=None,
            clip_param=0.3,
            vf_clip_param=10,
            grad_clip=None,
            kl_target=0.01,
        )\
        .callbacks(MyCallbacks)\
        .exploration(
            explore=True,
            exploration_config={'type': 'StochasticSampling'}
        )\
        .multi_agent(
            policies = policies,
            policy_map_capacity = 100,
            policy_map_cache = None,
            policy_mapping_fn = lambda agent_id: 'my_ppo',
            policies_to_train = ['my_ppo'],
            observation_fn = None,
            replay_mode = 'independent',
            count_steps_by = 'env_steps',
        )\
        .offline_data(
            # postprocess_inputs=False,
        )\
        .evaluation(
            evaluation_interval = 10,
            evaluation_duration = 10,
            evaluation_duration_unit = 'episodes',
            # evaluation_sample_timeout_s = 180.0,
            evaluation_parallel_to_training = False,
            evaluation_config = {
               'explore': False,
               'exploration_config' : {'type': 'StochasticSampling'}
            },
            evaluation_num_workers = 1,
            # custom_evaluation_function = None
            always_attach_evaluation_results = True,
            # in_evaluation = False,
            # sync_filters_on_rollout_workers_timeout_s = 60.0
            evaluation_sample_timeout_s=3600,
        )\
        .reporting(
            keep_per_episode_custom_metrics = True, # default is False
            metrics_episode_collection_timeout_s = 60.0,
            metrics_num_episodes_for_smoothing = 100,
            min_time_s_per_iteration = 300,
            min_train_timesteps_per_iteration = 0,
            min_sample_timesteps_per_iteration = 0,
        )\
        .debugging(
            log_level='WARN',
            seed=42
        )

    stop = tune.stopper.TrialPlateauStopper(
        metric='episode_reward_mean',
        std=0.01,
        num_results=4,
        grace_period=4,
    )



    result = tune.run(
        'PPO',
        name='ma-pheromone-newdashboard',
        config=my_ppo_config.to_dict(),
        metric='episode_len_mean',
        mode='min',
        trial_name_creator=trial_str_creator,
        keep_checkpoints_num=2,
        checkpoint_score_attr='episode_len_mean',
        stop=stop,
        verbose=0,
        log_to_file=('train.log','error.log'),
        checkpoint_freq=5,
        checkpoint_at_end=True,
        local_dir='~/.../tune-results',   
        resume='LOCAL+ERRORED',
    )

I can’t reproduce this without the particularities of the env, but the config looks fine. We have a test that checks PPO for memory leaks, so it’s unlikely to be that. Other than that, sampling randomly from large gym spaces can cause such memory leaks.

Until there is a full reproduction script, I fear I cannot reproduce this on one of our clusters.

Cheers