Cannot identify which ObjectRef causes a memory leak and results in large object store spills

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello. I’m having trouble identifying a memory leak that arises when training an RL agent with RLlib in my custom Gymnasium environment. I can provide specific details on request, but I cannot share the full environment definition due to an NDA. Unfortunately, I have not been able to isolate what in the environment is causing the leak. The leak causes large amounts of data to be spilled from the object store, eventually filling up the storage drive completely. Note that I’m using Ray 2.4.0 because I need the older version.

I made sure to set RAY_record_ref_creation_sites = 1 in my environment variables before calling ray.init(). This is the output of ray summary objects just a few seconds into training:

UserWarning: The returned data may contain incomplete result. 676 (680 total from the cluster) objects are retrieved from the data source. 4 entries have been truncated. Max of 676 entries are retrieved from data source to prevent over-sized payloads.
  warnings.warn(

======== Object Summary: 2024-05-29 17:27:41.198415 ========
Stats:
------------------------------------
callsite_enabled: true
total_objects: 676
total_size_mb: 3378.275230407715


Table (group by callsite)
------------------------------------
C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169
| C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171
| C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184
| C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419
    REF_TYPE_COUNTS            TASK_STATE_COUNTS        TOTAL_NUM_NODES    TOTAL_NUM_WORKERS    TOTAL_OBJECTS    TOTAL_SIZE_MB
--  -------------------------  -----------------------  -----------------  -------------------  ---------------  ---------------
0   LOCAL_REFERENCE: 12        FINISHED: 652            1                  1                    664              3357.68
    USED_BY_PENDING_TASK: 652  SUBMITTED_TO_WORKER: 12



C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\_private\worker.py:main_loop:844
| C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\_private\workers\default_worker.py:<module>:258
    REF_TYPE_COUNTS      TASK_STATE_COUNTS    TOTAL_NUM_NODES    TOTAL_NUM_WORKERS    TOTAL_OBJECTS    TOTAL_SIZE_MB
--  -------------------  -------------------  -----------------  -------------------  ---------------  ---------------
0   ACTOR_HANDLE: 4      '-': 8               1                  4                    8                20.5992
    PINNED_IN_MEMORY: 4



C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:968
| C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_invocation_actor_class_remote_span:381
| C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:remote:529
| C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\rllib\evaluation\worker_set.py:_make_worker:967
    REF_TYPE_COUNTS    TASK_STATE_COUNTS    TOTAL_NUM_NODES    TOTAL_NUM_WORKERS    TOTAL_OBJECTS    TOTAL_SIZE_MB
--  -----------------  -------------------  -----------------  -------------------  ---------------  ---------------
0   ACTOR_HANDLE: 4    FINISHED: 4          1                  1                    4                0

Note that the object store is already over 3 GB in size just a few seconds in. The output also seems to confirm that RAY_record_ref_creation_sites = 1 is correctly enabled.
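
For anyone following along, here is a minimal sketch of one way to set the flag from Python before ray.init() is called, which is convenient on Windows where a shell `export` is not available. The helper names `enable_ref_creation_sites` and `init_ray_with_callsites` are made up for illustration; only the variable name and ray.init() come from this thread. The value is written as the plain string "1".

```python
import os

# Name of the flag discussed in this thread.
CALLSITE_ENV_VAR = "RAY_record_ref_creation_sites"


def enable_ref_creation_sites(env=None):
    """Set the flag to "1" in `env` (os.environ by default) and return the mapping.

    Hypothetical helper: it only writes the environment variable.
    """
    env = os.environ if env is None else env
    env[CALLSITE_ENV_VAR] = "1"
    return env


def init_ray_with_callsites():
    """Set the flag first, then start Ray from the same process."""
    enable_ref_creation_sites()
    # Imported here so the variable is already set when Ray is loaded;
    # the thread only states that it must be set before ray.init().
    import ray

    return ray.init()
```

Calling `init_ray_with_callsites()` in place of a bare ray.init() at the top of the training script should have the same effect as exporting the variable in the shell, assuming the script itself starts the local Ray instance.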

The interesting part is that the creation site of the ObjectRef seems to be… the tracing function itself? As a test, if I go into tracing_helper.py and remove the not _is_tracing_enabled() check at line 416 in the _start_span method, I get a 'NoneType' object has no attribute 'get_tracer' error at line 427, even though the RAY_record_ref_creation_sites = 1 flag is still enabled.

After a bit more testing, I noticed that the same files are implicated as the creation sites even when I use RandomEnv, so I guess that’s the intended behavior? But if so, how am I supposed to debug where in my env the leak occurs?

Hello,

I’m currently thinking of a useful way to debug this. In the meantime, do you have any sense as to whether the issue is a bug with RLlib specifically or with the code you have written?

Jack

Thank you. I think in a way, it’s a mixture of both, though probably mostly my code. I cannot replicate the issue when using the RandomEnv in ray.rllib.examples.env.random_env, only when using my custom env. On the other hand, if I run my env outside of Ray and just manually step through it with random actions, I see no obvious memory leaks.

Makes sense. I’m consulting with others to see if there is a good way to debug this without digging into Ray Core itself (which of course we don’t want to do since it would require compilation, etc.). In the meantime, are you able to inspect the contents of the data dumped to disk? Perhaps one thing we could do is print a prefix of each object from your Python code, and then try to match the prefix of the spilled object to one of the printed prefixes.
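
A rough sketch of the matching idea, assuming the values of interest end up as raw bytes somewhere inside the spill files (plausible for NumPy array buffers, but an assumption rather than something confirmed here). It does not try to parse the spill file format; it only searches each file for byte sequences you supply. The function names and the `spill_dir` argument are illustrative.

```python
import mmap
from pathlib import Path


def find_in_file(path, needle):
    """Return the byte offset of the first occurrence of `needle` in `path`, or -1."""
    path = Path(path)
    # mmap cannot map an empty file, and an empty needle would match trivially.
    if not needle or path.stat().st_size == 0:
        return -1
    with path.open("rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)


def match_spill_files(spill_dir, needles):
    """Map each file name in `spill_dir` to the labels of the needles found in it.

    `needles` maps a label of your choice to the bytes to search for.
    Files that contain none of the needles are left out of the result.
    """
    hits = {}
    for path in sorted(Path(spill_dir).iterdir()):
        if not path.is_file():
            continue
        found = [
            label
            for label, needle in needles.items()
            if find_in_file(path, needle) >= 0
        ]
        if found:
            hits[path.name] = found
    return hits
```

The needles could be, for example, the first few dozen bytes of `some_array.tobytes()` recorded inside the environment's step() method, keyed by a label that says which attribute they came from.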

I thought of opening the spill files, but I couldn’t figure out how. I assumed they would just be pickled using Ray’s cloudpickle module, as mentioned in the documentation, but when I try to load them with that module I get a “pickle invalid load key \x00” error. As for printing the identifiers of the objects in my code and comparing them to the names of the spill files: can you give me a hint on how to print those? The Python id() function returns an int, whereas the names of the spill files look like UUIDs.

Actually, before we do that, can you please try the following?

  1. Set the RAY_record_ref_creation_sites environment variable to 1.
    export RAY_record_ref_creation_sites=1
  2. Start the workload in the same shell that has the RAY_record_ref_creation_sites environment variable set to 1.
  3. Run ray memory to see the existing ObjectRefs and the location where each was created.

Specifically, please run ray memory instead of ray summary objects, and paste the output into a reply. I included the full steps above to help others who may run into a similar issue.

Thanks!

Sure, here it is, generated a couple of minutes into training - note that I had to delete MANY lines in order to fit within the reply character limit, and many more were probably cut off by the terminal output limit, but all the entries I deleted looked identical to the others, save for the Object Ref:

127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000f6070000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000b5050000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000005f070000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000087070000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000750a0000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000031020000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000c9040000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000028060000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000036030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000009b030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000080a0000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000095070000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000cb000000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000009e020000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000f2030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000040010000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000060040000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000038030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000064030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000068070000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000df070000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000aa060000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000090010000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000fc080000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000008f090000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000007d030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000aa020000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000041000000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000d0030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000005e020000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000c4030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000001b030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000077040000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000001d030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000002a030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000dc050000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000f7000000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000e9090000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000cd060000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000083040000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000068010000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000015080000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000009f070000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000a8060000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000fc070000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000007d080000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000d1050000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000d6020000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000009c080000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000003a030000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000025090000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff01000000350b0000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff010000006d0a0000


127.0.0.1     | 14112    | Driver  | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_actor_method_call:1169 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:invocation:171 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\actor.py:_remote:184 | C:\Users\Patrik\miniconda3\envs\jelly-rl\lib\site-packages\ray\util\tracing\tracing_helper.py:_start_span:419 | FINISHED  | 5439870.0 B | USED_BY_PENDING_TASK | 00ffffffffffffffffffffffffffffffffffffff0100000060010000

To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 7299 MiB, 1407 objects, 86.28% full, 0.61% needed
Spilled 13244 MiB, 2553 objects, avg write throughput 110 MiB/s
Objects consumed by Ray tasks: 20 MiB.
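
As a quick sanity check on the figures above, the per-object size in the listing can be compared with the average object size implied by the aggregate plasma and spill totals. This is only an arithmetic sketch using numbers copied from the output. The variable names are made up for illustration, and it says nothing about what the objects contain.

```python
# Figures copied from the `ray memory` output above.
MIB = 1024 * 1024

row_size_bytes = 5439870.0                   # size reported for every listed object
plasma_mib, plasma_objects = 7299, 1407      # "Plasma memory usage" line
spilled_mib, spilled_objects = 13244, 2553   # "Spilled" line

row_size_mib = row_size_bytes / MIB
avg_plasma_mib = plasma_mib / plasma_objects
avg_spilled_mib = spilled_mib / spilled_objects

print(f"listed object size      : {row_size_mib:.2f} MiB")
print(f"average in plasma       : {avg_plasma_mib:.2f} MiB")
print(f"average among spilled   : {avg_spilled_mib:.2f} MiB")
```

All three values come out at roughly 5.19 MiB, which suggests that the in-memory and the spilled objects are the same kind of fixed-size object as the ones listed in the rows.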

I believe the env variable was set properly, as it is both returned by Python in os.environ and by the echo command:

...>echo %RAY_record_ref_creation_sites%
1
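
Because the flag has to be visible to processes started after it is set, one more check is whether a child process launched from the same Python session sees it. The sketch below uses only the standard library and a throwaway child interpreter. It assumes, without having verified it for this setup, that the processes Ray starts from ray.init() inherit the driver's environment in the same way.

```python
import os
import subprocess
import sys

# Set the flag in this process before anything Ray-related is started.
os.environ["RAY_record_ref_creation_sites"] = "1"

# Launch a child interpreter and ask it what value it sees.
child_code = (
    "import os; "
    "print(os.environ.get('RAY_record_ref_creation_sites', ''))"
)
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print("child process sees:", repr(child_value))
```

If the child prints an empty string, the variable was set somewhere that is not propagated to subprocesses.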

Hi Jack. Got any more suggestions?
I spent the last couple of days trying to cut down on anything I could think of that might cause my environment to create large objects, particularly any dictionaries being passed between the environment methods. I also started using Ray put/get for any objects which remain unchanged since environment initialization and which are the same for every rollout worker. Unfortunately, it seems to have had zero effect.

Do you think it would be worthwhile to test if I experience the same when running on Linux instead of Windows (though I already know I experienced the same on WSL)? Or is this amount of spillage expected given my configuration, and I should just invest in a few TBs of storage for this? Maybe it would help you if I disclosed some information about the RLlib configs I use?
I’m using the A3C agent (though I experience the same with DQN) with the TF2 framework, train_batch_size of 32, "complete_episodes" as the rollout batch_mode, sample_async=True, count_steps_by="env_steps" in the multi_agent config, and num_rollout_workers=4; num_envs_per_worker is also 4 (and setting it to 0 has a negligible effect). As part of environment initialization, I open an .exe using subprocess.Popen - this application is what the agent is actually stepping through. When the agent chooses an action, it is sent to the application over localhost using the requests library (I open a Session for this as part of the environment’s initialization as well), and the application returns, among other things, a cost value used to compute the agent’s reward. This process runs smoothly: I don’t experience any waiting around for something to finish, and the environment steps without slowdowns, except when the in-memory object store fills up and objects have to be loaded from disk. I also saw the same large spillage when I launched these subprocesses outside of the environment and only gave the environment the port number over which to send the requests.

I may have managed to find a workaround to my issue - it seems that I do not run into this memory leak when using Ray Tune to train (without actually tuning any HPs), rather than the “traditional” manual calling of the .train() method of a built algorithm config for n episodes. I’ll check to see if this has in fact “resolved” the issue when I’m back in the office tomorrow, I’m letting things run overnight for now.
UPDATE: After letting things run overnight, I can say that the process is using about a third of the memory it was using with the manual .train() calls, and there’s no spillage whatsoever - ray summary objects reports only about 20 MB worth of objects, which ray memory corroborates. However, things seem to eventually hang in some way - Tuner is still running and sending periodic status updates, but the environment is not being stepped at all anymore. I did not experience this with the .train()-based API; that one eventually just crashed after running out of storage space for the spill files. Unfortunately, I have not been able to get training to continue with the Tune-based approach.

Some more updates from the past two days:

  • Contrary to what I said in a previous reply, it seems that I do NOT encounter this issue when using the DQN algorithm; it seems to happen only when using A3C. That makes me suspect that the issue is not in my environment, but is rather a bug somewhere in RLlib.
  • I am still unable to unpickle the spill files themselves (if you have any more suggestions on that, I’m all ears, but I suspect the reason might be that I never actually saw the training through, so I’m trying to open spill files that were left behind when I interrupted Ray with CTRL+C). Instead, I tried to do a ray.get on one of the objects while things were running. I took the object ID of one of the objects returned by ray memory, opened a separate IPython instance, used ray.init to connect to the running cluster, and then called ray.get(ray.ObjectID(bytes.fromhex("<object ID here>"))) to see what the object is. The call was still blocked hours later, which makes some sense - when I terminated the running cluster with CTRL+C, the error messages from the .get() call suggested it was waiting for the result of ray.wait(), though I couldn’t tell what it was waiting for.
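
Before handing a pasted ID to bytes.fromhex, it can help to confirm that the string decodes cleanly. The sketch below uses only the standard library and the first object ID from the ray memory listing earlier in the thread. The expected length of 28 bytes is simply the length of the IDs shown there, not a constant taken from the Ray documentation, and parse_object_id is a hypothetical helper. Reading the trailing four bytes as a little-endian counter is also an assumption. They are just the only part of the ID that differs between the listed rows.

```python
# Length of the IDs in the listing (56 hex characters); treated as an assumption.
EXPECTED_LEN = 28


def parse_object_id(hex_id: str) -> bytes:
    """Decode a hex object ID and check that it has the expected length."""
    raw = bytes.fromhex(hex_id.strip())
    if len(raw) != EXPECTED_LEN:
        raise ValueError(f"expected {EXPECTED_LEN} bytes, got {len(raw)}")
    return raw


# First object ID from the `ray memory` rows above.
object_id_hex = "00ffffffffffffffffffffffffffffffffffffff01000000fc070000"
raw = parse_object_id(object_id_hex)

# Only the last four bytes differ between the listed rows.
trailing = int.from_bytes(raw[-4:], "little")
print(len(raw), "bytes, trailing value:", trailing)
```

A truncated or padded copy of the ID fails here with a clear error before it reaches any call against the cluster.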