I am new to Ray and currently experimenting with it. I have already written some functions similar to what Ray provides and am trying to migrate my framework to Ray. Basically, I would like to create copies of the same environment and run them in parallel, for which Ray seems quite suitable according to the documentation. I managed to initialize Ray and also create one actor per environment. But when I start iterating over the episode, use .remote() to execute a function in the environments, and use ray.get() to retrieve the values, Ray fills up all my RAM and my Linux machine dies. I tried to limit the maximum RAM Ray can use in its config, but it does not seem to work.
So my question, after some days of reading the documentation and debugging, is:
can anybody give a hint as to why the memory is being eaten away? It should be freed after each step, but it seems Ray might not free it (without Ray I had no issues)
am I getting something wrong about the concept of .remote()?
how can I limit Ray's maximum RAM usage? (By RAM I mean host memory accessible to the CPU, not GPU memory; see the sketch after this list for roughly what I tried.)
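For reference, these are roughly the knobs I tried. I am not certain they are the right ones, so treat this as a sketch: object_store_memory caps Ray's shared-memory object store, and the memory option on @ray.remote is, as far as I understand, a scheduling request rather than a hard cap.

    import ray

    # Cap the shared-memory object store at ~2 GB (value is in bytes).
    ray.init(object_store_memory=2 * 1024**3)

    # Request ~500 MB of heap memory per actor (a scheduling hint, not a hard limit).
    @ray.remote(memory=500 * 1024**2)
    class MyEnv:
        ...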
Also, I tried to experiment with the RLlib examples (copy-pasting ones from Ray's documentation pages), but the example I tried does not work and raises an error. Can anybody recommend a working example to experiment with Ray/RLlib?
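For context, this is the kind of minimal RLlib example I was hoping to get running (the import path and config methods here assume Ray 2.x; older versions use a different API):

    import ray
    from ray.rllib.algorithms.ppo import PPOConfig

    ray.init()

    # Minimal PPO setup on a built-in Gym environment.
    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=2)
    )
    algo = config.build()

    for i in range(3):
        result = algo.train()
        print(i, result["episode_reward_mean"])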
I've checked it, but in the meantime I narrowed the issue down and I think I have found its root cause.
It is the following (I have not found it described in any of the documented antipatterns):
I create 20 environments which I would like to run in parallel with Ray. Each env is an instance of my custom env class.
This environment contains a function which is called regularly and returns PyTorch tensors.
If I add .clone() to the tensor returned by that function, Ray does not mess up my RAM. If I do not add the .clone(), Ray causes RAM issues (to the point that my Linux machine stops working and I need to restart). So it looks something like this:
    import ray

    @ray.remote
    class MyEnv:
        def __init__(self):
            self.my_data_handler_instance = handler_instance()

        def get_item(self, idx: int):
            data_on_cpu = self.my_data_handler_instance.get_the_data(idx)
            return data_on_cpu            # ==> causes the RAM issue
            # return data_on_cpu.clone()  # ==> works fine
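My current hypothesis (an assumption on my side, not something I have verified in Ray's internals): get_the_data() returns a view into a much larger tensor. Pickling a PyTorch view serializes its entire backing storage, so every remote call ships and keeps alive the big buffer, whereas .clone() copies just the slice into its own compact storage. This is easy to check outside of Ray:

    import pickle
    import torch

    big = torch.zeros(1_000_000)  # ~4 MB of float32
    view = big[:10]               # tiny view into the big buffer

    print(len(pickle.dumps(view)))          # roughly 4 MB: the whole storage is serialized
    print(len(pickle.dumps(view.clone())))  # a few hundred bytes: only the 10 elements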
And this is the example code I use to call the function on the environment actors:
ray.get([envs[idx].get_item.remote(idx) for idx in range(nr_of_envs)])
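For completeness, a minimal sketch of the driving loop around that call (handler_instance, nr_of_envs, and episode_length are placeholders from my setup):

    import ray

    ray.init()
    nr_of_envs = 20
    envs = [MyEnv.remote() for _ in range(nr_of_envs)]

    for step in range(episode_length):
        # One remote call per environment, then block until all results arrive.
        results = ray.get([envs[idx].get_item.remote(idx) for idx in range(nr_of_envs)])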
Hi! I will have a look. It is not 100% clear to me what you mean by the output of ray memory. Is it the overview in the dashboard where the memory consumption is shown, or something else? If you could add a screenshot and highlight what you are interested in, I could show it to you.