I tried to use Ray to parallelize a function named `get_reward` and got the following error:
The remote function main.get_reward is too large (521 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
The first question is: how can I solve this?
The second question is: in a Jupyter notebook, the first time I run the code cell I get the error, but if I run the same cell again, there is no error. What is happening? I found that even though the second run lets `get_reward.remote()` execute, I can't get the result with `ray.get()`.
When you first decorate your function with `ray.remote`, Ray serializes the function definition and exports it to its storage (so that other workers can import and use it). At this point, if your function definition is too big, the error occurs.
I think the reason it doesn't occur the second time is that the definition has already been exported.
It usually happens if your application implicitly "captures" some objects. For example:

```python
obj = <big_obj>

@ray.remote
def f():
    return obj
```

In this case, the big object `obj` is embedded into the remote function, and when the function is serialized, the whole value is included. You can avoid this by passing `obj` as an argument instead of capturing it implicitly.
```python
actions  # a list, contains about 2000k elements
metric   # a class instance with some methods

@ray.remote
def get_reward(action):
    a = metric.method1(action)
    b = metric.method2(a)
    c = metric.method3(b)
    return c

rewards = ray.get([get_reward.remote(action) for action in actions])
```

This raises:

The remote function main.get_reward is too large (521 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB).
And I also tried something like this:

```python
actions  # a list, contains about 2000k elements
metric   # a class instance with some methods

@ray.remote
def get_reward(action, metric):
    a = metric.method1(action)
    b = metric.method2(a)
    c = metric.method3(b)
    return c

metric_ = ray.put(metric)
rewards = ray.get([get_reward.remote(action, metric_) for action in actions])
```
I found it a little faster than a plain for loop, but I don't know if it is the right solution to the first code block (I also found that only a small fraction of each CPU core is used).
From the second code, do you still get the message `The remote function main.get_reward is too large (521 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB)`?
Also, about the CPU usage: it is probably because your `get_reward` method is too short or doesn't use enough CPU. I recommend taking a look at Tips for first-time users — Ray 3.0.0.dev0.
You can measure the remote task time with:

```python
import time

@ray.remote
def get_reward(action, metric):
    start = time.time()
    a = metric.method1(action)
    b = metric.method2(a)
    c = metric.method3(b)
    print(time.time() - start)
    return c
```
If your `metric` methods are IO-heavy, it is possible that the remote task is not using enough CPU. In that case, you can request less CPU for your task.
I don't get the error message with the second code; however, I'm still curious whether the second code is the optimal solution to the error message.
The reason I mention the CPU usage is: when I use the multiprocessing module (its map function) to do the same job, I find the code fully utilizes the CPU.
How long does `get_reward` normally run? I think we need to understand whether the bottleneck is serializing/deserializing the `metric_` object or not. If that's the bottleneck, you can use actors instead.
It is hard to answer why multiprocessing utilizes the CPU better without seeing the actual code. Do you know how many processes your multiprocessing code starts?