I tried to use Ray to parallelize a function named `get_reward` and got the following error:
The remote function main.get_reward is too large (521 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
The first question is: how can I solve this?
The second question is: in a Jupyter notebook, the first time I run the code cell I get the error, but if I run the same cell again, there is no error. What is happening? I found that even though the second run lets `get_reward.remote()` execute, I can't get the result with `ray.get()`.
When you first decorate your function with `ray.remote`, Ray serializes the function definition and exports it to its storage (so that other workers can import and use it). At this point, if your function definition is too big, the error occurs.
I think the reason it doesn't occur the second time is that the definition has already been exported.
It usually happens if your application implicitly "captures" some objects. For example:

```python
obj = <big_obj>

@ray.remote
def f():
    return obj
```

In this case, the big object `obj` is embedded into the remote function, and when the function is serialized, the whole value is included. You can avoid this by passing `obj` as an argument instead of capturing it implicitly.
```python
actions  # a list, contains about 2000k elements
metric   # a class instance with some methods

@ray.remote
def get_reward(action):
    a = metric.method1(action)
    b = metric.method2(a)
    c = metric.method3(b)
    return c

rewards = ray.get([get_reward.remote(action) for action in actions])
```

This raises:

The remote function main.get_reward is too large (521 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB).
And I also tried something like this:

```python
actions  # a list, contains about 2000k elements
metric   # a class instance with some methods

@ray.remote
def get_reward(action, metric):
    a = metric.method1(action)
    b = metric.method2(a)
    c = metric.method3(b)
    return c

metric_ = ray.put(metric)
rewards = ray.get([get_reward.remote(action, metric_) for action in actions])
```
I found it a little faster than a plain for loop, but I don't know if it is the right solution to the first code block (I also found that only a small fraction of each CPU core is used).
From the second code, do you still get the message `The remote function main.get_reward is too large (521 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB)`?
Also, about the CPU usage: it is probably because your `get_reward` method is too short or doesn't use enough CPU. I recommend taking a look at Tips for first-time users — Ray 3.0.0.dev0.
You can measure the remote task time with:

```python
import time

@ray.remote
def get_reward(action, metric):
    start = time.time()
    a = metric.method1(action)
    b = metric.method2(a)
    c = metric.method3(b)
    print(time.time() - start)
    return c
```
If your `metric` methods are IO-heavy, it is possible that the remote task is not using enough CPU. In that case, you can request less CPU for your task.
I don't get the error message with the second code; however, I'm still curious whether the second code is the optimal solution to the error message.
The reason I mention the CPU usage is: when I use the multiprocessing module (its map function) to do the same job, I find the code fully utilizes the CPU.
How long does `get_reward` normally run? I think we need to understand whether the bottleneck is serializing/deserializing the `metric_` object or not. If that's the bottleneck, you can use actors instead.
It is hard to answer why multiprocessing utilizes the CPU better without seeing the actual code. Do you know how many processes your multiprocessing code starts?