f.remote() calls taking a while to return

Dmitri · May 8, 2023, 6:59pm

How severely does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

We’re experiencing a situation in which calls to schedule tasks, f.remote(), take a very long time to return. These calls are supposed to be basically async and return quickly.
This happens specifically when we’re submitting many tasks (tens of thousands).

This is bad for compute utilization.

Qs:
How do calls to f.remote() work?
How can one diagnose why these calls are taking so long to return?
Are there any good patterns or best practices for submitting many tasks?

Huaiwei_Sun · May 10, 2023, 10:40pm

Try the timeline feature to profile the tasks? See: Ray Dashboard (Ray 2.4.0 docs).
You could also use the state API to get a task with --detail and look at its task events. See: ray.experimental.state.common.TaskState (Ray 2.4.0 docs).

Dmitri · May 10, 2023, 11:03pm

Thanks for the tip, Huaiwei!
We can probably start by taking a look at ray.timeline().
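
One way to start, sketched below under some assumptions: time each submission call on the driver, then dump a trace with ray.timeline(). The helper time_submissions and the wrapper profile_with_ray are hypothetical names, not part of Ray; the task f is a placeholder, and the trace path is arbitrary. Ray is imported lazily so the timing helper can be tried on any callable.

```python
import time
from typing import Any, Callable, Iterable, List, Tuple


def time_submissions(
    submit: Callable[[Any], Any], args: Iterable[Any]
) -> Tuple[List[Any], List[float]]:
    """Call submit(arg) for each arg; return the results and per-call wall-clock seconds."""
    results: List[Any] = []
    durations: List[float] = []
    for arg in args:
        start = time.perf_counter()
        result = submit(arg)  # with Ray, submit would be f.remote
        durations.append(time.perf_counter() - start)
        results.append(result)
    return results, durations


def profile_with_ray(num_tasks: int = 1000, trace_path: str = "/tmp/ray_timeline.json") -> None:
    """Hypothetical driver: needs Ray installed; not called automatically."""
    import ray  # lazy import so time_submissions works without Ray

    ray.init()

    @ray.remote
    def f(x):  # placeholder task
        return x

    refs, durations = time_submissions(f.remote, range(num_tasks))
    ray.get(refs)  # wait for the tasks so they show up in the trace
    print(f"slowest .remote() call: {max(durations):.6f}s")
    print(f"total submission time: {sum(durations):.6f}s")
    ray.timeline(filename=trace_path)  # Chrome tracing JSON, viewable in chrome://tracing
    ray.shutdown()
```

If the per-call durations grow with argument size rather than with the number of pending tasks, that would point toward serialization rather than scheduling.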

cade · May 16, 2023, 8:59pm

Note that object serialization is currently a synchronous operation, which f.remote() does implicitly for arguments. Not sure if that’s the cause, curious to hear what y’all find…

Dmitri · May 17, 2023, 4:23am

Thanks Cade! This possibility was in the back of my head too. Will point this out to the team running into the issue.

Jules_Damji · May 17, 2023, 4:32pm

Right, so if any args sent to f.remote() are large, they will have to be serialized on every call, which could add up when tasks number in the hundreds, unless you only send object_refs (which in turn will have to be deserialized from the object store).
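
A rough sketch of both ideas follows. serialization_cost uses the standard-library pickle as a proxy for how expensive an argument is to serialize; Ray uses its own serializer, so treat the numbers as indicative only. submit_with_shared_ref shows the ray.put() pattern: store the large object once and pass the reference to each task. Both function names and the task f are made up for illustration.

```python
import pickle
import time
from typing import Any, List, Tuple


def serialization_cost(obj: Any) -> Tuple[int, float]:
    """Return (pickled size in bytes, seconds spent pickling) as a rough proxy."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(payload), time.perf_counter() - start


def submit_with_shared_ref(big_arg: Any, num_tasks: int = 100) -> List[int]:
    """Hypothetical driver: needs Ray installed; not called automatically."""
    import ray  # lazy import so serialization_cost works without Ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def f(data, i):  # placeholder task; receives the resolved value, not the ref
        return len(data) + i

    big_ref = ray.put(big_arg)  # put the large object in the object store once
    refs = [f.remote(big_ref, i) for i in range(num_tasks)]  # only the ref is passed
    return ray.get(refs)
```

Running serialization_cost on a typical argument before submitting tens of thousands of tasks gives a quick estimate of how much driver time would go into serialization if the value were passed directly each time.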

Dmitri · May 18, 2023, 4:00am

I imagine you could probably also end up serializing stuff you accidentally closed over (in addition to explicit arguments).
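
To check for that, here is a small sketch that lists what a plain Python function captures from enclosing scopes and roughly how large each captured value is when pickled. captured_sizes, make_task, and make_lean_task are hypothetical names; standard-library pickle is again only a proxy for Ray's serializer, and exactly when Ray serializes the function is not something this thread establishes.

```python
import inspect
import pickle
from typing import Callable, Dict


def captured_sizes(func: Callable) -> Dict[str, int]:
    """Map each captured global/nonlocal name to its pickled size in bytes (-1 if unpicklable)."""
    closure = inspect.getclosurevars(func)
    captured = {**closure.globals, **closure.nonlocals}
    sizes: Dict[str, int] = {}
    for name, value in captured.items():
        try:
            sizes[name] = len(pickle.dumps(value))
        except Exception:
            sizes[name] = -1  # e.g. modules or other objects stdlib pickle rejects
    return sizes


def make_task(big_table):
    # The inner function closes over big_table, so it travels with the function.
    def task(key):
        return big_table[key]

    return task


def make_lean_task():
    # The table is an explicit argument instead, so nothing is captured.
    def task(table, key):
        return table[key]

    return task
```

Pass the undecorated function to captured_sizes, before wrapping it with ray.remote. A large entry in the result suggests moving that value into an explicit argument, ideally an object ref created with ray.put().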
