f.remote() calls taking a while to return

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It makes completing my task significantly harder, but I can work around it.
  • High: It blocks me from completing my task.

Medium

We’re experiencing a situation in which calls to schedule tasks, f.remote(), take a very long time to return. These calls are supposed to be basically async and return quickly.
This happens specifically when we’re submitting many tasks (tens of thousands).

This is bad for compute utilization.

Qs:
How do calls to f.remote() work?
How can one diagnose why these calls are taking so long to return?
Are there any good patterns or best practices for submitting many tasks?

  1. Try the timeline feature and profile the tasks?
    Ray Dashboard — Ray 2.4.0

  2. Use the State API to get a task with --detail and see its task events? ray.experimental.state.common.TaskState — Ray 2.4.0


Thanks for the tip, Huaiwei!
We can probably start by taking a look at ray.timeline().
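Before digging into the timeline, a rough way to confirm that the time really goes into the submission calls is to time each one directly. This is only a sketch: `time_submissions` and `demo_with_ray` are made-up helper names, the trace filename is arbitrary, and the Ray part assumes a plain local `ray.init()`.

```python
import time


def time_submissions(submit, items):
    """Call submit(item) for each item and record how long each call takes to return.

    `submit` is any callable; with Ray it would be something like `f.remote`.
    Returns (results, latencies_in_seconds).
    """
    results, latencies = [], []
    for item in items:
        start = time.perf_counter()
        results.append(submit(item))
        latencies.append(time.perf_counter() - start)
    return results, latencies


def demo_with_ray(n_tasks=10_000):
    """Hypothetical usage with Ray; not called on import."""
    import ray  # imported lazily so the helper above works without Ray installed

    ray.init()

    @ray.remote
    def f(x):
        return x + 1

    refs, latencies = time_submissions(f.remote, range(n_tasks))
    print(f"slowest f.remote() call: {max(latencies):.4f}s")
    print(f"total submission time:   {sum(latencies):.4f}s")

    ray.get(refs)
    # Dump a Chrome-tracing file that can be opened in chrome://tracing.
    ray.timeline(filename="timeline.json")
    ray.shutdown()
```

If the per-call latencies are small but the total is large, the cost is probably just the number of calls. A few very slow calls point more toward something like large arguments.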

Note that object serialization is currently a synchronous operation, which f.remote() does implicitly for arguments. Not sure if that’s the cause, curious to hear what y’all find…
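To get a feel for how much time serialization alone could add per call, something like the following may help. It is a rough sketch, not Ray's actual code path: plain `pickle` is used as a stand-in for Ray's serializer, and `measure_serialization` is a hypothetical helper, so treat the numbers as a ballpark only.

```python
import pickle
import time


def measure_serialization(obj):
    """Return (seconds, size_in_bytes) for pickling obj once.

    Assumption: plain pickle is only a stand-in here; Ray uses its own
    serializer, so treat the numbers as a rough estimate.
    """
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return time.perf_counter() - start, len(payload)


small = list(range(10))
large = list(range(1_000_000))

for name, arg in [("small", small), ("large", large)]:
    seconds, size = measure_serialization(arg)
    print(f"{name}: {size} bytes, {seconds:.4f}s to serialize")
```

Multiplying the per-argument time by the number of tasks gives a crude estimate of how much of the submission time serialization could explain.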

Thanks Cade! This possibility was in the back of my head too. Will point this out to the team running into the issue.

Right, so if any args sent to f.remote() are large, they will have to be serialized on every call, which could add up across hundreds of tasks, unless you only send object_refs (which will in turn have to be deserialized from the object store).
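A minimal sketch of the "send object_refs instead" idea: store the large object once and hand every task the same reference. The helper names `submit_with_shared_ref` and `demo_with_ray` are made up for illustration, and `put` / `submit` are passed in as plain callables (with Ray they would be `ray.put` and `f.remote`) so the pattern is visible on its own.

```python
def submit_with_shared_ref(put, submit, payload, n_tasks):
    """Store payload once, then pass only the returned reference to each task.

    `put` and `submit` are injected so the pattern is visible without Ray;
    with Ray they would be `ray.put` and `f.remote`.
    """
    payload_ref = put(payload)  # serialized a single time
    return [submit(payload_ref, i) for i in range(n_tasks)]


def demo_with_ray(n_tasks=1_000):
    """Hypothetical usage with Ray; not called on import."""
    import ray  # imported lazily so the helper above works without Ray installed

    ray.init()

    @ray.remote
    def f(data, i):
        # Ray resolves the ObjectRef argument, so `data` is the real object here.
        return len(data) + i

    big = bytes(50_000_000)
    refs = submit_with_shared_ref(ray.put, f.remote, big, n_tasks)
    print(sum(ray.get(refs)))
    ray.shutdown()
```

Passing `big` directly to every `f.remote(big, i)` call would instead serialize it once per call, which is the cost being discussed here.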

I imagine you could also end up serializing stuff you accidentally closed over (in addition to explicit arguments).
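One way to check for accidental captures is to inspect the plain Python function before it gets decorated with `@ray.remote`. This is a sketch using only the standard library: `captured_sizes` and `make_task` are hypothetical names, and the sizes come from plain `pickle`, not Ray's serializer, so they are approximate.

```python
import inspect
import pickle


def captured_sizes(fn):
    """Map each non-local/global name used by fn to its pickled size in bytes.

    Values that plain pickle cannot handle are reported as None.
    """
    closure = inspect.getclosurevars(fn)
    sizes = {}
    for name, value in {**closure.globals, **closure.nonlocals}.items():
        try:
            sizes[name] = len(pickle.dumps(value))
        except Exception:
            sizes[name] = None
    return sizes


def make_task():
    lookup_table = bytes(5_000_000)  # large object captured by the closure below

    def task(x):
        return x + len(lookup_table)

    return task


print(captured_sizes(make_task()))
```

Any name that shows up here with a large size is a candidate for being passed explicitly (ideally as an object ref) instead of being closed over.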
