Launching subprocesses inside a ray.remote task is 10x slower

Launching subprocesses with asyncio.create_subprocess_exec inside a ray.remote task is about 10x slower than launching them from a plain Python process. For example:

import asyncio
import ray
from tqdm.asyncio import tqdm_asyncio


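async def arun():
    # Launch 3000 short-lived subprocesses concurrently.
    # Each awaitable completes once its process has started, not once it has exited.
    await tqdm_asyncio.gather(*[
        asyncio.create_subprocess_exec("bash", "-c", "sleep 0.0001")
        for _ in range(3000)
    ])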


@ray.remote
def run():
    asyncio.run(arun())


ray.init()

asyncio.run(arun())  # 3s, 1k it/s
ray.get(run.remote())  # 30s, 100 it/s (10x slower)
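
To narrow down where the time goes, here is a minimal sketch of a variant benchmark that also waits for each process to exit and caps how many children are alive at once. The names `spawn_and_wait`, `measure`, and `measure_sync`, and the default values of `n` and `concurrency`, are made up for illustration; only the `bash -c "sleep 0.0001"` command comes from the snippet above. It does not explain the slowdown, it only gives a number that can be compared between the two environments.

```python
import asyncio
import time


async def spawn_and_wait(sem: asyncio.Semaphore, cmd: tuple) -> int:
    # Hold a semaphore slot for the whole lifetime of the child process,
    # so at most `concurrency` children exist at any moment.
    async with sem:
        proc = await asyncio.create_subprocess_exec(*cmd)
        # Wait for the child to exit and return its exit code.
        return await proc.wait()


async def measure(n: int, concurrency: int, cmd: tuple):
    sem = asyncio.Semaphore(concurrency)
    start = time.perf_counter()
    codes = await asyncio.gather(*[spawn_and_wait(sem, cmd) for _ in range(n)])
    elapsed = time.perf_counter() - start
    # Completed processes per second, plus the exit codes for sanity checking.
    return n / elapsed, list(codes)


def measure_sync(n: int = 200, concurrency: int = 16,
                 cmd: tuple = ("bash", "-c", "sleep 0.0001")):
    # Synchronous wrapper so the same function can be called from plain
    # Python or from inside a synchronous task body.
    return asyncio.run(measure(n, concurrency, cmd))
```

Calling `measure_sync()` directly and then from inside the `run` task above (in place of `asyncio.run(arun())`) gives two rates to compare. If the gap stays roughly the same with a small `concurrency`, the cost is probably in starting each process rather than in having thousands of children alive at once, but that is an assumption to check, not a conclusion.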

It’s even worse when the subprocess does heavier work (e.g. I get a 30x slowdown when cat-ing a file to /dev/null).
I’m using Ray 2.9.2 and Python 3.10.11 on Ubuntu 22.04.

The use case: I want to train a model to write correct code, and during eval I generate many programs and run them against test cases.

I don’t understand where the slowdown comes from, and I would really appreciate an explanation of why this happens and how to fix it!