Issue with execution priority when running multiple high-stress remote functions

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

So I was just experimenting because I wanted a web frontend for heavy calculations, and I don't do much web development. Anyway, I ran into a problem with Ray:

import asyncio
import time

import ray
from pywebio.input import *
from pywebio.output import *
from pywebio import start_server

ray.init()

@ray.remote
def stress(num):
    return sum([i * j * k for i in range(num) for j in range(i) for k in range(j)])

@ray.remote
def stress_function(num):
    return ray.get([stress.remote(num) for _ in range(num)])


async def test_future(num):
    t1 = time.time()
    ref = stress_function.remote(num)
    fut: asyncio.Future = asyncio.wrap_future(ref.future())
    result = await fut
    t = time.time() - t1
    return t, result

async def test_sleep(num):
    print(num)
    print('start calc')
    t1 = time.time()
    await asyncio.sleep(num)
    t = time.time() - t1
    return t


def main():
    num = input("Number: ", type=NUMBER)  # NUMBER yields an int; range() rejects the float that FLOAT yields
    put_text(f'Input Number: {num}')
    t, res = asyncio.run(test_future(num))
    put_text(f'{t}')
    # print(res)

start_server(main, auto_open_webbrowser=True)
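
For a sense of scale, the amount of work behind each input can be estimated directly from the code above. The sketch below counts the terms that the list comprehension in `stress` produces (one per triple with k < j < i < num, which is the binomial coefficient C(num, 3)) and multiplies that by the `num` tasks that `stress_function` launches. The helper names `terms_per_task` and `total_terms` are made up for this illustration and are not part of the original script.

```python
import math


def terms_per_task(num: int) -> int:
    """Number of products summed by one stress(num) call.

    The comprehension iterates over all triples with k < j < i < num,
    which is the binomial coefficient C(num, 3).
    """
    return math.comb(num, 3)


def total_terms(num: int) -> int:
    """Work for one stress_function(num) call: num tasks of stress(num)."""
    return num * terms_per_task(num)


if __name__ == "__main__":
    for n in (10, 300):
        print(n, terms_per_task(n), total_terms(n))
```

By this count the 300 input does roughly a million times more work than the 10 input (1,336,530,000 terms against 1,200), so a small request that finishes late is most likely waiting in a queue rather than computing.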

The issue is:
If I open the site in several tabs, enter “300” and click “Submit” in one, then enter “10” and click “Submit” in another, I would expect the result for 10 iterations to arrive much sooner than the result for 300 iterations. On Windows it mostly works like that, though sometimes it does not.
On Linux and macOS, however, it runs first in, first out: it calculates the 300 iterations exclusively and only starts the 10 iterations once those are done. I’m not saying it should never work like that, since FIFO is very useful, but in a web application it is not desirable. I guess it is some kind of scheduling issue, but I didn’t find anything about it.
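
The difference between the two behaviours can be illustrated with a toy model that does not use Ray at all. It assumes a fixed pool of workers, tasks that each take one time step, and two requests whose tasks are all queued up front. `fifo_order`, `round_robin_order`, and `simulate` are names invented for this sketch, and it is not a description of how Ray's scheduler actually works.

```python
from itertools import chain, zip_longest


def fifo_order(jobs):
    """Queue every task of the first job, then every task of the next."""
    return [name for name, count in jobs for _ in range(count)]


def round_robin_order(jobs):
    """Interleave the jobs' tasks one at a time."""
    columns = [[name] * count for name, count in jobs]
    mixed = chain.from_iterable(zip_longest(*columns))
    return [name for name in mixed if name is not None]


def simulate(queue, workers):
    """Return {job: time step at which its last task finishes}.

    Each task takes one time step and `workers` tasks run per step.
    """
    finish = {}
    for index, name in enumerate(queue):
        # Later tasks overwrite earlier ones, so the last task wins.
        finish[name] = index // workers + 1
    return finish


if __name__ == "__main__":
    jobs = [("big", 300), ("small", 10)]  # submitted in this order
    print("fifo       ", simulate(fifo_order(jobs), workers=8))
    print("round robin", simulate(round_robin_order(jobs), workers=8))
```

In this model the small request waits behind all 300 tasks under FIFO, but finishes after a few steps when tasks are interleaved, which is the behaviour one would want from a web frontend.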

It seems to be a Ray “issue” rather than an asyncio issue, because it works just as you would expect with asyncio.sleep.
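
The asyncio.sleep comparison can be reproduced without pywebio or Ray. The sketch below is a simplified stand-in for the two browser tabs: it starts a long and a short sleep in a single event loop and records which one finishes first. The names `sleeper` and `demo` and the shortened delays are assumptions made for illustration, and running both in one loop is a simplification of the original script.

```python
import asyncio


async def sleeper(label, delay, finished):
    """Sleep without blocking the event loop, then record completion."""
    await asyncio.sleep(delay)
    finished.append(label)


async def demo():
    finished = []
    # The long sleep is started first, mirroring the "300" tab.
    await asyncio.gather(
        sleeper("long", 0.3, finished),
        sleeper("short", 0.01, finished),
    )
    return finished


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The short sleep completes first even though it was started second, so the waiting side of the code is not what forces the first-in, first-out order.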

Tested on Windows 11 with Python 3.10, on Linux under WSL with Python 3.11, and on an Apple M1 running macOS 14 with Python 3.11 and 3.10.

Maybe someone has some advice :slight_smile:

Best regards

@Blissed First, welcome to the Ray community. Thanks for taking it for a spin.

cc: @chengsu Can anyone on the scheduling core team share any nuances of using asyncio here? The expected behavior is that in either case 10 should finish sooner than 300.

Thank you. I’m by no means an expert in asyncio; this code is pretty much the same as in the documentation:
https://docs.ray.io/en/latest/ray-core/actors/async_api.html
All the other implementations in the documentation make Windows behave the same way as Linux and macOS.

Btw, Ray is amazing! I really like it :smiley:

I did a bit more testing and made a sort of benchmark. It’s pretty easy: just duplicate the tab 4 times and you are running 300, 210, 120, and 30 iterations. On Linux it is very inconsistent: sometimes it works as you would expect, and other times the jobs run consecutively.

I modified the code to be quicker to benchmark:

from pywebio.input import *
from pywebio.output import *
from pywebio import start_server

import sys
import time
import ray
import asyncio

counter = 300

ray.init()


@ray.remote
def stress(num):
    return sum([i * j * k for i in range(num) for j in range(i) for k in range(j)])


@ray.remote
def stress_function(num):
    return ray.get([stress.remote(num) for _ in range(num)])


async def test_future(num):
    t1 = time.time()
    ref = stress_function.remote(num)
    fut: asyncio.Future = asyncio.wrap_future(ref.future())
    result = await fut
    t = time.time() - t1
    return t, result


async def test_sleep(num):
    print(num)
    print('start calc')
    t1 = time.time()
    await asyncio.sleep(num)
    t = time.time() - t1
    return t


def main():
    # num = input("Number: ", type=FLOAT)
    global counter
    c = counter
    counter = counter - 90
    if c < 1:
        c = 300
        counter = 210

    put_text(sys.version)
    put_text(f'Input Number: {c}')
    t, res = asyncio.run(test_future(c))
    put_text(f'{t}')
    # print(res)
    print(f'calculation of {c} interations took {t} sec')


start_server(main, auto_open_webbrowser=True)
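
The counter logic in `main` hands each new session the next value in the repeating sequence 300, 210, 120, 30. Here is a minimal sketch of that sequence on its own, isolated from pywebio and Ray; `next_input` is a hypothetical helper written only for this illustration.

```python
counter = 300


def next_input():
    """Return the iteration count for the next session, as main() does."""
    global counter
    c = counter
    counter = counter - 90
    if c < 1:
        # Wrapped around: restart the cycle at 300.
        c = 300
        counter = 210
    return c


if __name__ == "__main__":
    print([next_input() for _ in range(8)])
```

Opening four tabs therefore submits the largest job first and the smallest job last.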

On Windows it works very often on Python 3.8 and 3.10 and rarely on Python 3.11. On Linux it sometimes works on Python 3.11.

Take a look at this log. I have to stress that I did not change anything; I just started the server again and again:

[j@X570-WS python_scripts]$ python3.11 ra*
2023-11-18 22:20:12,440 INFO worker.py:1673 -- Started a local Ray instance.
Running on all addresses.
Use http://172.22.185.60:35691/ to access the application
gio: http://127.0.0.1:35691: Operation not supported
1700342444.3505282: calculation of 300 interations took 16.438920497894287 sec
1700342448.0011559: calculation of 210 interations took 16.400689601898193 sec
1700342448.383441: calculation of 30 interations took 12.763494968414307 sec
1700342448.385262: calculation of 120 interations took 14.418421030044556 sec
1700342483.5513709: calculation of 300 interations took 16.150527477264404 sec
1700342486.9258401: calculation of 210 interations took 17.632534503936768 sec
1700342487.273974: calculation of 30 interations took 14.958171367645264 sec
1700342487.2784004: calculation of 120 interations took 16.543764114379883 sec
1700342618.7501733: calculation of 300 interations took 16.23596501350403 sec
1700342622.038028: calculation of 210 interations took 17.441118955612183 sec
1700342622.408697: calculation of 120 interations took 16.332262754440308 sec
1700342622.40896: calculation of 30 interations took 14.794549942016602 sec
^CTraceback (most recent call last):
  File "/home/j/python_scripts/ray_web_test.py", line 56, in <module>
    start_server(main, auto_open_webbrowser=True)
  File "/home/j/.local/lib/python3.11/site-packages/pywebio/platform/tornado.py", line 302, in start_server
    tornado.ioloop.IOLoop.current().start()
  File "/home/j/.local/lib/python3.11/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib64/python3.11/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/usr/lib64/python3.11/asyncio/base_events.py", line 1884, in _run_once
    event_list = self._selector.select(timeout)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

[j@X570-WS python_scripts]$ python3.11 ra*
2023-11-18 22:26:13,510 INFO worker.py:1673 -- Started a local Ray instance.
Running on all addresses.
Use http://172.22.185.60:57435/ to access the application
gio: http://127.0.0.1:57435: Operation not supported
1700342798.9952323: calculation of 300 interations took 16.322675466537476 sec
1700342802.3376486: calculation of 210 interations took 17.52221155166626 sec
1700342802.7120988: calculation of 30 interations took 15.763528108596802 sec
1700342802.7164593: calculation of 120 interations took 16.72382354736328 sec

[j@X570-WS python_scripts]$ python3.11 ra*
2023-11-18 22:27:01,325 INFO worker.py:1673 -- Started a local Ray instance.
Running on all addresses.
Use http://172.22.185.60:38915/ to access the application
gio: http://127.0.0.1:38915: Operation not supported
1700342843.4577658: calculation of 300 interations took 18.210181951522827 sec
1700342843.6843748: calculation of 120 interations took 14.03867769241333 sec
1700342847.1199703: calculation of 30 interations took 16.302428245544434 sec
1700342847.1664753: calculation of 210 interations took 18.57814383506775 sec

[j@X570-WS python_scripts]$ python3.11 ra*
2023-11-18 22:27:38,558 INFO worker.py:1673 -- Started a local Ray instance.
Running on all addresses.
Use http://172.22.185.60:35943/ to access the application
gio: http://127.0.0.1:35943: Operation not supported
1700342882.0286176: calculation of 300 interations took 17.839910745620728 sec
1700342885.421515: calculation of 210 interations took 13.778618097305298 sec
1700342885.8063257: calculation of 120 interations took 12.168965578079224 sec
1700342885.808992: calculation of 30 interations took 11.02982211112976 sec
^CException ignored in atexit callback: <function shutdown at 0x7f332a97f6a0>
Traceback (most recent call last):
  File "/home/j/.local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 1737, in shutdown
    time.sleep(0.5)
KeyboardInterrupt:

[j@X570-WS python_scripts]$ python3.11 ra*
2023-11-18 22:28:19,379 INFO worker.py:1673 -- Started a local Ray instance.
Running on all addresses.
Use http://172.22.185.60:34803/ to access the application
gio: http://127.0.0.1:34803: Operation not supported
1700342921.8835285: calculation of 300 interations took 16.83345103263855 sec
1700342925.4815977: calculation of 210 interations took 19.313156366348267 sec
1700342925.8852403: calculation of 30 interations took 18.02552342414856 sec
1700342925.8885205: calculation of 120 interations took 18.961016178131104 sec

[j@X570-WS python_scripts]$ python3.11 ra*
2023-11-18 22:28:57,210 INFO worker.py:1673 -- Started a local Ray instance.
Running on all addresses.
Use http://172.22.185.60:34917/ to access the application
gio: http://127.0.0.1:34917: Operation not supported
1700342944.9960535: calculation of 120 interations took 2.5228018760681152 sec
1700342950.1519718: calculation of 30 interations took 2.006753444671631 sec
1700342952.0541008: calculation of 210 interations took 10.675888061523438 sec
1700342961.2478642: calculation of 300 interations took 20.914763927459717 sec
**This is how you would expect it to run...**
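
To compare runs without eyeballing the log, the finish order can be extracted programmatically. The sketch below parses lines in the format shown above and returns the jobs in the order they finished. It assumes the leading number on each line is a Unix timestamp taken when the line was printed, which the posted script does not show. The two samples are copied from the first and last runs in the log, and `finish_order` is a name made up here.

```python
import re

# "interations" is spelled as it appears in the log output.
LINE = re.compile(
    r"^(?P<stamp>\d+\.\d+): calculation of (?P<n>\d+) interations "
    r"took (?P<secs>\d+\.\d+) sec$"
)


def finish_order(log_text):
    """Return (iterations, seconds) pairs sorted by the leading timestamp."""
    rows = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows.append(
                (float(match["stamp"]), int(match["n"]), float(match["secs"]))
            )
    return [(n, secs) for _, n, secs in sorted(rows)]


FIRST_RUN = """\
1700342444.3505282: calculation of 300 interations took 16.438920497894287 sec
1700342448.0011559: calculation of 210 interations took 16.400689601898193 sec
1700342448.383441: calculation of 30 interations took 12.763494968414307 sec
1700342448.385262: calculation of 120 interations took 14.418421030044556 sec
"""

LAST_RUN = """\
1700342944.9960535: calculation of 120 interations took 2.5228018760681152 sec
1700342950.1519718: calculation of 30 interations took 2.006753444671631 sec
1700342952.0541008: calculation of 210 interations took 10.675888061523438 sec
1700342961.2478642: calculation of 300 interations took 20.914763927459717 sec
"""

if __name__ == "__main__":
    for name, text in (("first run", FIRST_RUN), ("last run", LAST_RUN)):
        print(name, finish_order(text))
```

In the first run every job takes between roughly 12 and 17 seconds regardless of its size, while in the last run the two small jobs finish in about 2 seconds each and the 300 job finishes last.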