Multi-stage fanning pipeline using Ray: Queues + Actors vs. Workflows

I have built a multi-stage pipeline using Ray. The first stage takes a single input from the calling process and generates N inputs. The next stage consumes those N inputs and returns their results to the original calling process.

I have implemented this using Ray Actors and Queues. Here is a toy example with trivial computation:

import ray
from ray.util.queue import Queue

num_queue = Queue(maxsize=100)
str_queue = Queue(maxsize=100)
out_queue = Queue(maxsize=100)


@ray.remote
def append_a(get_queue: Queue, put_queue: Queue):
    # stage 1: consume one item, produce one item
    while num := get_queue.get(block=True):
        print(f"got work {num}")
        put_queue.put(f"{num}a")


@ray.remote
def append_b(get_queue: Queue, put_queue: Queue):
    # stage 2: consume one item, fan it out into three
    while num_str := get_queue.get(block=True):
        print(f"got work {num_str}")
        for i in range(3):
            put_queue.put(f"{num_str}{i}b")


# create two workers for each stage (each is a long-running Ray task)
for _ in range(2):
    append_a.remote(num_queue, str_queue)
    append_b.remote(str_queue, out_queue)


# submit to queue
for i in range(10):
    num_queue.put(str(i))


# retrieve results
for _ in range(10 * 3):  # 10 inputs, each fanned out into 3 results by append_b
    print(out_queue.get())

Using Queues and Actors does work, but it feels fragile and awkward:

  1. Queue capacity and actor state are not easily monitorable via the Ray dashboard.
  2. Manually creating actors prevents Ray from auto-scaling in response to queue pressure.
  3. If an actor fails, it’s difficult for the original calling process to detect that and react; instead, the caller can get stuck waiting forever at out_queue.get() (see the sketch after this list).
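
One partial mitigation for point 3 is to poll with a timeout instead of blocking forever. A minimal sketch using the timeout argument of Queue.get (the 5-second value is arbitrary):

from ray.util.queue import Empty

results = []
while len(results) < 10 * 3:
    try:
        results.append(out_queue.get(block=True, timeout=5.0))
    except Empty:
        # no result within 5s: a worker may have died upstream
        raise RuntimeError("pipeline stalled; check worker health")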

Is there some other way to get a persistent actor/consumer setup that scales automatically? Is this the use case for Ray Workflows, which is still in alpha? Should I be using Ray in combination with Airflow instead?

cc @yic can you answer this question?

For this kind of streaming topology, you can try Dataset Pipelines (see the Ray v1.9.2 docs).

It would look something like this (with append_a and append_b rewritten as plain per-row functions):

import ray
from ray.data.dataset_pipeline import DatasetPipeline

def append_a(val):
    return f"{val}a"

def append_b(val):
    return f"{val}b"

def source():
    # from_iterable expects dataset-producing callables, one per window
    for _ in range(100):
        yield lambda: ray.data.from_items(["input", "items", "for", "batch"])

pipe = DatasetPipeline.from_iterable(source()) \
    .map(append_a) \
    .map(append_b)

for output in pipe.iter_rows():
    print(output)
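
Since your append_b stage fans each input out into three results, flat_map is probably a better fit than map for that step. A minimal sketch, assuming plain Python functions for the stages (fan_out_b is a name I made up):

import ray

def append_a(val):
    return f"{val}a"

def fan_out_b(val):
    # flat_map lets one input row become several output rows
    return [f"{val}{i}b" for i in range(3)]

ds = ray.data.from_items([str(i) for i in range(10)])
for row in ds.map(append_a).flat_map(fan_out_b).iter_rows():
    print(row)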

It even handles more complicated pipelines!

import ray

def prepend_a(val):
    # executed only once per item, even though two downstream
    # datasets consume the result (see the note below)
    print("a", val)
    return f"a{val}"

def append_b(val):
    return f"{val}b"

def append_c(val):
    return f"{val}c"


data = ray.data.from_items([str(i) for i in range(10)])

a_prepended = data.map(prepend_a)
final_b = a_prepended.map(append_b)
final_c = a_prepended.map(append_c)

for bb, cc in zip(final_b.iter_rows(), final_c.iter_rows()):
    print(bb, cc)
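
The "only called once" behavior falls out of how Datasets execute (as of Ray 1.x): each transform runs eagerly when invoked, so data.map(prepend_a) materializes its output blocks once, and both final_b and final_c read from those shared blocks rather than recomputing the common stage.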