Streaming data for training/evaluation/inference

Can Ray Data be an option to allow a training/evaluation process to directly consume data from a data-generating process, without writing the data to disk?

Consider the following naive pipeline:

  1. Ray Tune to generate training data in parallel and store it on disk as TFRecords
  2. Ray Tune to train ML model(s) using the stored data and a TensorFlow data pipeline (a rough sketch of steps 1-2 follows the list)
  3. Ray Tune to generate evaluation data in parallel and store it on disk as TFRecords
  4. Ray Tune to evaluate the ML model using the stored data and a TensorFlow data pipeline
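
Roughly, the disk-based handoff in steps 1-2 looks like this today (simplified; run_simulation is a stand-in for the real simulator and the paths are illustrative):

import tensorflow as tf
from ray import tune

def run_simulation(seed):
    # Stand-in for the real simulator: returns one serialized tf.train.Example.
    feature = {"x": tf.train.Feature(float_list=tf.train.FloatList(value=[float(seed)]))}
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Step 1: each Tune trial runs a batch of simulations and writes the samples to disk.
def generate_data(config):
    with tf.io.TFRecordWriter(f"/tmp/train-{config['shard']}.tfrecord") as writer:
        for seed in range(config["sims_per_shard"]):
            writer.write(run_simulation(seed))

# Step 2: a separate Tune run trains on the stored TFRecords via tf.data.
def train_model(config):
    files = tf.io.gfile.glob("/tmp/train-*.tfrecord")
    dataset = tf.data.TFRecordDataset(files).batch(32)
    # ...parse the examples and fit the model on `dataset` here

tune.run(generate_data,
         config={"shard": tune.grid_search(list(range(10))), "sims_per_shard": 100})
tune.run(train_model)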

Is there a way to combine steps 1-2 and 3-4 using Ray Data? I.e. instead of storing data to disk, steps 2 and 4 consume the output of steps 1 and 3 directly. We can pass a Ray Serve handle to Ray Tune, which potentially solves this for steps 3-4. For combining steps 1-2, we could leverage Ray Serve in a similar way as in the RL examples, but then we would not leverage the data-pipeline ideas to accelerate training speed; the same would be true for combining steps 3-4.
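
For reference, a minimal sketch of the Serve-handle idea for steps 3-4 (the deployment name and call signatures are just illustrative, assuming the Serve deployment API):

import ray
from ray import serve, tune

serve.start()

@serve.deployment
class TrainedModel:
    def __call__(self, sample):
        # Placeholder: score a sample with the trained model.
        return 0.0

TrainedModel.deploy()
handle = TrainedModel.get_handle()

# The handle is serializable, so it can be passed to the trials via the Tune config.
def eval_fn(config):
    score = ray.get(config["model_handle"].remote({"obs": [1.0, 2.0]}))
    tune.report(score=score)

tune.run(eval_fn, config={"model_handle": handle})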

In the offline case the TensorFlow data pipeline in steps 2 and 4 could be exchanged for a Ray Data pipeline. Do you have a solution for the “streaming” case?
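
For the offline case, the exchange we have in mind would look roughly like this (the file format and window size are illustrative):

import ray

# Offline replacement for the tf.data input pipeline.
ds = ray.data.read_parquet("/tmp/train-data/")

# window() turns the Dataset into a DatasetPipeline so blocks are loaded and fed
# to training incrementally instead of being materialized all at once.
pipe = ds.window(blocks_per_window=10)

for batch in pipe.iter_batches(batch_size=1024):
    pass  # feed the batch to the model's train/eval step here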

Very interesting! Quick question @jharaldson, why are you using Ray Tune to generate training data?

@rliaw the main reason for using Ray Tune to generate data is that it provides a nice abstraction for parallel execution of jobs, handles configuration parameters, and integrates with the autoscaler.

Here is an example. We make use of actors (with re-use) that together execute 1 million small simulations (each takes 1-2 seconds). Each simulation generates a number of training samples. In the offline case these samples are written to disk and then used for training. When doing evaluation we need to execute 10 million small simulations; in the offline case the samples generated by the simulations are written to disk and then used for evaluation. This takes up a lot of disk space, requires clean-up, and adds waiting time before an initial evaluation score can be obtained. For certain metrics the simulations would in addition need to query the trained model and use the results. RLlib and Ray Serve have fairly good coverage for this case.

As a side note, sometimes the simulations are run using MATLAB code (packaged as a Python library using the MATLAB Engine API for Python). In the offline case the MATLAB code is triggered from the Ray actor and the results are returned to the Python environment. In the “online” case we use HTTP requests from the MATLAB code to query Ray Serve, but this adds a lot of overhead to the simulation time, which makes larger evaluations very slow. Not sure if Ray Data can be an option here.

One way I can see of doing this is to use only Datasets + Tune. Datasets can handle the parallel data generation and inference as well, which might be the easiest approach for this use case.

So instead of Tune(gen data) -> Tune(train models) -> Tune(gen eval data) -> Tune(eval models), you could do Datasets(gen data) -> Tune(train models); Datasets(gen eval data) -> Tune(eval models).

The code for one of the Datasets -> Tune steps could look like this (we will use Datasets’ ActorPool functionality to enable actor re-use for data generation):

import ray
from ray import tune
from ray.data import ActorPoolStrategy


class StatefulGenData:
    def __init__(self):
        # One-time setup per actor (e.g. load the simulator).
        pass

    def __call__(self, i):
        # Generate the data for index i and return it as the output row.
        return i


# Gen 1M rows of data using a stateful actor pool of size 10.
ds = ray.data.range(1000000)
ds = ds.map(StatefulGenData, compute=ActorPoolStrategy(10, 10))

# Use the dataset in Tune for training.
def train_fn(config):
    print(ds)  # the trainable can access the dataset in-memory via closure capture

tune.run(train_fn)
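
As a follow-up, if relying on closure capture feels fragile, tune.with_parameters can be used to pass the dataset explicitly; it stores large objects in the Ray object store and injects them into the trainable (a sketch, assuming a reasonably recent Tune):

from ray import tune

def train_fn(config, ds=None):
    # The dataset is injected via the object store rather than captured by closure.
    print(ds.count())

tune.run(tune.with_parameters(train_fn, ds=ds))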