Ray Dataset from_generator equivalent

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello,

I have a use case where I was hoping to use Ray to consume a set of files where each file fits in memory but the full set does not. Unfortunately, these files are all stored in a custom format that pyarrow cannot read. I had been hoping to use something like TensorFlow’s from_generator function, but I didn’t see an obvious way to do this in the current documentation.

Is this possible within Ray Data or would you have other suggestions on how to approach this task?

Thanks for your advice!

Hi @nateyoder, if you know the file names in advance, you can do something like:

import ray

def read_file(row):
  # Recent Ray versions pass each row as a dict; from_items stores the value under "item".
  file_name = row["item"]
  # Your custom function to read the file in the custom format.
  data = ...
  return {"data": data}

file_names = ["foo", "bar", ...] # The list of file names.
ds = ray.data.from_items(file_names)
ds = ds.map(read_file)
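
If each file holds many records, one variation on the snippet above is to emit one row per record with flat_map, which is probably the closest thing to a generator-style source here. This is only a minimal sketch: it assumes a recent Ray 2.x release where from_items puts each value under the "item" key, and the "key=value" line format, read_records, and build_dataset are made-up names standing in for your custom format.

```python
from typing import Any, Dict, List


def read_records(row: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Parse one file into a list of rows (hypothetical 'key=value' line format)."""
    # Assumption: from_items stored the file name under the "item" key.
    path = row["item"]
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key, _, value = line.partition("=")
            records.append({"file": path, "key": key, "value": value})
    return records


def build_dataset(file_names: List[str]):
    """Build a dataset with one row per record across all files."""
    # Imported here so read_records can be exercised without Ray installed.
    import ray

    ds = ray.data.from_items(file_names)
    # flat_map expects the function to return a list of rows per input row.
    return ds.flat_map(read_records)
```

You could then consume the result with something like `for batch in build_dataset(file_names).iter_batches(): ...`. Whether memory stays bounded depends on how Ray splits the file names into blocks, so it is worth checking that against your Ray version.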