How to deal with labeled image datasets?

I just started using Ray Datasets and I’m not sure whether I’m using it the right way.

I have an image dataset consisting of images (of course) and several pieces of information about each image, for example a label and the position of an object in the image. And I have a Ray cluster with some nodes/workstations.

My first question: Is it possible to put my data on only one node (the head node, for example), create the dataset there, and then share it with all nodes when using Ray Tune?

And my second question: How can I bring the data together? I can read the images with ray.data.read_images() and the labels with ray.data.read_numpy(), but how can I merge the two datasets? I can use neither .union nor .zip. Is there any solution for this case, or am I trying to use Ray Datasets in the wrong way?

I’m glad for any help. :slight_smile:

Q1: you shouldn’t need to worry about data placement. Ray Data will handle this for you and ship the data wherever it’s needed.

Q2: yeah, union and zip are popular ways to do that. Are you just trying to merge different columns into a single Dataset?

How can it be achieved from this point:

ds_img = ray.data.read_images("imgs")  # 7 example images

ds_pos = ray.data.from_numpy(np.random.random((7, 6))).repartition(7)

ds_color = ray.data.from_numpy(np.random.random((7, 3))).repartition(7)

Their schemas print as:

Dataset(schema={image: ArrowTensorType(shape=(641, 532, 4), dtype=uint8)})
Dataset(schema={__value__: ArrowTensorType(shape=(6,), dtype=double)})
Dataset(schema={__value__: ArrowTensorType(shape=(3,), dtype=double)})

This does seem like a good use case for zip.

Sometimes, if your datasets are not comparable in size, you can also create a random access dataset from the smaller one, and just look up the key while you map over the larger set.
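For illustration, that lookup pattern can be sketched without Ray at all (the names and data below are made up; in Ray Data the small table would become a random access dataset via Dataset.to_random_access_dataset, if I remember the API name correctly, and the lookup would happen inside map/map_batches over the large dataset):

```python
# Small side table: per-image metadata keyed by image id.
labels = {"img_0": "cat", "img_1": "dog", "img_2": "cat"}

# Large set: the images themselves (ids standing in for pixel data here).
images = [{"id": "img_0"}, {"id": "img_1"}, {"id": "img_2"}]

def attach_label(row):
    # The per-row lookup that would run inside the map function.
    return {**row, "label": labels[row["id"]]}

merged = [attach_label(row) for row in images]
print(merged[1])  # {'id': 'img_1', 'label': 'dog'}
```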

When I use zip like this (it seems I did something different in the last few days when it didn’t work …):

ds = ds_img.zip(ds_pos)
ds = ds.zip(ds_color)

I get the following:

image: extension<arrow.py_extension_type<ArrowTensorType>>
__value__: extension<arrow.py_extension_type<ArrowTensorType>>
__value___1: extension<arrow.py_extension_type<ArrowTensorType>>

I would like to rename __value__ and __value___1. The only way I found is:

def rename_column(batch, column_name, old_column_name):
    batch[column_name] = batch.pop(old_column_name)
    return batch

ds = ds_img.zip(ds_pos).zip(ds_color)
ds = ds.map_batches(rename_column, fn_args=("Col", "__value___1")).map_batches(rename_column, fn_args=("Pos", "__value__"))
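As a side note, with dict-style batches that helper just moves a key from one name to another; here is a quick standalone check with a plain dict (no Ray involved, values are made up):

```python
# Standalone check of the rename_column helper on a dict "batch"
# (structurally what a numpy-format batch looks like).
def rename_column(batch, column_name, old_column_name):
    batch[column_name] = batch.pop(old_column_name)
    return batch

batch = {"image": [1, 2, 3], "__value__": [4, 5, 6]}
batch = rename_column(batch, "Pos", "__value__")
print(sorted(batch))  # ['Pos', 'image']
```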

Is there maybe an easier way? It only works after the last zip. For debugging it would be better if I could rename the columns right after loading, especially when working with more data.

You can rename them before zip? Something like:

dataset = dataset.map_batches(
    lambda batch: batch.rename(columns={"data": "image", "data_1": "label"}),
    batch_format="pandas",
)
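The rename itself is just pandas’ DataFrame.rename applied to each batch; a standalone sanity check with pandas only (column names taken from the snippet above, data made up):

```python
import pandas as pd

# Per-batch rename: exactly what the lambda does to each pandas batch.
batch = pd.DataFrame({"data": [0.1, 0.2], "data_1": ["cat", "dog"]})
renamed = batch.rename(columns={"data": "image", "data_1": "label"})
print(list(renamed.columns))  # ['image', 'label']
```

Note that rename returns a new DataFrame, which is why the lambda returns its result instead of mutating the batch.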

@Alpe6825 Does @gjoliver’s suggestion resolve your problem of merging your disparate datasets?

Great :partying_face: It works when I make small changes:

dataset = dataset.map_batches(
    lambda batch: batch.rename(columns={"__value__": "new_name"}),
    batch_format="pandas",
)

@gjoliver Thank you very much :smiley:
@Jules_Damji Yes it was very helpful :slight_smile:

@Alpe6825 Good to know.

@Jules_Damji and @gjoliver

I noticed some slightly strange behavior with the solution we found over the last few days. (Maybe it’s a bug?)

If I append new columns with certain names like this:

for f in np_files:
    name = f.replace(".npy", "")
    _ds = ray.data.read_numpy(f"files-numpy/{f}")
    _ds = _ds.map_batches(lambda batch: batch.rename_columns([name]), batch_format="pyarrow")
    ds = ds.zip(_ds)

… it only works if I also put print(_ds.schema()) inside the loop; otherwise all columns get the name of the last column, with a number suffix (pose_1, pose_2, pose_3).

@gjoliver Is this a bug?

That’s because all the lambdas you created in the loop captured a reference to the same variable name, which by the time they actually run holds the value from the last loop iteration. Adding print(_ds.schema()) presumably forces the map to execute inside the loop, while name still holds the value for the current file.
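The classic fix for this late-binding behavior is to bind the current value as a default argument, so each lambda gets its own copy. A minimal demonstration, independent of Ray:

```python
# Late binding: every lambda closes over the same variable, so after the
# loop finishes they all see its final value.
late = [lambda: name for name in ["image", "pose", "color"]]
print([f() for f in late])   # ['color', 'color', 'color']

# Fix: bind the current value via a default argument at definition time.
fixed = [lambda name=name: name for name in ["image", "pose", "color"]]
print([f() for f in fixed])  # ['image', 'pose', 'color']
```

Applied to the loop above, that would be lambda batch, name=name: batch.rename_columns([name]).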
