How to deal with labeled image datasets?

I just started using Ray Datasets and I’m not sure whether I’m using it the right way.

I have an image dataset consisting of images (of course) and several pieces of information about each image, for example a label and the position of an object in the image. And I have a Ray cluster with some nodes/workstations.

My first question: Is it possible to put my data on only one node (the head node, for example), create the dataset there, and then share it with all nodes when using Ray Tune?

And my second question: How can I bring the data together? I can read the images with ray.data.read_images() and the labels with ray.data.read_numpy(), but how can I merge the two datasets? I can use neither .union nor .zip. Is there any solution for this case, or am I trying to use Ray Datasets in the wrong way?

I’m glad for any help. :slight_smile:

Q1: you shouldn’t need to worry about data placement. Ray Data will handle this for you and ship the data wherever it’s needed.

Q2: yeah, union and zip are popular ways to do that. Are you just trying to merge different columns into a single Dataset?

How can it be achieved from this point:

ds_img = ray.data.read_images("imgs")  # 7 example images

ds_pos = ray.data.from_numpy(np.random.random((7, 6))).repartition(7)

ds_color = ray.data.from_numpy(np.random.random((7, 3))).repartition(7)

Their schemas print as:

Dataset(schema={image: ArrowTensorType(shape=(641, 532, 4), dtype=uint8)})
Dataset(schema={__value__: ArrowTensorType(shape=(6,), dtype=double)})
Dataset(schema={__value__: ArrowTensorType(shape=(3,), dtype=double)})

This does seem like a good use case for zip.

Sometimes, if your datasets are not comparable in size, you can also create a random access dataset from the smaller one, and just look up the key while you map over the larger set.
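For illustration, that lookup pattern can be sketched without Ray at all (the names and data below are made up; in Ray Data the small table would become a random access dataset via Dataset.to_random_access_dataset, if I remember the API name correctly, and the lookup would happen inside map/map_batches over the large dataset):

```python
# Small side table: per-image metadata keyed by image id.
labels = {"img_0": "cat", "img_1": "dog", "img_2": "cat"}

# Large set: the images themselves (ids standing in for pixel data here).
images = [{"id": "img_0"}, {"id": "img_1"}, {"id": "img_2"}]

def attach_label(row):
    # The per-row lookup that would run inside the map function.
    return {**row, "label": labels[row["id"]]}

merged = [attach_label(row) for row in images]
print(merged[1])  # {'id': 'img_1', 'label': 'dog'}
```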

When I use zip like this (it seems I did something different in the last few days when it didn’t work …):

ds = ds_img.zip(ds_pos)
ds = ds.zip(ds_color)

I get the following:

image: extension<arrow.py_extension_type<ArrowTensorType>>
__value__: extension<arrow.py_extension_type<ArrowTensorType>>
__value___1: extension<arrow.py_extension_type<ArrowTensorType>>

I would like to rename __value__ and __value___1. The only way I found is:

def rename_column(batch, column_name, old_column_name):
    batch[column_name] = batch.pop(old_column_name)
    return batch

ds = ds_img.zip(ds_pos).zip(ds_color)
ds = ds.map_batches(rename_column, fn_args=("Col", "__value___1")).map_batches(rename_column, fn_args=("Pos", "__value__"))
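As a side note, with dict-style batches that helper just moves a key from one name to another; here is a quick standalone check with a plain dict (no Ray involved, values are made up):

```python
# Standalone check of the rename_column helper on a dict "batch"
# (structurally what a numpy-format batch looks like).
def rename_column(batch, column_name, old_column_name):
    batch[column_name] = batch.pop(old_column_name)
    return batch

batch = {"image": [1, 2, 3], "__value__": [4, 5, 6]}
batch = rename_column(batch, "Pos", "__value__")
print(sorted(batch))  # ['Pos', 'image']
```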

Is there maybe an easier way? It only works after the last zip. For debugging it would be better if I could rename the columns right after loading, especially when working with more data.

You can rename them before zip? Something like:

dataset = dataset.map_batches(
    lambda batch: batch.rename(columns={"data": "image", "data_1": "label"}),
    batch_format="pandas",
)
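The rename itself is just pandas’ DataFrame.rename applied to each batch; a standalone sanity check with pandas only (column names taken from the snippet above, data made up):

```python
import pandas as pd

# Per-batch rename: exactly what the lambda does to each pandas batch.
batch = pd.DataFrame({"data": [0.1, 0.2], "data_1": ["cat", "dog"]})
renamed = batch.rename(columns={"data": "image", "data_1": "label"})
print(list(renamed.columns))  # ['image', 'label']
```

Note that rename returns a new DataFrame, which is why the lambda returns its result instead of mutating the batch.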

@Alpe6825 Does @gjoliver’s suggestion resolve your problem of merging your disparate datasets?

Great :partying_face: It works when I make small changes:

dataset = dataset.map_batches(
    lambda batch: batch.rename(columns={"__value__": "new_name"}),
    batch_format="pandas",
)

@gjoliver Thank you very much :smiley:
@Jules_Damji Yes it was very helpful :slight_smile:

@Alpe6825 Good to know.

@Jules_Damji and @gjoliver

I noticed some slightly strange behavior with the solution we found over the last few days. (Maybe it’s a bug?)

If I append new columns with certain names like this:

for f in np_files:
    name = f.replace(".npy", "")
    _ds = ray.data.read_numpy(f"files-numpy/{f}")
    _ds = _ds.map_batches(lambda batch: batch.rename_columns([name]), batch_format="pyarrow")
    ds = ds.zip(_ds)

… it only works if I also put print(_ds.schema()) inside the loop; otherwise all columns get the name of the last column, with a number suffix (pose_1, pose_2, pose_3).

@gjoliver Is this a bug?

That’s because all the lambdas you created in the loop captured a reference to the same variable name, which by the time they actually run holds the value from the last loop iteration. Adding print(_ds.schema()) presumably forces the map to execute inside the loop, while name still holds the value for the current file.
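The classic fix for this late-binding behavior is to bind the current value as a default argument, so each lambda gets its own copy. A minimal demonstration, independent of Ray:

```python
# Late binding: every lambda closes over the same variable, so after the
# loop finishes they all see its final value.
late = [lambda: name for name in ["image", "pose", "color"]]
print([f() for f in late])   # ['color', 'color', 'color']

# Fix: bind the current value via a default argument at definition time.
fixed = [lambda name=name: name for name in ["image", "pose", "color"]]
print([f() for f in fixed])  # ['image', 'pose', 'color']
```

Applied to the loop above, that would be lambda batch, name=name: batch.rename_columns([name]).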
