ray.data.from_numpy error

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

This is my sample code:

import numpy as np
import ray

u = np.array([
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
])

ds = ray.data.from_numpy(u)

After I run this Python file, I get three errors. This is one of them:

ValueError: Type’s expected number of buffers (3) did not match the passed number (2).

I don’t understand why this happens. I’m new to Ray and have read the Ray documentation, but I still don’t know how to fix it. Can anyone help me solve this problem? Thank you all in advance.

Hi @more, thanks for posting this!

Unfortunately, representing n-dimensional arrays of strings in a Ray Dataset is not yet supported. We use an Arrow and Pandas extension type for representing tensor columns, and we haven’t added support for the string datatype yet. See “[Datasets] Support string tensor columns” (Issue #28410 in ray-project/ray on GitHub).
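To see why the array from the question falls into this unsupported case, it can help to inspect its dtype and number of dimensions before handing it to Ray. The snippet below is only a sketch using plain NumPy; the `is_string_tensor` name is made up for illustration, and it assumes the failure comes from the string dtype as described above.

```python
import numpy as np

u = np.array([
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
])

# dtype.kind is 'U' for NumPy unicode strings and 'S' for byte strings.
# `is_string_tensor` is a hypothetical name used only in this sketch.
is_string_tensor = u.dtype.kind in ("U", "S")

print(u.dtype, u.ndim, is_string_tensor)
```

If the check is true, the array holds strings rather than numbers, which is the case that tensor columns do not cover yet.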

However, it looks like your data might be a 2D table. We do support scalar string columns in tabular datasets, which are represented using Pandas DataFrames and/or Arrow Tables under the hood. Would this work for your use case?

import pandas as pd
import ray

u = pd.DataFrame({
    "a": ["NG", "NG", "NG"],
    "b": ["YU", "YU", "YU"],
    "c": ["M", "M", "M"],
    "date": ["1993-09-01", "1993-09-01", "1993-09-01"],
    "e": ["US. 100", "US. 100", "US. 100"],
    "f": ["Queen", "Queen", "Queen"],
    "g": ["D.C", "D.C", "D.C"],
    "h": ["True", "True", "True"],
})

ds = ray.data.from_pandas(u)
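If the data already exists as a 2D NumPy array, as in the original post, it may be easier to convert it than to retype every column by hand. Here is a minimal sketch of that conversion using only NumPy and pandas. The column names (`col_0`, `col_1`, ...) are placeholders I made up, since the original array has no column labels.

```python
import numpy as np
import pandas as pd

u = np.array([
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
    ['NG', 'YU', 'M', '1993-09-01', 'US. 100', 'Queen', 'D.C', 'True'],
])

# Hypothetical column names: one label per column of the 2D array.
columns = [f"col_{i}" for i in range(u.shape[1])]

# Each column of the array becomes a scalar string column in the DataFrame.
df = pd.DataFrame(u, columns=columns)

print(df.head())
```

The resulting `df` can then be passed to `ray.data.from_pandas` in the same way as the hand-written DataFrame.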

Thank you for your professional response. This question had been bothering me for a long time, and your solution fits my needs well. Thanks again.
