Error between Modin and Xgboost_Ray

When I run the Ray example below, I still get the following error. Does anybody have an idea what causes it? Thanks.


RayTaskError(ValueError) Traceback (most recent call last)
/tmp/ipykernel_46448/1110740239.py in
8
9 # Train the classifier
---> 10 bst = train(
11 params=xgboost_params,
12 dtrain=train_set,

/opt/conda/lib/python3.9/site-packages/xgboost_ray/main.py in train(params, dtrain, num_boost_round, evals, evals_result, additional_results, ray_params, _remote, *args, **kwargs)
1284 _wrapped = force_on_current_node(_wrapped)
1285
-> 1286 bst, train_evals_result, train_additional_results = ray.get(
1287 _wrapped.remote(
1288 params,

/opt/conda/lib/python3.9/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
102 # we only convert init function if RAY_CLIENT_MODE=1
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
--> 104 return getattr(ray, func.__name__)(*args, **kwargs)
105 return func(*args, **kwargs)
106

/opt/conda/lib/python3.9/site-packages/ray/util/client/api.py in get(self, vals, timeout)
42 timeout: Optional timeout in milliseconds
43 """
---> 44 return self.worker.get(vals, timeout=timeout)
45
46 def put(self, *args, **kwargs):

/opt/conda/lib/python3.9/site-packages/ray/util/client/worker.py in get(self, vals, timeout)
436 op_timeout = max_blocking_operation_time
437 try:
--> 438 res = self._get(to_get, op_timeout)
439 break
440 except GetTimeoutError:

/opt/conda/lib/python3.9/site-packages/ray/util/client/worker.py in _get(self, ref, timeout)
464 logger.exception("Failed to deserialize {}".format(chunk.error))
465 raise
--> 466 raise err
467 if chunk.total_size > OBJECT_TRANSFER_WARNING_SIZE and log_once(
468 "client_object_transfer_size_warning"

RayTaskError(ValueError): ray::_wrapped() (pid=3279436, ip=192.168.156.43)
File "/opt/conda/lib/python3.9/site-packages/xgboost_ray/main.py", line 1275, in _wrapped
File "/tmp/ray/session_2022-08-23_22-10-24_470493_112/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/main.py", line 1453, in train
bst, train_evals_result, train_additional_results = _train(
File "/tmp/ray/session_2022-08-23_22-10-24_470493_112/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/main.py", line 1011, in _train
dtrain.assert_enough_shards_for_actors(num_actors=ray_params.num_actors)
File "/tmp/ray/session_2022-08-23_22-10-24_470493_112/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/matrix.py", line 748, in assert_enough_shards_for_actors
self.loader.assert_enough_shards_for_actors(num_actors=num_actors)
File "/tmp/ray/session_2022-08-23_22-10-24_470493_112/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/matrix.py", line 450, in assert_enough_shards_for_actors
data_source = self.get_data_source()
File "/tmp/ray/session_2022-08-23_22-10-24_470493_112/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/matrix.py", line 436, in get_data_source
raise ValueError(
ValueError: Invalid data source type: <class 'modin.pandas.dataframe.DataFrame'> with FileType: None for a distributed dataset.
FIX THIS by passing a supported data type. Supported data types for distributed datasets are a list of CSV or Parquet sources. If using Modin, Dask, or Petastorm, make sure the library is installed.

Hey @Hongming_Zheng, do you have Modin installed? Can you share the command you ran?

Hi Matthew, I installed Modin with pip install "modin[ray] @ git+https://github.com/modin-project/modin". The code is just the xgboost-ray example code from the link I attached. Thanks. Hongming

Hmm I wasn’t able to reproduce this with a fresh environment:

conda create -n ray-modin-39 python=3.9  
conda activate ray-modin-39
pip install xgboost_ray "modin[ray] @ git+https://github.com/modin-project/modin"

I was then able to run the job:

python simple_modin.py --smoke-test
2022-08-27 21:36:29,156	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
UserWarning: When using a pre-initialized Ray cluster, please ensure that the runtime env sets environment variable __MODIN_AUTOIMPORT_PANDAS__ to 1
2022-08-27 21:36:30,601	INFO main.py:1005 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
2022-08-27 21:36:33,715	INFO main.py:1050 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=69844) [21:36:33] task [xgboost.ray]:140345779922928 got new rank 0
(_RemoteRayXGBoostActor pid=69846) [21:36:33] task [xgboost.ray]:140302458791920 got new rank 2
(_RemoteRayXGBoostActor pid=69845) [21:36:33] task [xgboost.ray]:140551510271984 got new rank 1
(_RemoteRayXGBoostActor pid=69855) [21:36:33] task [xgboost.ray]:140390768027632 got new rank 3
2022-08-27 21:36:35,530	INFO main.py:1546 -- [RayXGBoost] Finished XGBoost training on training data with total N=32 in 4.97 seconds (1.81 pure XGBoost training time).
Final training error: 0.2500

Here are the versions of the dependencies I have installed:

pip freeze
aiohttp==3.8.1
aiohttp-cors==0.7.0
aiosignal==1.2.0
async-timeout==4.0.2
attrs==22.1.0
blessed==1.19.1
cachetools==5.2.0
certifi @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_83242e7e-f82d-4a71-8ef2-9d71d212d249gu_wxmeq/croots/recipe/certifi_1655968827803/work/certifi
charset-normalizer==2.1.1
click==8.0.4
colorful==0.5.4
distlib==0.3.6
filelock==3.8.0
frozenlist==1.3.1
fsspec==2022.7.1
google-api-core==2.8.2
google-auth==2.11.0
googleapis-common-protos==1.56.4
gpustat==1.0.0rc1
grpcio==1.43.0
idna==3.3
jsonschema==4.14.0
modin @ git+https://github.com/modin-project/modin@636fc59e6820a36937b72cfb96ba9fa60d871fe4
msgpack==1.0.4
multidict==6.0.2
numpy==1.23.2
nvidia-ml-py==11.495.46
opencensus==0.11.0
opencensus-context==0.1.3
packaging==21.3
pandas==1.4.3
platformdirs==2.5.2
prometheus-client==0.13.1
protobuf==3.20.1
psutil==5.9.1
py-spy==0.3.12
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydantic==1.9.2
pyparsing==3.0.9
pyrsistent==0.18.1
python-dateutil==2.8.2
pytz==2022.2.1
PyYAML==6.0
ray==2.0.0
redis==3.5.3
requests==2.28.1
rsa==4.9
scipy==1.9.1
six==1.16.0
smart-open==6.1.0
typing_extensions==4.3.0
urllib3==1.26.12
virtualenv==20.16.3
wcwidth==0.2.5
wrapt==1.14.1
xgboost==1.6.2
xgboost-ray==0.1.10
yarl==1.8.1

Thanks Matthew. Yes, it also works for me locally, but I am running on a remote cluster with "ray://…". That is where the error occurs. My code is below:


RAY_URL = 'ray://ray-head-svc.ray:10001'
extra_init_kw = {
    "runtime_env": {
        # 'pip': ['modin[ray] @ git+https://github.com/modin-project/modin'],
        'pip': ['xgboost_ray']
    }
}
cpus_per_actor = 1
num_actors = 4
# ray.init(num_cpus=num_actors + 1)
ray.init(RAY_URL, **extra_init_kw)
main(cpus_per_actor, num_actors)

The error is the same as the one in my first post:

ValueError: Invalid data source type: <class 'modin.pandas.dataframe.DataFrame'> with FileType: None for a distributed dataset.
FIX THIS by passing a supported data type. Supported data types for distributed datasets are a list of CSV or Parquet sources. If using Modin, Dask, or Petastorm, make sure the library is installed.

Thanks.
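The error text says to make sure the library is installed, so one thing worth checking is whether Modin can actually be imported inside the runtime environment the remote tasks run in. The following is only a minimal sketch of such a check: `probe_imports` and `probe_cluster` are hypothetical helper names (not part of Ray or xgboost_ray), and the address and `runtime_env` are whatever you already pass to `ray.init`.

```python
import importlib


def probe_imports(module_names):
    """Return {module name: version string, or None if the import fails}."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module defines __version__, so fall back to a marker.
            report[name] = getattr(module, "__version__", "unknown version")
    return report


def probe_cluster(address, runtime_env, module_names=("modin", "xgboost_ray")):
    """Run probe_imports in a Ray task that uses the given runtime_env.

    Hypothetical helper: it connects, runs one task, and disconnects again.
    """
    import ray  # imported lazily so probe_imports can be used without Ray

    ray.init(address=address, runtime_env=runtime_env)
    try:
        remote_probe = ray.remote(probe_imports)
        return ray.get(remote_probe.remote(list(module_names)))
    finally:
        ray.shutdown()
```

For example, `probe_cluster('ray://ray-head-svc.ray:10001', {'pip': ['xgboost_ray']})` would report `None` for `modin` if the workers cannot import it under that runtime environment, which would be consistent with the error above.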

Can you try running main as a ray remote function?

I still get an error, shown below. I added the remote decorator here. Thanks.

# Imports the snippet relies on (omitted from the original post)
import numpy as np
import pandas as pd
import ray
from xgboost_ray import RayDMatrix, RayParams, train
from xgboost_ray.data_sources.modin import MODIN_INSTALLED


@ray.remote
def main(cpus_per_actor, num_actors):
    if not MODIN_INSTALLED:
        print("Modin is not installed or installed in a version that is not "
              "compatible with xgboost_ray (< 0.9.0).")
        return

    # Import modin after initializing Ray
    from modin.distributed.dataframe.pandas import from_partitions

    # Generate dataset
    x = np.repeat(range(8), 16).reshape((32, 4))
    # Even numbers --> 0, odd numbers --> 1
    y = np.tile(np.repeat(range(2), 4), 4)

    # Flip some bits to reduce max accuracy
    bits_to_flip = np.random.choice(32, size=6, replace=False)
    y[bits_to_flip] = 1 - y[bits_to_flip]

    data = pd.DataFrame(x)
    data["label"] = y

    # Split into 4 partitions
    partitions = [ray.put(part) for part in np.split(data, 4)]

    # Create modin df here
    modin_df = from_partitions(partitions, axis=0)

    train_set = RayDMatrix(modin_df, "label")

    evals_result = {}
    # Set XGBoost config.
    xgboost_params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    # Train the classifier
    bst = train(
        params=xgboost_params,
        dtrain=train_set,
        evals=[(train_set, "train")],
        evals_result=evals_result,
        ray_params=RayParams(
            max_actor_restarts=0,
            gpus_per_actor=0,
            cpus_per_actor=cpus_per_actor,
            num_actors=num_actors),
        verbose_eval=False,
        num_boost_round=10)

    model_path = "modin.xgb"
    bst.save_model(model_path)
    print("Final training error: {:.4f}".format(
        evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    RAY_URL = 'ray://ray-head-svc.ray:10001'
    extra_init_kw = {
        "runtime_env": {
            # 'pip': ['modin[ray] @ git+https://github.com/modin-project/modin'],
            # 'pip': ['ray[air]'],
            'pip': ['xgboost_ray']
        }
    }
    cpus_per_actor = 1
    num_actors = 4
    # ray.init(num_cpus=num_actors + 1)
    ray.init(address=RAY_URL, **extra_init_kw)

    main.remote(cpus_per_actor, num_actors)

Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::main() (pid=854, ip=192.168.134.40)
File "/tmp/ipykernel_31574/317016564.py", line 40, in main
File "/tmp/ray/session_2022-08-29_13-05-07_938062_7/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/main.py", line 1385, in train
dtrain.load_data(ray_params.num_actors)
File "/tmp/ray/session_2022-08-29_13-05-07_938062_7/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/matrix.py", line 778, in load_data
refs, self.n = self.loader.load_data(
File "/tmp/ray/session_2022-08-29_13-05-07_938062_7/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/matrix.py", line 334, in load_data
data_source = self.get_data_source()
File "/tmp/ray/session_2022-08-29_13-05-07_938062_7/runtime_resources/pip/a8e57680f27af79b38868e663e15b85d89590602/virtualenv/lib/python3.9/site-packages/xgboost_ray/matrix.py", line 293, in get_data_source
raise ValueError(
ValueError: Unknown data source type: <class 'modin.pandas.dataframe.DataFrame'> with FileType: None.
FIX THIS by passing a supported data type. Supported data types include pandas.DataFrame, pandas.Series, np.ndarray, and CSV/Parquet file paths. If you specify a file, path, consider passing the filetype argument to specify the type of the source. Use the RayFileType enum for that. If using Modin, Dask, or Petastorm, make sure the library is installed.
(main pid=854) UserWarning: When using a pre-initialized Ray cluster, please ensure that the runtime env sets environment variable __MODIN_AUTOIMPORT_PANDAS__ to 1
ray.shutdown()

Could you try adding that env var to your runtime environment?

"runtime_env": {
    # 'pip': ['modin[ray] @ git+https://github.com/modin-project/modin'],
    # 'pip': ['ray[air]'],
    'pip': ['xgboost_ray'],
    "env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}
}
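Put together, the suggestion could look roughly like the sketch below. The variable name `__MODIN_AUTOIMPORT_PANDAS__` is taken from the Modin warning quoted earlier in the thread; `build_runtime_env` and `connect` are hypothetical helpers, and listing `modin[ray]` next to `xgboost_ray` in the usage note is an assumption that may or may not be needed depending on what is already installed on the cluster.

```python
def build_runtime_env(pip_packages, env_vars=None):
    """Build a runtime_env dict with one 'pip' list and optional 'env_vars'.

    A Python dict literal keeps only the last of several duplicate 'pip' keys,
    so all packages are collected into a single list here.
    """
    runtime_env = {"pip": list(pip_packages)}
    if env_vars:
        # Environment variable values are passed as strings.
        runtime_env["env_vars"] = {key: str(value) for key, value in env_vars.items()}
    return runtime_env


def connect(address, runtime_env):
    """Connect to an existing cluster with the given runtime_env."""
    import ray  # imported lazily so build_runtime_env can be used without Ray

    return ray.init(address=address, runtime_env=runtime_env)
```

A call such as `connect('ray://ray-head-svc.ray:10001', build_runtime_env(['xgboost_ray', 'modin[ray]'], {'__MODIN_AUTOIMPORT_PANDAS__': 1}))` would then replace the `ray.init(address=RAY_URL, **extra_init_kw)` line from the post above.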