Hello Ray Community,
We are training an XGBoost model with Ray but are facing the following issues; it would be great if someone could guide us. We are using Ray 2.3.1 on Python 3.9.16.
Dataset
- train: 35GB parquet, 250GB in memory
- validation: 35GB parquet, 250GB in memory
Context:
- Machine specification: 1TB RAM, 128 CPUs, 3.78TB local NVMe SSD (PCIe Gen4).
- We are using Ray Datasets instead of RayDMatrix, based on this Ray documentation.
- The code flow is as follows:
import json

import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

ray.init(
    _system_config={
        "max_io_workers": 4,
        "min_spilling_size": 100 * 1024 * 1024,  # spill at least 100MB at a time
        "object_spilling_config": json.dumps({
            "type": "filesystem",
            "params": {
                "directory_path": "<local_ssd_path>/ray_spillage_dir",
                "buffer_size": 100 * 1024 * 1024,  # 100MB write buffer
            },
        }),
    },
    _plasma_directory="<local_ssd_path>/ray_plasma_dir",
    object_store_memory=300_000_000_000,  # 300GB
    _memory=600_000_000_000,  # 600GB
)
train_dataset = ray.data.read_parquet(train_files__, columns=features__, schema=custom_schema)  # int32 columns
validation_dataset = ray.data.read_parquet(validation_files__, columns=features__, schema=custom_schema)  # int32 columns
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=10,
        resources_per_worker={"CPU": 12},
        # _max_cpu_fraction_per_node=0.8,
        use_gpu=False,
    ),
    label_column="time_curr_obs",
    num_boost_round=50,
    params={
        # XGBoost-specific params
        "tree_method": "hist",
        "max_depth": 8,
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": validation_dataset},
)
result = trainer.fit()
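As extra context, here is a minimal sketch of how we inspect the block layout of these datasets (num_blocks() and size_bytes() are standard Ray Dataset methods; nothing else is assumed):

# Sketch: check how many blocks Ray Data produced for the parquet reads,
# since the block count bounds how far loading/conversion can fan out.
print("train blocks:", train_dataset.num_blocks())
print("train size (bytes):", train_dataset.size_bytes())
print("validation blocks:", validation_dataset.num_blocks())
print("validation size (bytes):", validation_dataset.size_bytes())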
Issues being observed:
1. The CPU usage during the data-loading phase is very low, and only one RayXGBoostActor.load_data actor is created in the initial loading phase, when the data is converted into Ray objects. How can we increase the number of load_data actors? Even after setting num_workers to 10, only one load_data actor is created. With RayDMatrix (instead of Ray Datasets), increasing num_actors would lead to that many load_data actors being created; here, num_workers seems to be used only during training. A sketch of what we are considering is attached after this list (issue 1 snippet).
2. If we do not pass the plasma directory, Ray object creation and spilling are around 10x faster than when _plasma_directory points to the local SSD. We have noticed that /dev/shm is used by default in this case, but training is then very slow, as Ray complains that the system is low on memory (issue 2 snippet below).
3. We have tried tuning the "max_io_workers": 4 parameter, but this improves neither the runtime nor the write throughput during Ray object creation and spilling. Which parameters can be tuned to optimize this?
4. How should _max_cpu_fraction_per_node be used in combination with num_workers and resources_per_worker? Tuning this parameter does not seem to change the runtime, but it is suggested in a warning during the run (issue 4 snippet below).
5. Can we pass num_actors for data parallelism during Ray object creation and spilling via the dmatrix_params parameter of XGBoostTrainer (issue 5 snippet below)?
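Issue 1 snippet. This is roughly what we are experimenting with to get more blocks and, we hope, more load-side parallelism; parallelism and repartition are standard Ray Data APIs, but we do not know whether they actually change how many load_data actors the trainer creates:

# Sketch for issue 1 (unvalidated): request more read tasks / output blocks.
train_dataset = ray.data.read_parquet(
    train_files__,
    columns=features__,
    schema=custom_schema,
    parallelism=256,  # ask Ray Data for more blocks up front
)
# Or repartition an already-read dataset into more blocks:
train_dataset = train_dataset.repartition(256)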
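Issue 2 snippet. The faster configuration we are comparing against is the same ray.init call, just without the _plasma_directory override, so the object store stays on /dev/shm:

# Sketch for issue 2: identical settings, but no _plasma_directory, so Ray
# falls back to /dev/shm for the object store. Object creation and spilling
# are ~10x faster here, but training then hits low-memory complaints.
ray.init(
    _system_config={
        "max_io_workers": 4,
        "min_spilling_size": 100 * 1024 * 1024,
        "object_spilling_config": json.dumps({
            "type": "filesystem",
            "params": {"directory_path": "<local_ssd_path>/ray_spillage_dir"},
        }),
    },
    object_store_memory=300_000_000_000,  # 300GB
    _memory=600_000_000_000,  # 600GB
)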
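Issue 4 snippet. This is where we currently plug in _max_cpu_fraction_per_node; we are unsure whether this is the intended way to combine it with num_workers and resources_per_worker:

# Sketch for issue 4: the ScalingConfig variant we have been testing.
scaling_config = ScalingConfig(
    num_workers=10,
    resources_per_worker={"CPU": 12},
    _max_cpu_fraction_per_node=0.8,  # suggested by the runtime warning
    use_gpu=False,
)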
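Issue 5 snippet. This is what we had in mind for dmatrix_params; the "num_actors" key is purely hypothetical on our side, and we do not know whether it is accepted or silently ignored:

# Sketch for issue 5 (hypothetical): would this control the number of actors
# used while the Ray objects / DMatrix shards are built?
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=10, resources_per_worker={"CPU": 12}, use_gpu=False),
    label_column="time_curr_obs",
    num_boost_round=50,
    params={"tree_method": "hist", "max_depth": 8, "eval_metric": ["logloss", "error"]},
    datasets={"train": train_dataset, "valid": validation_dataset},
    dmatrix_params={"train": {"num_actors": 10}},  # hypothetical key, not validated
)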