XGBoost-Ray Object Creation and Spilling Bottleneck

Hello Ray Community,
We are training an XGBoost-Ray model but are facing the following issues; it would be great if someone could guide us. We are using Ray 2.3.1 on Python 3.9.16.

  1. train: 35GB parquet, 250GB in memory
  2. validation: 35GB parquet, 250GB in memory
  3. Machine specification: 1TB RAM, 128 CPUs, 3.78TB local NVMe (SSD, PCIe Gen4).
  4. We are using Ray Datasets instead of RayDMatrix, based on the Ray documentation.
  5. The code flow is roughly as follows (imports of json and ray omitted for brevity; spilling is configured via _system_config):

         ray.init(
             _system_config={
                 "max_io_workers": 4,
                 "min_spilling_size": 100 * 1024 * 1024,  # spill at least 100MB at a time
                 "object_spilling_config": json.dumps({
                     "type": "filesystem",
                     "params": {"directory_path": "<local_ssd_path>/ray_spillage_dir",
                                "buffer_size": 100 * 1024 * 1024}})},
             object_store_memory=300000000000.0,  # 300GB
             _memory=600000000000.0)  # 600GB

train_dataset = ray.data.read_parquet(train_files__, columns=features__, schema=custom_schema)  # int32 dtype
validation_dataset = ray.data.read_parquet(validation_files__, columns=features__, schema=custom_schema)  # int32 dtype

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=10,
        resources_per_worker={"CPU": 12},
        # _max_cpu_fraction_per_node=0.8,
    ),
    label_column="<label_column>",
    # XGBoost-specific params
    params={
        "tree_method": "hist",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": validation_dataset},
)
result = trainer.fit()
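For context, a back-of-envelope estimate of how much data must spill with these settings (using the sizes above; the actual spill volume depends on block lifetimes and intermediate copies, so this is only a lower bound):

```python
GB = 1_000_000_000

train_in_memory = 250 * GB   # train dataset, materialized
valid_in_memory = 250 * GB   # validation dataset, materialized
object_store = 300 * GB      # object_store_memory passed to ray.init

total_demand = train_in_memory + valid_in_memory
min_spill = max(0, total_demand - object_store)

print(f"demand: {total_demand / GB:.0f} GB, minimum spill: {min_spill / GB:.0f} GB")
# With these numbers, at least ~200 GB must be written to the spill
# directory, which is why spill-path throughput matters for load time.
```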

Issues being observed:

The CPU usage during the data-loading phase is very low, and only one RayXGBoostActor.load_data task appears in the initial loading phase, when the data is converted to Ray objects.

How can we increase the number of load_data actors? Even after setting num_workers to 10, only one load_data actor is created.
With RayDMatrix (instead of Ray Datasets), increasing num_actors would create that many load_data actors. Here, however, num_workers seems to be used only during training.

  1. If we do not pass the plasma directory, then Ray object creation and spilling is around 10x faster compared to pointing it at the local SSD.
     We have noticed that /dev/shm is used by default in this case, but training is then very slow, as the system complains that it is low on memory.

  2. We have tried tuning the "max_io_workers": 4 parameter, but this changes neither the runtime nor the write throughput during Ray object creation and spilling.
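As a rough sanity check on whether spill I/O can even be the bottleneck here: assuming a Gen4 NVMe sustains on the order of 3 GB/s sequential writes (an assumed figure, not a measured one) and the ~200 GB spill volume implied by the dataset sizes above:

```python
GB = 1_000_000_000

spill_volume = 200 * GB   # rough spill estimate: 500 GB demand - 300 GB store
ssd_write_bw = 3 * GB     # assumed sustained sequential write speed, bytes/s

ideal_spill_seconds = spill_volume / ssd_write_bw
print(f"ideal spill time: {ideal_spill_seconds / 60:.1f} min")
# If the observed load phase takes far longer than this (~1-2 min), the
# bottleneck is likely object creation/serialization rather than raw spill
# bandwidth, which would explain why raising max_io_workers changes nothing.
```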

Which parameters can be tuned to optimize this?

  1. How should _max_cpu_fraction_per_node be used in combination with num_workers and resources_per_worker? Tuning this parameter doesn't seem to change the runtime, but it is mentioned in a warning during the run.

  2. Can we pass num_actors for data parallelism during Ray object creation and spilling via the dmatrix_params parameter of XGBoostTrainer?

Thanks for writing this up!

My understanding is that the number of RayXGBoostActor instances should equal num_actors. See code here.

And every RayXGBoostActor should only be responsible for loading its own shard. Could you paste the ray dashboard output (with RayXGBoostActor.load_data)? The current screenshot doesn’t have that part. It also seems that RayXGBoostActor.train is only showing up once? Although it’s distributed across multiple actors as well. So maybe this is related to how the dashboard visuals are interpreted?

Now to dataset spilling. It’s probably going to be an issue if bulk ingestion is used (as in this case). The AIR team is trying to enable streaming ingestion with training.

cc @amogkam

+1. I believe the legend RayXGBoostActor.train in the graph refers to the aggregated usage of cpus across all the actors called RayXGBoostActor.train. It doesn’t mean that there is only 1 actor. You should look at the actor table in the job detail page to confirm the number of RayXGBoostActor.load_data. cc: @sangcho
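One concrete way to confirm the actor count is the Ray state CLI (available in recent Ray releases, including 2.3); the filter below assumes the actor class name matches what the dashboard shows:

```shell
# List actors and filter to the XGBoost workers; the number of ALIVE
# RayXGBoostActor entries should match num_workers.
ray list actors --filter "class_name=RayXGBoostActor"
```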


Thanks, we have verified that the XGBoost actors are created according to the num_workers parameter, and that these actors first run the load_data function and then the train function. As described above, CPU usage is currently very low during the data-loading phase, even though neither the spill directory nor the plasma directory (both on the local SSD, PCIe Gen4) is maxed out. This leads to our current runtime distribution: about 3/4 (25 min in this case) for creating Ray objects from the parquet files and 1/4 for the training phase.
This doesn't seem to be the maximum performance we can extract from XGBoost. Are we missing some optimization during the loading phase?

Yes, what you observed is entirely legit.

There are a few approaches I can think of and I probably need to ask our datasets team for more insights. So stay tuned. On a high level:

  • We can stick with bulk ingestion, which will likely incur disk spilling to some extent. I need to check with the team whether we read the parquet files with a parallelism of 10, and whether we can improve throughput by having more workers read from parquet. By the way, are your parquet files on local disk or on remote storage? Another aspect is disk spilling: is the disk-spill speed limiting the loading throughput?
  • We can explore streaming ingestion so that load_data and training overlap, which would also largely remove the disk-spilling problem. I need to check whether XGBoostTrainer (or the xgboost-ray library wrapped under it) supports this kind of batch-iteration semantics during training.
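On the first bullet: read parallelism is one concrete knob, since ray.data.read_parquet accepts a parallelism argument that controls how many read tasks (and hence CPUs) the load phase can use. A rough way to size it manually, targeting ~512 MB blocks (the target block size here is an assumption chosen to illustrate the arithmetic, not a recommendation):

```python
import math

GB = 1_000_000_000
MB = 1_000_000

in_memory_size = 250 * GB      # decompressed size of the train dataset
target_block_size = 512 * MB   # desired in-memory size per block/read task

# One read task per block; pass this as read_parquet(..., parallelism=...)
# so the load phase can fan out across many CPUs instead of a few tasks.
parallelism = math.ceil(in_memory_size / target_block_size)
print(parallelism)  # 489
```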

These are my thoughts. I will get back to you with more details after getting some input from our datasets team.

Any updates on streaming/batch training for XGBoost, to avoid loading giant datasets and spilling?