xgboost_ray on Ray does not use the GPU for training and hits OOM

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to run Ray on AWS SageMaker and use the GPU to train XGBoost.
The configuration is from aws-samples-for-ray/sagemaker/distributed-xgb-sm-pipeline/pipeline_scripts/train/script.py at main · aws-samples/aws-samples-for-ray · GitHub.
However, I used a 1 GB open-source dataset (5 GB when decompressed) and changed the script to 'train_xgboost_airline.py' as follows:

# train_xgboost_airline.py
import os
import time

import ray
import pandas as pd
import numpy as np	
from sklearn.model_selection import train_test_split
from xgboost_ray import train, RayDMatrix, RayParams

from sagemaker_ray_helper import RayHelper

FILENAME = os.path.join(os.environ.get("SM_CHANNEL_TRAIN"), "airline_14col.data.bz2")
MODEL_DIR = os.environ["SM_MODEL_DIR"]

max_depth = 6
learning_rate = 0.1
min_split_loss = 0
min_weight = 1
l1_reg = 0
l2_reg = 1

def get_airline(num_rows=None):
	cols = [
		"Year", "Month", "DayofMonth", "DayOfWeek", "CRSDepTime",
		"CRSArrTime", "UniqueCarrier", "FlightNum", "ActualElapsedTime",
		"Origin", "Dest", "Distance", "Diverted", "ArrDelay",
	]

	dtype = np.int16

	# Keys must match the names in `cols` exactly ("DayOfWeek", not "DayofWeek")
	dtype_columns = {
		"Year": dtype, "Month": dtype, "DayofMonth": dtype, "DayOfWeek": dtype,
		"CRSDepTime": dtype, "CRSArrTime": dtype, "FlightNum": dtype,
		"ActualElapsedTime": dtype, "Distance": dtype,
		"Diverted": dtype, "ArrDelay": dtype,
	}

	df = pd.read_csv(FILENAME, names=cols, dtype=dtype_columns, nrows=num_rows)

	# Encode categoricals as numeric
	for col in df.select_dtypes(['object']).columns:
		df[col] = df[col].astype('category').cat.codes

	# Turn into binary classification problem
	df["ArrDelayBinary"] = 1* (df["ArrDelay"] > 0)

	X = df[df.columns.difference(["ArrDelay", "ArrDelayBinary"])]
	y = df["ArrDelayBinary"]

	del df
	return X, y

def main():
	X, y = get_airline()

	X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
	X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)

	print(f'X_train shape: {X_train.shape}, y_train shape: {y_train.shape}')
	print(f'X_test shape: {X_test.shape}, y_test shape: {y_test.shape}')
	print(f'X_val shape: {X_val.shape}, y_val shape: {y_val.shape}')
	dtrain = RayDMatrix(X_train, y_train)
	dtest = RayDMatrix(X_test, y_test)
	dval = RayDMatrix(X_val, y_val)

	print("data downloaded")

	config = {
		"max_depth": max_depth,
		"learning_rate": learning_rate,
		"min_split_loss": min_split_loss,
		"min_weight": min_weight,
		"alpha": l1_reg,
		"lambda": l2_reg,
		"tree_method": "gpu_hist",
		"objective": "binary:logistic",
		"eval_metric": ["logloss", "error"],
	}

	evals_result = {}

	start = time.time()
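	bst = train(
		config,
		dtrain,
		evals_result=evals_result,
		evals=[(dtrain, "train"), (dval, "val")],
		# The remaining arguments were cut off when this script was pasted;
		# ray_params is where gpus_per_actor=1 is set.
	)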
	print("Training time:", time.time() - start)
	bst.save_model(os.path.join(MODEL_DIR, "model.xgb"))
	print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))

if __name__ == "__main__":
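	ray_helper = RayHelper()
	# start_ray() as in the referenced AWS sample; the call appears to have been lost from this post
	ray_helper.start_ray()

	start = time.time()
	main()
	print("Total time:", time.time() - start)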

The launch script is as follows:

pytorch_estimator = PyTorch(entry_point='train_xgboost_airline.py',

    "train": input_base_dir

The spec of ml.g5.2xlarge is 8 vCPUs, 32 GB RAM, 450 GB disk, and 1 GPU with 24 GB of memory. I set the instance count to 2.

As you can see, I set tree_method to 'gpu_hist' and gpus_per_actor to 1.
Q1. But I found that GPU utilization is 0.
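
One way to confirm the 0% reading from inside the training container is to poll nvidia-smi while training runs. The following is only a minimal sketch: the helper names parse_gpu_stats and read_gpu_stats are made up for illustration, the sample line is hand-written rather than taken from this job, and it assumes nvidia-smi is on the PATH of the training image.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, utilization %, memory used (MiB)
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse the CSV output of the query above into a list of dicts."""
    stats = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(
            {"index": int(index), "util_pct": int(util), "mem_used_mib": int(mem)}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi on the current node (assumes it is installed)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


# Hand-written sample line (hypothetical values, not from this job)
print(parse_gpu_stats("0, 0, 3\n"))
```

Calling read_gpu_stats() periodically on each node during train() would show whether the GPU is idle for the whole run or only while the data is being loaded and sharded.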

Q2. Sometimes Ray runs out of memory (OOM) with the cluster resources shown below.
ml.g5.2xlarge has 32 GB of RAM per instance, but the cluster resources show about 36 GB of cluster memory, which is roughly half of the total memory of the two instances. The cluster object store memory is about half of the cluster memory, which seems weird to me.

{'CPU': 16.0, 'GPU': 2.0, 'accelerator_type:A10G': 2.0, 'node:xxxxx': 1.0, 'object_store_memory': 18090914610.0, 'memory': 39212727093.0, 'node:xxxxx': 1.0}
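
For reference, the byte counts in that output can be converted to GiB as below. This is a small arithmetic sketch using only the two memory values from the output; reading `memory` and `object_store_memory` as two separate pools that add up is an assumption based on how Ray reports resources, not something shown in this job's logs.

```python
# Values copied from the cluster resources output in this post (bytes)
resources = {
    "object_store_memory": 18090914610.0,
    "memory": 39212727093.0,
}

GIB = 1024 ** 3

heap_gib = resources["memory"] / GIB                 # memory available to tasks and actors
store_gib = resources["object_store_memory"] / GIB   # shared object store
total_gib = heap_gib + store_gib
store_share = store_gib / total_gib

print(f"memory:              {heap_gib:.2f} GiB")
print(f"object_store_memory: {store_gib:.2f} GiB")
print(f"sum:                 {total_gib:.2f} GiB across 2 nodes")
print(f"object store share:  {store_share:.1%}")
```

Under that reading, the two pools together come to roughly 53 GiB over two nodes (about 27 GiB per instance), and the object store takes a little under a third of it, which would be consistent with Ray's documented default of reserving about 30% of available memory for the object store.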

The log:

(raylet) Spilled 5530 MiB, 32 objects, write throughput 205 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 9955 MiB, 33 objects, write throughput 168 MiB/s.
(raylet) [2024-04-30 03:09:12,910 E 211 211] (raylet) node_manager.cc:3071: 3 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (