I tried to run a single XGBoost training job, but the job gets killed without any errors or exceptions. I checked the logs, but nothing seems to be wrong. How can I find the problem? Is there any way to increase the log level?
My script and screenshots from the terminal and Ray’s dashboard are attached.
Code:
# Imports assumed for a Ray 2.x release that still accepts SyncConfig(syncer=...);
# module paths may differ in other versions.
import ray
from ray.air.config import RunConfig, ScalingConfig
from ray.train.xgboost import XGBoostTrainer
from ray.tune.syncer import SyncConfig

# CustomCommandSyncer is not a Ray built-in; it is assumed to be a
# user-defined Syncer subclass that is already in scope.

# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

scaling_config = ScalingConfig(
    # Number of workers to use for data parallelism.
    num_workers=2,
    # Whether to use GPU acceleration.
    use_gpu=False,
)

# Sync results through the AWS CLI against a custom S3 endpoint.
sync_config = SyncConfig(
    syncer=CustomCommandSyncer(
        sync_up_template="aws s3 --endpoint-url=$AWS_ENDPOINT_URL sync {source} {target}",
        sync_down_template="aws s3 --endpoint-url=$AWS_ENDPOINT_URL sync {source} {target}",
        delete_template="aws s3 --endpoint-url=$AWS_ENDPOINT_URL rm {target} --recursive",
    ),
)

trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=RunConfig(
        sync_config=sync_config,
        verbose=1,
        storage_path="s3://mlops/milad/",
    ),
    label_column="target",
    num_boost_round=10,
    params={
        # XGBoost specific params
        "objective": "binary:logistic",
        # "tree_method": "gpu_hist",  # uncomment this to use GPU for training
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)

result = trainer.fit()
print(result.metrics)
Terminal:
https://mega.nz/file/XJkjyTYb#8AFhRnFTxzojkUIHlkE-wW8q-qRSeFnE7CVJaDVOA_A
Dashboard:
https://mega.nz/file/mMNmTKDC#ePShAMUO-WCtcOkyx8zCsys9wQqT8N6DhjFP0J8ejTs