Ray train job gets killed with no errors!

milad_heidari · August 17, 2023, 11:20am

I tried to run a single XGBoost training job, but the job gets killed without any errors or exceptions. I checked the logs, but nothing seems to be wrong. How can I find the problem? Is there any way to increase the log level?
My script and screenshots from the terminal and Ray’s dashboard are attached.

Code:

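# Imports assumed for a Ray 2.x release of this era; adjust to your version.
import ray
from ray.air.config import RunConfig, ScalingConfig
from ray.train.xgboost import XGBoostTrainer
from ray.tune.syncer import SyncConfig

# CustomCommandSyncer is assumed to be a user-defined Syncer subclass defined elsewhere.

# Load data.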
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

scaling_config = ScalingConfig(
    # Number of workers to use for data parallelism.
    num_workers=2,
    # Whether to use GPU acceleration.
    use_gpu=False,
)

sync_config = SyncConfig(
    syncer=CustomCommandSyncer(
        sync_up_template="aws s3 --endpoint-url=$AWS_ENDPOINT_URL sync {source} {target}",
        sync_down_template="aws s3 --endpoint-url=$AWS_ENDPOINT_URL sync {source} {target}",
        delete_template="aws s3 --endpoint-url=$AWS_ENDPOINT_URL rm {target} --recursive",
    ),
)

trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=RunConfig(
        sync_config=sync_config,
        verbose=1,
        storage_path="s3://mlops/milad/",
    ),
    label_column="target",
    num_boost_round=10,
    params={
        # XGBoost specific params
        "objective": "binary:logistic",
        # "tree_method": "gpu_hist",  # uncomment this to use GPU for training
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)

result = trainer.fit()

print(result.metrics)

Terminal:
https://mega.nz/file/XJkjyTYb#8AFhRnFTxzojkUIHlkE-wW8q-qRSeFnE7CVJaDVOA_A

Dashboard:
https://mega.nz/file/mMNmTKDC#ePShAMUO-WCtcOkyx8zCsys9wQqT8N6DhjFP0J8ejTs

matthewdeng · August 19, 2023, 4:20pm

Hey, it’s not clear to me from the logs you’ve shown what went wrong. Can you copy/paste the full log from the Dashboard view?

milad_heidari · August 19, 2023, 4:42pm

Update

Thanks for the response.
The problem was insufficient memory on the head node. After I increased the head node's memory, the problem was solved. I wonder how we can detect such problems. The logs on the dashboard don’t say anything about a memory shortage or insufficient CPU cores.
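
One way to catch this kind of silent kill earlier is to watch available memory on the head node itself while the job runs. The following is only a minimal sketch, assuming a Linux head node where `/proc/meminfo` is readable. The function names and the 10% threshold are hypothetical illustrations, not part of Ray.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into a {field: integer value} mapping.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_fraction(info: dict[str, int]) -> float:
    """Fraction of total memory that the kernel reports as available."""
    return info["MemAvailable"] / info["MemTotal"]


def is_low_on_memory(info: dict[str, int], threshold: float = 0.10) -> bool:
    """True when available memory drops below the (hypothetical) threshold."""
    return available_fraction(info) < threshold


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        info = parse_meminfo(meminfo.read_text())
        print(f"available: {available_fraction(info):.1%}")
        if is_low_on_memory(info):
            print("warning: head node is running low on memory")
    else:
        print("/proc/meminfo not found; this sketch assumes Linux")
```

Running this periodically on the head node during training should show whether memory is close to exhausted before the job disappears. On Linux, kills by the kernel's out-of-memory killer are usually recorded in the kernel log (for example via `dmesg`), which is another place to look when the dashboard shows nothing.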

OmarAlmusa · May 19, 2025, 5:53am

How did you increase the memory of the head node?
