How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.

I'm trying to train an XGBoost model with xgboost_ray in SageMaker. Training works for a smaller dataset (11.5 GB train, 150 MB validation), but when I run it on a larger dataset (230 GB train, 3 GB validation), the job fails with the error shared below.

- The read from Parquet usually works and reaches 100%, albeit with some objects spilled.
- The script then raises a DEADLINE_EXCEEDED error.

I ran this on 15 ml.m5.12xlarge instances, which together provide enough memory to cover 3x the dataset size: 882 GB of object store memory and 2048 GB of memory. Why would this fail on the larger dataset?
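For reference, here is a minimal sketch of what the relevant part of train_xgboost.py does (simplified: the S3 paths, the label column name, and the RayParams values are placeholders rather than the exact values from the failing job):

```python
import ray
from xgboost_ray import RayDMatrix, RayParams, train

# Connect to the Ray cluster started via `ray start` on the SageMaker instances.
ray.init(address="auto")

# Train/validation data are read from Parquet on S3 as Ray Datasets.
# Paths and label column are placeholders for the real ones.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")
val_ds = ray.data.read_parquet("s3://my-bucket/validation/")

dtrain = RayDMatrix(train_ds, label="label")
dval = RayDMatrix(val_ds, label="label")

config = {
    "eta": 0.2,
    "gamma": 4,
    "max_depth": 20,
    "min_child_weight": 5,
    "subsample": 0.8,
    "objective": "reg:squarederror",
    "verbosity": 3,
    "eval_metric": ["logloss", "error", "rmse", "mae"],
}

evals_result = {}
bst = train(
    config,
    dtrain,
    num_boost_round=50,
    evals=[(dtrain, "train"), (dval, "validation")],
    evals_result=evals_result,
    # 13 actors, matching the "[RayXGBoost] Created 13 new actors" line in the
    # log; cpus_per_actor=48 is a placeholder (each ml.m5.12xlarge has 48 vCPUs).
    ray_params=RayParams(num_actors=13, cpus_per_actor=48),
)
```

The full log from the failing job follows: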
2023-11-24 01:18:51,518 INFO usage_lib.py:416 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-11-24 01:18:51,519 INFO scripts.py:744 -- Local node IP: 172.xx.xx.xx
2023-11-24 01:18:53,362 SUCC scripts.py:781 -- --------------------
2023-11-24 01:18:53,362 SUCC scripts.py:782 -- Ray runtime started.
2023-11-24 01:18:53,362 SUCC scripts.py:783 -- --------------------
2023-11-24 01:18:53,362 INFO scripts.py:785 -- Next steps
2023-11-24 01:18:53,362 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2023-11-24 01:18:53,363 INFO scripts.py:791 --   ray start --address='172.xx.xx.xx:9339'
2023-11-24 01:18:53,363 INFO scripts.py:800 -- To connect to this Ray cluster:
2023-11-24 01:18:53,363 INFO scripts.py:802 --   import ray
2023-11-24 01:18:53,363 INFO scripts.py:803 --   ray.init()
2023-11-24 01:18:53,363 INFO scripts.py:834 -- To terminate the Ray runtime, run
2023-11-24 01:18:53,363 INFO scripts.py:835 --   ray stop
2023-11-24 01:18:53,363 INFO scripts.py:838 -- To view the status of the cluster, use
2023-11-24 01:18:53,363 INFO scripts.py:839 --   ray status
2023-11-24 01:18:53,461 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 172.xx.xx.xx:9339...
2023-11-24 01:18:53,465 INFO worker.py:1673 -- Connected to Ray cluster.
Waiting 60 seconds for 15 nodes to join
1 nodes connected to cluster
1 nodes connected to cluster
[2023-11-24 01:19:02,995 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:02,549 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:03,936 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:03,615 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:03,924 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,363 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,005 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,042 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,301 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
9 nodes connected to cluster
[2023-11-24 01:19:04,855 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,694 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:05,247 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:05,542 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:06,057 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
All workers present and accounted for
{'node:172.xx.xx.xx': 1.0, 'CPU': 720.0, 'object_store_memory': 882278433174.0, 'memory': 2048649677418.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:__internal_head__': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0}
(pid=412) Metadata Fetch Progress 0: 0%| | 0/666 [00:00<?, ?it/s]
(pid=412) Metadata Fetch Progress 0: 0%| | 1/666 [00:02<23:36, 2.13s/it]
...
(pid=412) Metadata Fetch Progress 0: 99%|█████████▉| 662/666 [00:15<00:00, 54.62it/s]
(pid=412) Parquet Files Sample 0: 0%| | 0/10 [00:00<?, ?it/s]
(pid=412) Parquet Files Sample 0: 10%|█ | 1/10 [00:00<00:08, 1.05it/s]
(pid=412) Parquet Files Sample 0: 20%|██ | 2/10 [00:01<00:03, 2.21it/s]
(pid=412) Parquet Files Sample 0: 50%|█████ | 5/10 [00:01<00:00, 6.50it/s]
(pid=412) Parquet Files Sample 0: 80%|████████ | 8/10 [00:01<00:00, 7.73it/s]
(pid=412) Parquet Files Sample 0: 100%|██████████| 10/10 [00:01<00:00, 7.21it/s]
(pid=412) Metadata Fetch Progress 0: 0%| | 0/33 [00:00<?, ?it/s]
(pid=412) Metadata Fetch Progress 0: 3%|▎ | 1/33 [00:01<00:40, 1.27s/it]
(pid=412) Metadata Fetch Progress 0: 6%|▌ | 2/33 [00:01<00:18, 1.71it/s]
(pid=412) Metadata Fetch Progress 0: 9%|▉ | 3/33 [00:01<00:11, 2.72it/s]
(pid=412) Metadata Fetch Progress 0: 30%|███ | 10/33 [00:01<00:01, 12.70it/s]
(pid=412) Metadata Fetch Progress 0: 58%|█████▊ | 19/33 [00:01<00:00, 25.86it/s]
(pid=412) Metadata Fetch Progress 0: 82%|████████▏ | 27/33 [00:01<00:00, 36.14it/s]
(pid=412) Metadata Fetch Progress 0: 100%|██████████| 33/33 [00:01<00:00, 40.58it/s]
(pid=412) Parquet Files Sample 0: 0%| | 0/2 [00:00<?, ?it/s]
(pid=412) Parquet Files Sample 0: 50%|█████ | 1/2 [00:01<00:01, 1.27s/it]
(pid=412) Parquet Files Sample 0: 100%|██████████| 2/2 [00:01<00:00, 1.41it/s]
data loaded
{'memory': 2048649677418.0, 'object_store_memory': 882278433174.0, 'node:172.xx.xx.xx': 1.0, 'CPU': 720.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:__internal_head__': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0}
{'object_store_memory': 882241833075.0, 'memory': 2048649677418.0, 'CPU': 720.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:__internal_head__': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0}
Args: Namespace(eta=0.2, gamma=4, max_depth=20, min_child_weight=5, num_round=50, objective='reg:squarederror', subsample=0.8, verbosity=3, mlflow_tracking_uri='https://mlflow.xyz.com', experiment_name='xyz--xgboost-ray')
config: {'eta': 0.2, 'gamma': 4, 'max_depth': 20, 'min_child_weight': 5, 'subsample': 0.8, 'objective': 'reg:squarederror', 'verbosity': 3, 'eval_metric': ['logloss', 'error', 'rmse', 'mae']}
/opt/conda/lib/python3.10/site-packages/ray/train/_internal/session.py:638: UserWarning: `get_trial_id` is meant to only be called inside a function that is executed by a Tuner or Trainer. Returning `None`.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/ray/train/_internal/session.py:638: UserWarning: `get_trial_name` is meant to only be called inside a function that is executed by a Tuner or Trainer. Returning `None`.
warnings.warn(
2023-11-24 01:19:47,671 INFO main.py:1140 -- [RayXGBoost] Created 13 new actors (13 total actors). Waiting until actors are ready for training.
2023-11-24 01:19:47,714 INFO plan.py:757 -- Using autodetected parallelism=1440 for stage ReadParquet to satisfy parallelism at least twice the available number of CPUs (720).
Read progress 0: 0%| | 0/1440 [00:00<?, ?it/s]
Read progress 0: 0%| | 3/1440 [00:01<09:01, 2.65it/s]
...
Read progress 0: 28%|██▊ | 402/1440 [01:51<16:56, 1.02it/s]
Read progress 0: 28%|██▊ | 403/1440 [01:54<23:29, 1.36s/it]
(raylet, ip=172.xx.xx.xx) Spilled 32795 MiB, 76 objects, write throughput 323 MiB/s.
Read progress 0: 28%|██▊ | 403/1440 [01:55<23:29, 1.36s/it]
Read progress 0: 28%|██▊ | 404/1440 [01:55<20:58, 1.21s/it]
(raylet, ip=172.xx.xx.xx) Spilled 32796 MiB, 81 objects, write throughput 318 MiB/s.
...
(raylet, ip=172.xx.xx.xx) Spilled 65737 MiB, 159 objects, write throughput 233 MiB/s.
Read progress 0: 39%|███▉ | 567/1440 [04:54<12:20, 1.18it/s]
Read progress 0: 39%|███▉ | 568/1440 [04:54<25:37, 1.76s/it]
Read progress 0: 40%|███▉ | 569/1440 [04:55<21:56, 1.51s/it]
(raylet, ip=172.xx.xx.xx) Spilled 65895 MiB, 158 objects, write throughput 233 MiB/s.
...
Read progress 0: 61%|██████▏ | 884/1440 [10:43<07:17, 1.27it/s]
(raylet, ip=172.xx.xx.xx) Spilled 131486 MiB, 318 objects, write throughput 208 MiB/s.
Read progress 0: 61%|██████▏ | 884/1440 [10:44<07:17, 1.27it/s]
Read progress 0: 61%|██████▏ | 885/1440 [10:44<07:50, 1.18it/s]
Read progress 0: 62%|██████▏ | 886/1440 [10:45<08:23, 1.10it/s]
Read progress 0: 62%|██████▏ | 887/1440 [10:46<08:10, 1.13it/s]
Read progress 0: 62%|██████▏ | 895/1440 [10:46<02:40, 3.40it/s]
Read progress 0: 62%|██████▏ | 896/1440 [10:47<02:31, 3.60it/s]
(raylet, ip=172.xx.xx.xx) Spilled 176218 MiB, 417 objects, write throughput 277 MiB/s.
Read progress 0: 62%|██████▏ | 896/1440 [10:47<02:31, 3.60it/s]
Read progress 0: 62%|██████▏ | 897/1440 [10:47<03:14, 2.79it/s]
Read progress 0: 63%|██████▎ | 903/1440 [10:48<01:31, 5.89it/s]
Read progress 0: 63%|██████▎ | 910/1440 [10:49<01:25, 6.16it/s]
Read progress 0: 63%|██████▎ | 912/1440 [10:49<01:37, 5.39it/s]
Read progress 0: 63%|██████▎ | 914/1440 [10:49<01:26, 6.05it/s]
(autoscaler +11m53s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
Read progress 0: 94%|█████████▍| 1350/1440 [16:07<02:50, 1.89s/it]
Read progress 0: 94%|█████████▍| 1352/1440 [16:08<01:54, 1.30s/it]
Read progress 0: 94%|█████████▍| 1353/1440 [16:14<03:36, 2.49s/it]
Read progress 0: 96%|█████████▌| 1379/1440 [17:09<02:00, 1.98s/it]
(raylet, ip=172.xx.xx.xx) Spilled 274011 MiB, 651 objects, write throughput 269 MiB/s.
Read progress 0: 96%|█████████▌| 1379/1440 [17:12<02:00, 1.98s/it]
(raylet, ip=172.xx.xx.xx) Spilled 262444 MiB, 629 objects, write throughput 242 MiB/s.
Read progress 0: 98%|█████████▊| 1408/1440 [18:16<01:16, 2.40s/it]
Read progress 0: 100%|█████████▉| 1438/1440 [19:14<00:04, 2.29s/it]
Read progress 0: 100%|█████████▉| 1439/1440 [19:14<00:01, 1.80s/it]
Read progress 0: 100%|██████████| 1440/1440 [19:21<00:00, 3.20s/it]
2023-11-24 01:39:34,889 INFO plan.py:757 -- Using autodetected parallelism=1440 for stage ReadParquet to satisfy parallelism at least twice the available number of CPUs (720).
2023-11-24 01:39:34,889 INFO plan.py:762 -- To satisfy the requested parallelism of 1440, each read task output is split into 8 smaller blocks.
Traceback (most recent call last):
  File "/opt/ml/code/train_xgboost.py", line 136, in <module>
    train_xgboost(args)
  File "/opt/ml/code/train_xgboost.py", line 70, in train_xgboost
    bst = train(
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/main.py", line 1612, in train
    bst, train_evals_result, train_additional_results = _train(
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/main.py", line 1152, in _train
    deval.assign_shards_to_actors(_training_state.actors)
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/matrix.py", line 884, in assign_shards_to_actors
    success = self.loader.assign_shards_to_actors(actors)
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/matrix.py", line 595, in assign_shards_to_actors
    data, actor_shards = data_source.get_actor_shards(self.data, actors)
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/data_sources/ray_dataset.py", line 95, in get_actor_shards
    dataset_splits = data.split(
  File "/opt/conda/lib/python3.10/site-packages/ray/data/dataset.py", line 1344, in split
    blocks = self._plan.execute()
  File "/opt/conda/lib/python3.10/site-packages/ray/data/_internal/plan.py", line 646, in execute
    reply = get_memory_info_reply(
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/internal_api.py", line 82, in get_memory_info_reply
    reply = stub.FormatGlobalMemoryInfo(
  File "/opt/conda/lib/python3.10/site-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/lib/python3.10/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2023-11-24T01:40:34.892952909+00:00", grpc_status:4}"
>
2023-11-24 01:40:36,491 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-11-24 01:40:36,492 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2023-11-24 01:40:36,492 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-11-24 01:40:36,492 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/opt/conda/bin/python3.10 train_xgboost.py --eta 0.2 --experiment_name xyz --gamma 4 --max_depth 20 --min_child_weight 5 --mlflow_tracking_uri https://mlflow.xyz.com --num_round 50 --objective reg:squarederror --subsample 0.8 --verbosity 3"
2023-11-24 01:40:36,492 sagemaker-training-toolkit ERROR Encountered exit_code 1
2023-11-24 01:41:02 Uploading - Uploading generated training model
2023-11-24 01:41:02 Failed - Training job failed
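Looking at the traceback, the error is raised while xgboost_ray assigns the validation set shards to the 13 training actors: `assign_shards_to_actors` ends up calling `Dataset.split()`, which executes the lazy ReadParquet plan for the validation data, and the internal gRPC call Ray makes during that execution (`FormatGlobalMemoryInfo`, reached via `get_memory_info_reply`) times out after roughly 60 seconds with DEADLINE_EXCEEDED (01:39:34 to 01:40:34 in the log). A standalone version of just that step, which I could use to try to isolate the problem (placeholder S3 path, 13 splits to mirror the 13 actors, no locality hints), would look roughly like this:

```python
import ray

ray.init(address="auto")

# Reproduce only the step the traceback points at: splitting a Ray Dataset
# read from Parquet into one shard per training actor. The S3 path is a
# placeholder and 13 mirrors the number of RayXGBoost actors in the log.
val_ds = ray.data.read_parquet("s3://my-bucket/validation/")
splits = val_ds.split(13)  # forces the lazy ReadParquet plan to execute
print([s.count() for s in splits])
```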