How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.

I'm trying to train an XGBoost model with xgboost_ray in SageMaker. Training works for a smaller dataset (11.5 GB train, 150 MB validation), but when I run it on a larger dataset (230 GB train, 3 GB validation), the job fails with the error shared below.

- The read from Parquet usually works and reaches 100%, albeit with some objects spilled.
- The script then raises a DEADLINE_EXCEEDED error.

I ran this on 15 ml.m5.12xlarge instances, which together provide enough memory to cover 3x the dataset size: 882 GB of object store memory and 2048 GB of memory. Why would this fail on the larger dataset?
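For reference, here is a minimal sketch of what the relevant part of train_xgboost.py does (simplified: the S3 paths, the label column name, and the RayParams values are placeholders rather than the exact values from the failing job):

```python
import ray
from xgboost_ray import RayDMatrix, RayParams, train

# Connect to the Ray cluster started via `ray start` on the SageMaker instances.
ray.init(address="auto")

# Train/validation data are read from Parquet on S3 as Ray Datasets.
# Paths and label column are placeholders for the real ones.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")
val_ds = ray.data.read_parquet("s3://my-bucket/validation/")

dtrain = RayDMatrix(train_ds, label="label")
dval = RayDMatrix(val_ds, label="label")

config = {
    "eta": 0.2,
    "gamma": 4,
    "max_depth": 20,
    "min_child_weight": 5,
    "subsample": 0.8,
    "objective": "reg:squarederror",
    "verbosity": 3,
    "eval_metric": ["logloss", "error", "rmse", "mae"],
}

evals_result = {}
bst = train(
    config,
    dtrain,
    num_boost_round=50,
    evals=[(dtrain, "train"), (dval, "validation")],
    evals_result=evals_result,
    # 13 actors, matching the "[RayXGBoost] Created 13 new actors" line in the
    # log; cpus_per_actor=48 is a placeholder (each ml.m5.12xlarge has 48 vCPUs).
    ray_params=RayParams(num_actors=13, cpus_per_actor=48),
)
```

The full log from the failing job follows: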
2023-11-24 01:18:51,518 INFO usage_lib.py:416 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-11-24 01:18:51,519 INFO scripts.py:744 -- Local node IP: 172.xx.xx.xx
2023-11-24 01:18:53,362 SUCC scripts.py:781 -- --------------------
2023-11-24 01:18:53,362 SUCC scripts.py:782 -- Ray runtime started.
2023-11-24 01:18:53,362 SUCC scripts.py:783 -- --------------------
2023-11-24 01:18:53,362 INFO scripts.py:785 -- Next steps
2023-11-24 01:18:53,362 INFO scripts.py:788 -- To add another node to this Ray cluster, run
2023-11-24 01:18:53,363 INFO scripts.py:791 --   ray start --address='172.xx.xx.xx:9339'
2023-11-24 01:18:53,363 INFO scripts.py:800 -- To connect to this Ray cluster:
2023-11-24 01:18:53,363 INFO scripts.py:802 --   import ray
2023-11-24 01:18:53,363 INFO scripts.py:803 --   ray.init()
2023-11-24 01:18:53,363 INFO scripts.py:834 -- To terminate the Ray runtime, run
2023-11-24 01:18:53,363 INFO scripts.py:835 --   ray stop
2023-11-24 01:18:53,363 INFO scripts.py:838 -- To view the status of the cluster, use
2023-11-24 01:18:53,363 INFO scripts.py:839 --   ray status
2023-11-24 01:18:53,461 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 172.xx.xx.xx:9339...
2023-11-24 01:18:53,465 INFO worker.py:1673 -- Connected to Ray cluster.
Waiting 60 seconds for 15 nodes to join
1 nodes connected to cluster
1 nodes connected to cluster
[2023-11-24 01:19:02,995 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:02,549 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:03,936 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:03,615 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:03,924 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,363 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,005 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,042 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,301 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
9 nodes connected to cluster
[2023-11-24 01:19:04,855 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:04,694 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:05,247 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:05,542 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
[2023-11-24 01:19:06,057 I 132 132] global_state_accessor.cc:374: This node has an IP address of 172.xx.xx.xx, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
All workers present and accounted for
{'node:172.xx.xx.xx': 1.0, 'CPU': 720.0, 'object_store_memory': 882278433174.0, 'memory': 2048649677418.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:__internal_head__': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0}
(pid=412) Metadata Fetch Progress 0: 0%| | 0/666 [00:00<?, ?it/s]
(pid=412) Metadata Fetch Progress 0: 0%| | 1/666 [00:02<23:36, 2.13s/it]
...
(pid=412) Metadata Fetch Progress 0: 99%|█████████▉| 662/666 [00:15<00:00, 54.62it/s]
(pid=412) Parquet Files Sample 0: 0%| | 0/10 [00:00<?, ?it/s]
(pid=412) Parquet Files Sample 0: 10%|█ | 1/10 [00:00<00:08, 1.05it/s]
(pid=412) Parquet Files Sample 0: 20%|██ | 2/10 [00:01<00:03, 2.21it/s]
(pid=412) Parquet Files Sample 0: 50%|█████ | 5/10 [00:01<00:00, 6.50it/s]
(pid=412) Parquet Files Sample 0: 80%|████████ | 8/10 [00:01<00:00, 7.73it/s]
(pid=412) Parquet Files Sample 0: 100%|██████████| 10/10 [00:01<00:00, 7.21it/s]
(pid=412) Metadata Fetch Progress 0: 0%| | 0/33 [00:00<?, ?it/s]
(pid=412) Metadata Fetch Progress 0: 3%|▎ | 1/33 [00:01<00:40, 1.27s/it]
(pid=412) Metadata Fetch Progress 0: 6%|▌ | 2/33 [00:01<00:18, 1.71it/s]
(pid=412) Metadata Fetch Progress 0: 9%|▉ | 3/33 [00:01<00:11, 2.72it/s]
(pid=412) Metadata Fetch Progress 0: 30%|███ | 10/33 [00:01<00:01, 12.70it/s]
(pid=412) Metadata Fetch Progress 0: 58%|█████▊ | 19/33 [00:01<00:00, 25.86it/s]
(pid=412) Metadata Fetch Progress 0: 82%|████████▏ | 27/33 [00:01<00:00, 36.14it/s]
(pid=412) Metadata Fetch Progress 0: 100%|██████████| 33/33 [00:01<00:00, 40.58it/s]
(pid=412) Parquet Files Sample 0: 0%| | 0/2 [00:00<?, ?it/s]
(pid=412) Parquet Files Sample 0: 50%|█████ | 1/2 [00:01<00:01, 1.27s/it]
(pid=412) Parquet Files Sample 0: 100%|██████████| 2/2 [00:01<00:00, 1.41it/s]
data loaded
{'memory': 2048649677418.0, 'object_store_memory': 882278433174.0, 'node:172.xx.xx.xx': 1.0, 'CPU': 720.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:__internal_head__': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0}
{'object_store_memory': 882241833075.0, 'memory': 2048649677418.0, 'CPU': 720.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:__internal_head__': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0, 'node:172.xx.xx.xx': 1.0}
Args: Namespace(eta=0.2, gamma=4, max_depth=20, min_child_weight=5, num_round=50, objective='reg:squarederror', subsample=0.8, verbosity=3, mlflow_tracking_uri='https://mlflow.xyz.com', experiment_name='xyz--xgboost-ray')
config: {'eta': 0.2, 'gamma': 4, 'max_depth': 20, 'min_child_weight': 5, 'subsample': 0.8, 'objective': 'reg:squarederror', 'verbosity': 3, 'eval_metric': ['logloss', 'error', 'rmse', 'mae']}
/opt/conda/lib/python3.10/site-packages/ray/train/_internal/session.py:638: UserWarning: `get_trial_id` is meant to only be called inside a function that is executed by a Tuner or Trainer. Returning `None`.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/ray/train/_internal/session.py:638: UserWarning: `get_trial_name` is meant to only be called inside a function that is executed by a Tuner or Trainer. Returning `None`.
warnings.warn(
2023-11-24 01:19:47,671 INFO main.py:1140 -- [RayXGBoost] Created 13 new actors (13 total actors). Waiting until actors are ready for training.
2023-11-24 01:19:47,714 INFO plan.py:757 -- Using autodetected parallelism=1440 for stage ReadParquet to satisfy parallelism at least twice the available number of CPUs (720).
Read progress 0: 0%| | 0/1440 [00:00<?, ?it/s]
Read progress 0: 0%| | 3/1440 [00:01<09:01, 2.65it/s]
...
Read progress 0: 28%|██▊ | 402/1440 [01:51<16:56, 1.02it/s]
Read progress 0: 28%|██▊ | 403/1440 [01:54<23:29, 1.36s/it]
(raylet, ip=172.xx.xx.xx) Spilled 32795 MiB, 76 objects, write throughput 323 MiB/s.
Read progress 0: 28%|██▊ | 403/1440 [01:55<23:29, 1.36s/it]
Read progress 0: 28%|██▊ | 404/1440 [01:55<20:58, 1.21s/it]
(raylet, ip=172.xx.xx.xx) Spilled 32796 MiB, 81 objects, write throughput 318 MiB/s.
...
(raylet, ip=172.xx.xx.xx) Spilled 65737 MiB, 159 objects, write throughput 233 MiB/s.
Read progress 0: 39%|███▉ | 567/1440 [04:54<12:20, 1.18it/s]
Read progress 0: 39%|███▉ | 568/1440 [04:54<25:37, 1.76s/it]
Read progress 0: 40%|███▉ | 569/1440 [04:55<21:56, 1.51s/it]
(raylet, ip=172.xx.xx.xx) Spilled 65895 MiB, 158 objects, write throughput 233 MiB/s.
...
Read progress 0: 61%|██████▏ | 884/1440 [10:43<07:17, 1.27it/s]
(raylet, ip=172.xx.xx.xx) Spilled 131486 MiB, 318 objects, write throughput 208 MiB/s.
Read progress 0: 61%|██████▏ | 884/1440 [10:44<07:17, 1.27it/s]
Read progress 0: 61%|██████▏ | 885/1440 [10:44<07:50, 1.18it/s]
Read progress 0: 62%|██████▏ | 886/1440 [10:45<08:23, 1.10it/s]
Read progress 0: 62%|██████▏ | 887/1440 [10:46<08:10, 1.13it/s]
Read progress 0: 62%|██████▏ | 895/1440 [10:46<02:40, 3.40it/s]
Read progress 0: 62%|██████▏ | 896/1440 [10:47<02:31, 3.60it/s]
(raylet, ip=172.xx.xx.xx) Spilled 176218 MiB, 417 objects, write throughput 277 MiB/s.
Read progress 0: 62%|██████▏ | 896/1440 [10:47<02:31, 3.60it/s]
Read progress 0: 62%|██████▏ | 897/1440 [10:47<03:14, 2.79it/s]
Read progress 0: 63%|██████▎ | 903/1440 [10:48<01:31, 5.89it/s]
Read progress 0: 63%|██████▎ | 910/1440 [10:49<01:25, 6.16it/s]
Read progress 0: 63%|██████▎ | 912/1440 [10:49<01:37, 5.39it/s]
Read progress 0: 63%|██████▎ | 914/1440 [10:49<01:26, 6.05it/s]
(autoscaler +11m53s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
Read progress 0: 94%|█████████▍| 1350/1440 [16:07<02:50, 1.89s/it]
Read progress 0: 94%|█████████▍| 1352/1440 [16:08<01:54, 1.30s/it]
Read progress 0: 94%|█████████▍| 1353/1440 [16:14<03:36, 2.49s/it]
Read progress 0: 96%|█████████▌| 1379/1440 [17:09<02:00, 1.98s/it]
(raylet, ip=172.xx.xx.xx) Spilled 274011 MiB, 651 objects, write throughput 269 MiB/s.
Read progress 0: 96%|█████████▌| 1379/1440 [17:12<02:00, 1.98s/it]
(raylet, ip=172.xx.xx.xx) Spilled 262444 MiB, 629 objects, write throughput 242 MiB/s.
Read progress 0: 98%|█████████▊| 1408/1440 [18:16<01:16, 2.40s/it]
Read progress 0: 100%|█████████▉| 1438/1440 [19:14<00:04, 2.29s/it]
Read progress 0: 100%|█████████▉| 1439/1440 [19:14<00:01, 1.80s/it]
Read progress 0: 100%|██████████| 1440/1440 [19:21<00:00, 3.20s/it]
2023-11-24 01:39:34,889 INFO plan.py:757 -- Using autodetected parallelism=1440 for stage ReadParquet to satisfy parallelism at least twice the available number of CPUs (720).
2023-11-24 01:39:34,889 INFO plan.py:762 -- To satisfy the requested parallelism of 1440, each read task output is split into 8 smaller blocks.
Traceback (most recent call last):
  File "/opt/ml/code/train_xgboost.py", line 136, in <module>
    train_xgboost(args)
  File "/opt/ml/code/train_xgboost.py", line 70, in train_xgboost
    bst = train(
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/main.py", line 1612, in train
    bst, train_evals_result, train_additional_results = _train(
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/main.py", line 1152, in _train
    deval.assign_shards_to_actors(_training_state.actors)
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/matrix.py", line 884, in assign_shards_to_actors
    success = self.loader.assign_shards_to_actors(actors)
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/matrix.py", line 595, in assign_shards_to_actors
    data, actor_shards = data_source.get_actor_shards(self.data, actors)
  File "/opt/conda/lib/python3.10/site-packages/xgboost_ray/data_sources/ray_dataset.py", line 95, in get_actor_shards
    dataset_splits = data.split(
  File "/opt/conda/lib/python3.10/site-packages/ray/data/dataset.py", line 1344, in split
    blocks = self._plan.execute()
  File "/opt/conda/lib/python3.10/site-packages/ray/data/_internal/plan.py", line 646, in execute
    reply = get_memory_info_reply(
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/internal_api.py", line 82, in get_memory_info_reply
    reply = stub.FormatGlobalMemoryInfo(
  File "/opt/conda/lib/python3.10/site-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/lib/python3.10/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2023-11-24T01:40:34.892952909+00:00", grpc_status:4}"
>
2023-11-24 01:40:36,491 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-11-24 01:40:36,492 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2023-11-24 01:40:36,492 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-11-24 01:40:36,492 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/opt/conda/bin/python3.10 train_xgboost.py --eta 0.2 --experiment_name xyz --gamma 4 --max_depth 20 --min_child_weight 5 --mlflow_tracking_uri https://mlflow.xyz.com --num_round 50 --objective reg:squarederror --subsample 0.8 --verbosity 3"
2023-11-24 01:40:36,492 sagemaker-training-toolkit ERROR Encountered exit_code 1
2023-11-24 01:41:02 Uploading - Uploading generated training model
2023-11-24 01:41:02 Failed - Training job failed
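Looking at the traceback, the error is raised while xgboost_ray assigns the validation set shards to the 13 training actors: `assign_shards_to_actors` ends up calling `Dataset.split()`, which executes the lazy ReadParquet plan for the validation data, and the internal gRPC call Ray makes during that execution (`FormatGlobalMemoryInfo`, reached via `get_memory_info_reply`) times out after roughly 60 seconds with DEADLINE_EXCEEDED (01:39:34 to 01:40:34 in the log). A standalone version of just that step, which I could use to try to isolate the problem (placeholder S3 path, 13 splits to mirror the 13 actors, no locality hints), would look roughly like this:

```python
import ray

ray.init(address="auto")

# Reproduce only the step the traceback points at: splitting a Ray Dataset
# read from Parquet into one shard per training actor. The S3 path is a
# placeholder and 13 mirrors the number of RayXGBoost actors in the log.
val_ds = ray.data.read_parquet("s3://my-bucket/validation/")
splits = val_ds.split(13)  # forces the lazy ReadParquet plan to execute
print([s.count() for s in splits])
```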