Tuning Settings for Big Data

I'm looking for help tuning Ray Data / TorchTrainer for large tabular data. The training set is 1 billion rows with 440 features, and I keep running into OOM issues. My cluster has 4 nodes with a total of 160 CPUs, 32 GPUs, 2 TB of RAM, and 8 TB of disk.

Below is my latest output; the job crashes not long after these messages.

(TorchTrainer pid=173665) Created DatasetPipeline with 5 windows: 285.04GiB min, 368.94GiB max, 352.09GiB mean

(TorchTrainer pid=173665) Blocks per window: 1206 min, 1489 max, 1430 mean

(TorchTrainer pid=173665) ✔️ This pipeline's per-window parallelism is high enough to fully utilize the cluster.

(TorchTrainer pid=173665) ⚠️ This pipeline's windows are ~352.09GiB in size each and may not fit in object store memory without spilling. To improve performance, consider reducing the size of each window to 186.26GiB or less.

Starting to train with a dataset of size: 1047826762!
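To act on the warning above, one option is to shrink each window below the suggested 186.26 GiB by increasing the window count. A quick sketch of the arithmetic, using the figures from the log (the 5 windows at ~352.09 GiB mean, and the 186.26 GiB limit the warning suggests):

```python
import math

GiB = 1024 ** 3

# Figures taken from the pipeline log above.
total_bytes = 5 * int(352.09 * GiB)   # ~5 windows at the mean window size
limit = int(186.26 * GiB)             # suggested max window size from the warning

# Smallest window count that keeps every window at or under the limit.
n_windows = math.ceil(total_bytes / limit)
bytes_per_window = total_bytes // n_windows

print(n_windows)                      # 10 windows instead of 5
print(bytes_per_window / GiB)         # ~176 GiB per window
```

Assuming the pipeline is built with `Dataset.window()` (which the "Created DatasetPipeline" message suggests), this value would then be passed as `ds.window(bytes_per_window=bytes_per_window)` instead of whatever setting produced the current 5 windows.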