Large (5x) difference in Ray AIR memory usage on different machines

I’m using a fairly standard Ray tuning loop that builds a Ray Dataset from a Pandas DataFrame and performs hyperparameter tuning for a TorchTrainer on that dataset.
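For context, the setup looks roughly like the sketch below (the actual model, data loading, and search space are omitted; the file name, batch sizes, and other values here are placeholders, not my real configuration):

```python
# Rough sketch of the setup (Ray 2.x AIR API); details are placeholders.
import pandas as pd
import ray
from ray import tune
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune import Tuner

df = pd.read_parquet("data.parquet")  # ~100k rows, 180 float/int8 columns
ds = ray.data.from_pandas(df)

def train_loop_per_worker(config):
    # Each trial reads its shard of the Ray Dataset and trains a PyTorch model.
    shard = session.get_dataset_shard("train")
    for batch in shard.iter_torch_batches(batch_size=config["batch_size"]):
        ...  # forward/backward pass of the model
    session.report({"loss": 0.0})

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1, use_gpu=False),
    datasets={"train": ds},
)
tuner = Tuner(
    trainer,
    param_space={"train_loop_config": {"batch_size": tune.choice([256, 512])}},
    tune_config=tune.TuneConfig(num_samples=100),
)
results = tuner.fit()
```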

The training loop works as intended on a simple dataset (approx. 100k rows, 180 columns, all float/int8 dtypes). However, I’ve noticed that Ray AIR’s memory usage is very different on different hardware setups:

  • On my local machine (CPU-only), Ray AIR uses just under 10 GB of memory for a short tuning run (one sample): Ray prints `Memory usage on this node: 9.8/16.0 GiB` to the terminal during the experiment.
  • On another machine (A40 GPU + 52 CPUs), Ray AIR uses over 55 GB of memory to run the same experiment: Ray prints `Memory usage on this node: 55.8/1007.1 GiB`.

This extra memory usage results in frequent OOM errors when attempting to run experiments, but I am not sure how to diagnose the issue here. The environments in both cases should be identical, as I’m using conda, and the underlying data is identical as well.

Any suggestions as to what might be going on here, or how I can better diagnose the problem? Specifically, I’m wondering:

  • why the Ray AIR runtime would use 5x the memory on different hardware for an otherwise-identical run,
  • how to diagnose what is consuming the additional memory, and
  • how to reduce the memory blowup (since I know the experiment can run with a smaller footprint).

Thank you!

Have you tried running `ray memory` to see where the increased memory usage is coming from?

Could you provide a reproduction script? In general, Ray will try to use more of the available memory for the object store to improve processing efficiency (and, e.g., Tune will launch more concurrent trials on larger nodes). However, an OOM is unexpected.
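If you want to rule both of those out, you can pin them explicitly; here is a rough sketch (the values are only illustrative, not recommendations):

```python
# Sketch: cap the object store size and trial concurrency explicitly,
# instead of letting Ray scale them with the node's total memory/CPUs.
import ray
from ray import tune

# Cap the plasma object store (in bytes); by default Ray sizes it as a
# fraction of the node's available memory, so it grows on big machines.
ray.init(object_store_memory=8 * 1024**3)

# Cap how many trials Tune runs concurrently, regardless of CPU count.
tune_config = tune.TuneConfig(num_samples=100, max_concurrent_trials=4)
# then pass this as Tuner(trainer, tune_config=tune_config, ...)
```

If the footprint on the large node drops with these caps in place, that would suggest the gap is Ray scaling itself to the bigger machine rather than a leak.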

The metric graphs in the Ray dashboard (available in Ray 2.1+) may also help with diagnosis.

Thanks for the replies. The output of `ray memory` doesn’t seem to contain any useful information. It ends with:

--- Aggregate object store stats across all nodes ---
Plasma memory usage 1609 MiB, 547 objects, 16.88% full, 16.88% needed
Objects consumed by Ray tasks: 1701 MiB.

which seems to suggest that the object store is not the cause of the issue.

It is also pretty difficult to interpret the results below, which are printed to the console during the Tuner.fit() run. Nothing seems to be taking up nearly the amount of memory that I would expect to cause an OOM on a machine with 250+ GB of RAM:

The actor is dead because its worker process has died. Worker exit type: NODE_OUT_OF_MEMORY Worker exit detail: Task was killed due to the node running low on memory.
Memory on the node (IP: 128.208.6.50, ID: 25315a171b532338fc0de2327a3bf9ab3f5e5ebeb9ed82d17c71ed87) where the task (task ID: ffffffffffffffffc1021ef3f8a00edeb1f68f9001000000, name=_Inner.__init__, pid=3086920, memory used=0.27GB) was running was 250.91GB / 251.73GB (0.996739), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 5907a446220ff40f882a29dc7fe3ab0d23dc3976f00ac88b65b99ffe) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 128.208.6.50`. To see the logs of the worker, use `ray logs worker-5907a446220ff40f882a29dc7fe3ab0d23dc3976f00ac88b65b99ffe*out -ip 128.208.6.50`. Top 10 memory users:
PID     MEM(GB) COMMAND
1453756 81.87   ray::_Inner.train
2641626 0.92    /homes/gws/jpgard/anaconda3/envs/tableshift/lib/python3.8/site-packages/ray/core/src/ray/raylet/rayl...
498322  0.45    python experiments/domain_shift.py --num_samples 100 --use_cached --cpu_models_only --experiment acs...
3086920 0.27    ray::_Inner.train
2176767 0.25    ray::_Inner.train
2253982 0.24    ray::_Inner.train
2250585 0.24    ray::_Inner.train
2200373 0.23    ray::_Inner.train
2200266 0.23    ray::_Inner.train
2259255 0.23    ray::_Inner.train
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

These warnings only start to appear after many trials (hundreds) have been run. However, I am calling ray.init(), running a single Tuner.fit() call with 100 hyperparameter samples, and then calling ray.shutdown(). I would expect this to free all of the memory consumed by each trial, but it seems that the memory is not being freed; instead it accumulates across trials and eventually causes the OOM that brings everything to a halt.
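For reference, the driver-side flow is essentially this (a sketch that reuses the `trainer` from the sketch in my first post; the real code lives in experiments/domain_shift.py):

```python
# Driver-side flow as described above (sketch; `trainer` is built as in the
# earlier snippet, so this is not standalone-runnable on its own).
import ray
from ray import tune
from ray.tune import Tuner

ray.init()
tuner = Tuner(trainer, tune_config=tune.TuneConfig(num_samples=100))
results = tuner.fit()  # single fit() call, 100 hyperparameter samples
ray.shutdown()         # I expected this to release all per-trial memory
```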

Any thoughts here?