GPU memory management with TensorFlow (OOM)

I am currently dealing with an out-of-memory (OOM) error and cannot figure out how to solve it. I am trying to run TensorFlow inside Ray clients (training models on CIFAR-100) using two GPUs, but whatever num_gpus value I use, I hit the same issue.

Here is the code for the Ray initialization through the Flower framework:

    client_resources = {
        "num_cpus": 1.0,
        "num_gpus": 1.0 / nbClient,
    }

    result = fl.simulation.start_simulation(
        client_fn=client_training_fn,
        num_clients=min_available_clients,
        config=fl.server.ServerConfig(num_rounds=FLAGS.num_rounds),
        strategy=strategy,
        ray_init_args=ray_server_config,
        client_resources=client_resources,
    )

And the output:

INFO flwr 2024-03-29 17:41:57,120 | app.py:242 | Flower VCE: Resources for each Virtual Client: {'num_cpus': 1.0, 'num_gpus': 0.009900990099009901}

I run 100 clients, so I tried to split each GPU 100 ways.
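
As a sanity check on that fraction: the value Flower logs (0.009900990099009901) corresponds to 1/101 rather than 1/100, so nbClient seems to end up as 101 here (maybe an off-by-one on my side, or the server being counted; just a guess):

    # Hypothetical check: compare the possible splits with the logged value.
    nb_clients = 100
    print(1.0 / nb_clients)        # 0.01
    print(1.0 / (nb_clients + 1))  # 0.009900990099009901  <- matches the log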

2024-04-02 15:43:02.775012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31141 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
2024-04-02 15:43:02.775290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 31141 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1c:00.0, compute capability: 7.0

Both GPUs are detected (good).

And then… OOM

2024-04-02 15:43:26.368100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 460 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
(DefaultActor pid=39424) 2024-04-02 15:43:26.376283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 460 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
(DefaultActor pid=39421) 2024-04-02 15:43:26.881724: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:371] A non-primary context 0x9ee6fc0 for device 0 exists before initializing the StreamExecutor. The primary context is now 0. We haven't verified StreamExecutor works with that.

(...error stack...)

  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/device_lib.py", line 41, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Bad StatusOr access: INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 34079899648

(...)

RuntimeError: Bad StatusOr access: INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 34079899648
(DefaultActor pid=39423) 2024-04-02 15:43:27.119271: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:746] failed to allocate 166.62MiB (174718976 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(DefaultActor pid=39423) 2024-04-02 15:43:27.119400: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:746] failed to allocate 149.96MiB (157247232 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(DefaultActor pid=39423) (... the same "failed to allocate ... CUDA_ERROR_OUT_OF_MEMORY: out of memory" message repeats with progressively smaller sizes, down to 10.77MiB ...)
(DefaultActor pid=39423) 2024-04-02 15:43:27.121814: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:746] failed to allocate 166.62MiB (174718976 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(DefaultActor pid=39424) 2024-04-02 15:43:28.518985: F ./tensorflow/core/kernels/random_op_gpu.h:247] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), key, counter, gen, data, size, dist) status: INTERNAL: out of memory
(DefaultActor pid=39424) *** SIGABRT received at time=1712065408 on cpu 4 ***
(DefaultActor pid=39424) PC: @     0x14b990d15a7c  (unknown)  pthread_kill
(DefaultActor pid=39424)     @     0x14b990cc1520  (unknown)  (unknown)
(DefaultActor pid=39424) [2024-04-02 15:43:28,524 E 39424 39424] logging.cc:361: *** SIGABRT received at time=1712065408 on cpu 4 ***
(DefaultActor pid=39424) [2024-04-02 15:43:28,524 E 39424 39424] logging.cc:361: PC: @     0x14b990d15a7c  (unknown)  pthread_kill
(DefaultActor pid=39424) [2024-04-02 15:43:28,524 E 39424 39424] logging.cc:361:     @     0x14b990cc1520  (unknown)  (unknown)
(DefaultActor pid=39424) Fatal Python error: Aborted

I also wonder why a device is created with only 460 MB when GPU 0 has 32 GB:
Created device /job:localhost/replica:0/task:0/device:GPU:0 with 460 MB memory: -> device: 0
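
In case it helps with diagnosis, I was planning to log the free memory each actor actually sees at startup. A minimal sketch using pynvml (assuming the package is available on the node; this is not yet in my code):

    import pynvml

    # Sketch only: print total/free/used memory for each visible GPU,
    # to compare with the 460 MB that TensorFlow reports per actor.
    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: total={mem.total / 2**20:.0f} MiB, "
              f"free={mem.free / 2**20:.0f} MiB, used={mem.used / 2**20:.0f} MiB")
    pynvml.nvmlShutdown()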

Several answers on the web say the batch size should be reduced in case of OOM. However, my batch size is already set to 1, so I don't think that is the cause.
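
The only other lead I have is that TensorFlow reserves (almost) all GPU memory per process by default, regardless of the Ray num_gpus fraction. If that is the problem, I assume the workaround would be something like the following at the top of each client, before any op touches the GPU (sketch only, not yet in my code):

    import tensorflow as tf

    # Sketch: ask TF to allocate GPU memory on demand instead of grabbing
    # the whole device. Must run before the GPU is first initialized in
    # the Ray actor process (i.e. at the start of client_fn).
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError as err:
            # Raised if the GPU was already initialized in this process.
            print(f"Could not enable memory growth: {err}")

But I am not sure whether this would even help, given the 460 MB device reported above.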

Any idea how to solve this issue? What could be the reasons for it?