Running the Ray training example gives an error

I saw an example in the Ray docs and wanted to test it.
import json
import os

import tensorflow as tf

# mnist_dataset() and build_and_compile_cnn_model() are the helper
# functions defined earlier in the same docs example.

def train_func_distributed():
    per_worker_batch_size = 64
    # This environment variable will be set by Ray Train.
    tf_config = json.loads(os.environ["TF_CONFIG"])
    num_workers = len(tf_config["cluster"]["worker"])

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    global_batch_size = per_worker_batch_size * num_workers
    multi_worker_dataset = mnist_dataset(global_batch_size)

    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_and_compile_cnn_model()

    multi_worker_model.fit(multi_worker_dataset, epochs=30, steps_per_epoch=70)

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

# For GPU training, set use_gpu to True.
use_gpu = True

trainer = TensorflowTrainer(
    train_func_distributed,
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 0.1, "CPU": 1},
    ),
)

trainer.fit()

I am running this on my laptop (N552VW) with a single 4 GB GPU and 8 CPUs.
I always get this error:
(raylet) bash: no version information available (required by bash)
failed to allocate 3.06GiB (3286263296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.414099: failed to allocate 2.75GiB (2957636864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.419954: failed to allocate 2.48GiB (2661873152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.427097: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.23GiB (2395685888 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.434241: failed to allocate 2.01GiB (2156117248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

(RayTrainWorker pid=27318) Epoch 1/30

(RayTrainWorker pid=27318) 2023-10-23 13:56:18.077496: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8700
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.078657: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at [repeated 27x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see Configuring Logging — Ray 3.0.0.dev0 for more options.)
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.078874: I Created device /device:GPU:0 with 3482 MB memory: -> device: 0, name: NVIDIA GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0 [repeated 2x across cluster]
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.097412: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:449] Started server with target: grpc://192.168.10.238:40809
(RayTrainWorker pid=27318) 2023-10-23 13:56:09.289532: I tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:298] Coordination agent has successfully connected.
(RayTrainWorker pid=27319) 2023-10-23 13:56:13.679728: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 188160000 exceeds 10% of free system memory. [repeated 7x across cluster]
(RayTrainWorker pid=27319) 2023-10-23

0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_1078]
2023-10-23 13:56:19,409 ERROR tune.py:1139 -- Trials did not complete: [TensorflowTrainer_860b3_00000]
2023-10-23 13:56:19,412 INFO tune.py:1143 -- Total run time: 35.12 seconds (34.96 seconds for the tuning loop).

What should I do?

The error message you're seeing indicates that your GPU is running out of memory during training: the model, its gradients, and the input batches don't fit in the memory that is still available. Note that with num_workers=2, use_gpu=True, and resources_per_worker={"GPU": 0.1}, both Ray Train workers are scheduled onto your single 4 GB GPU, and by default each TensorFlow process tries to reserve most of the device's memory, which makes running out much more likely.
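To make that concrete, here is a minimal sketch, using only the APIs already in your script, of a ScalingConfig that gives the whole GPU to a single worker so that only one TensorFlow process allocates memory on the 4 GB card. Treat it as a sanity check for the memory pressure, not necessarily the configuration you want to keep:

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

# One worker that owns the whole GPU; with use_gpu=True, Ray Train
# assigns one GPU per worker by default.
trainer = TensorflowTrainer(
    train_func_distributed,
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)
trainer.fit()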

Here are a few potential solutions:

  1. Decrease the batch size: the batch size determines how many samples each worker processes per training step before the model parameters are updated. A smaller batch size needs less GPU memory for activations and gradients, though it can make training slower and the results a bit noisier (see the first sketch after this list).

  2. Use a smaller model: fewer or narrower layers (or, if you are using a pre-trained model, a smaller variant of it) require less memory, usually at some cost in accuracy (see the second sketch below).

  3. Use a GPU with more memory: If possible, you might want to consider upgrading to a GPU with more memory.

  4. Use model parallelism: for very large models, advanced users can shard the model across multiple GPUs. This can alleviate memory pressure, but it requires more complex code and more than one GPU.
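Here is the batch-size sketch for option 1. It only changes your training function; mnist_dataset() and build_and_compile_cnn_model() are assumed to be the same helpers as in the docs example you copied:

def train_func_distributed():
    # Smaller per-worker batch size -> less activation/gradient memory per step.
    per_worker_batch_size = 16  # was 64; lower it further if the OOM persists

    # Set by Ray Train for each worker.
    tf_config = json.loads(os.environ["TF_CONFIG"])
    num_workers = len(tf_config["cluster"]["worker"])
    global_batch_size = per_worker_batch_size * num_workers

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    multi_worker_dataset = mnist_dataset(global_batch_size)

    with strategy.scope():
        multi_worker_model = build_and_compile_cnn_model()

    multi_worker_model.fit(multi_worker_dataset, epochs=30, steps_per_epoch=70)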

In your case, try decreasing the per-worker batch size first (the sketch above), since it's the easiest change to make. If that doesn't help, a smaller model (sketched below) or one of the other options is worth considering.
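For option 2, here is a hypothetical smaller replacement for build_and_compile_cnn_model(). The layer sizes are only an illustration (they are not the ones from the docs example); the point is simply fewer filters and a narrower dense layer, so less parameter and activation memory:

def build_and_compile_small_cnn_model():
    # A deliberately small CNN for 28x28 grayscale inputs (e.g. MNIST).
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28)),
        tf.keras.layers.Reshape((28, 28, 1)),
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model

Call it inside strategy.scope() in place of build_and_compile_cnn_model().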

For more information, you can refer to the Ray documentation on handling GPU out-of-memory failures.