I saw an example in the Ray docs and wanted to test it.
import json
import os

import tensorflow as tf


def train_func_distributed():
    per_worker_batch_size = 64
    # This environment variable will be set by Ray Train.
    tf_config = json.loads(os.environ['TF_CONFIG'])
    num_workers = len(tf_config['cluster']['worker'])

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    global_batch_size = per_worker_batch_size * num_workers
    multi_worker_dataset = mnist_dataset(global_batch_size)

    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_and_compile_cnn_model()

    multi_worker_model.fit(multi_worker_dataset, epochs=30, steps_per_epoch=70)
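(For completeness: mnist_dataset and build_and_compile_cnn_model are not shown above. They are the MNIST helpers from the TensorFlow multi-worker training tutorial that this Ray example builds on; the versions I use look roughly like this, so exact details may differ:)

import numpy as np
import tensorflow as tf


def mnist_dataset(batch_size):
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    # Scale pixel values to [0, 1] and cast labels to int64.
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    return (
        tf.data.Dataset.from_tensor_slices((x_train, y_train))
        .shuffle(60000)
        .repeat()
        .batch(batch_size)
    )


def build_and_compile_cnn_model():
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28)),
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=['accuracy'],
    )
    return model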
from ray.train.tensorflow import TensorflowTrainer
from ray.train import ScalingConfig

# For GPU training, set use_gpu to True.
use_gpu = True

trainer = TensorflowTrainer(
    train_func_distributed,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=use_gpu,
                                 resources_per_worker={"GPU": 0.1, "CPU": 1}),
)
trainer.fit()
I am running this on my laptop (an N552VW) with 1 GPU (4 GB) and 8 CPUs.
I always get this error:
(raylet) bash: no version information available (required by bash)
failed to allocate 3.06GiB (3286263296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.414099: failed to allocate 2.75GiB (2957636864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.419954: failed to allocate 2.48GiB (2661873152 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.427097: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.23GiB (2395685888 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.434241: failed to allocate 2.01GiB (2156117248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(RayTrainWorker pid=27318) Epoch 1/30
(RayTrainWorker pid=27318) 2023-10-23 13:56:18.077496: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8700
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.078657: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at [repeated 27x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see Configuring Logging — Ray 3.0.0.dev0 for more options.)
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.078874: I Created device /device:GPU:0 with 3482 MB memory: -> device: 0, name: NVIDIA GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0 [repeated 2x across cluster]
(RayTrainWorker pid=27319) 2023-10-23 13:56:09.097412: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:449] Started server with target: grpc://192.168.10.238:40809
(RayTrainWorker pid=27318) 2023-10-23 13:56:09.289532: I tensorflow/tsl/distributed_runtime/coordination/coordination_service_agent.cc:298] Coordination agent has successfully connected.
(RayTrainWorker pid=27319) 2023-10-23 13:56:13.679728: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 188160000 exceeds 10% of free system memory. [repeated 7x across cluster]
(RayTrainWorker pid=27319) 2023-10-23
0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_1078]
2023-10-23 13:56:19,409 ERROR tune.py:1139 -- Trials did not complete: [TensorflowTrainer_860b3_00000]
2023-10-23 13:56:19,412 INFO tune.py:1143 -- Total run time: 35.12 seconds (34.96 seconds for the tuning loop).
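From the log it looks like both Ray Train workers end up on the same 4 GB GPU (I asked for "GPU": 0.1 per worker), and each TensorFlow process seems to try to grab most of the card, which is presumably why the allocations fail. Would capping TensorFlow's GPU memory inside the training function be a reasonable approach? Something like this sketch is what I have in mind (the 1536 MB limit is just a guess for my card, not a value from the docs):

def train_func_distributed():
    import tensorflow as tf

    # Sketch only: give each worker's TF process a slice of the GPU so the
    # two workers can coexist on one 4 GB card. 1536 MB is an arbitrary guess.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.set_logical_device_configuration(
            gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=1536)]
        )

    # ... rest of the training function unchanged ...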
What should I do?