W tensorflow/core/data/root_dataset.cc:362] Optimization loop failed: CANCELLED: Operation was cancelled

Hi there all,
I am running an LSTM experiment with TensorFlow and Keras, and I use Ray to run the job in parallel. I tested my code on 4 grid points and it works fine, but when I run the same code over a 20 by 20 grid, processing chunks of size 2 in lat and lon, I frequently get the warning "W tensorflow/core/data/root_dataset.cc:362] Optimization loop failed: CANCELLED: Operation was cancelled". This is my first time using Ray, TensorFlow, and Keras, and I am having difficulty figuring out what these errors mean and how to resolve them.
Note: I only get this error when running my code with Ray.

Thanks for the help in advance.

Hi @jagdishisro, can you explain what the grids refer to? Is it a Keras concept?

From the message it looks like a TensorFlow-internal warning. You could try running the code without Ray and see if it still happens. If not, it would be good to share a minimal repro for this error so we can help better.
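For example, one way to isolate it is to keep the per-cell logic in a plain Python function and wrap it with ray.remote separately, so you can call the plain function directly for debugging. A minimal sketch (the function name, signature, and array shapes below are illustrative, not your actual code):

import numpy as np
import ray

def process_grid_cell_fn(lat_start, lat_end, lon_start, lon_end, sst_train, sst_test):
    # ... your existing per-cell LSTM training / prediction logic goes here ...
    return [(lat, lon) for lat in range(lat_start, lat_end) for lon in range(lon_start, lon_end)]

# Ray version used by process_chunks
process_grid_cell = ray.remote(process_grid_cell_fn)

# Debug run on a single 2x2 chunk, with no Ray involved
sst_train = np.random.rand(100, 20, 20)
sst_test = np.random.rand(20, 20, 20)
print(process_grid_cell_fn(0, 2, 0, 2, sst_train, sst_test))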

Here, grids refer to the locations in my data. I have a 3D array indexed as array[time, lat, lon]. I am getting this error only when I run the code with Ray, processing the array in chunks of lat=2 and lon=2. When I ran the code on array[:,4,4] with chunks of lat=2 and lon=2, it worked fine.
My code has the following structure:

import os
os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # disable Ray's memory monitor; must be set in the environment before ray.init

ray.init(ignore_reinit_error=True)

# ... data loading / preprocessing steps ...

def chunk_indices(array_length, chunk_size):
    return [range(i, min(i + chunk_size, array_length)) for i in range(0, array_length, chunk_size)]

def process_chunks(lat_chunks, lon_chunks, sst_train, sst_test, alpha, tau, K, DC, init, tol, look_back, lead, l, spi):
    futures = []
    for lat_chunk in lat_chunks:
        for lon_chunk in lon_chunks:
            futures.append(process_grid_cell.remote(
                lat_chunk[0], lat_chunk[-1] + 1,
                lon_chunk[0], lon_chunk[-1] + 1,
                sst_train, sst_test, alpha, tau, K, DC, init, tol, look_back, lead, l, spi
            ))
    results = ray.get(futures)
    return results
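I call process_chunks roughly like this, for the 20 by 20 grid with chunks of 2 (simplified, not my exact code; the other parameter values are set earlier):

lat_chunks = chunk_indices(20, 2)   # 10 chunks of 2 latitude indices each
lon_chunks = chunk_indices(20, 2)   # 10 chunks of 2 longitude indices each
results = process_chunks(lat_chunks, lon_chunks, sst_train, sst_test,
                         alpha, tau, K, DC, init, tol, look_back, lead, l, spi)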

Here is the output reproducing the error:
(process_grid_cell pid=187040) WARNING:tensorflow:5 out of the last 3655 calls to <function TensorFlowTrainer.make_predict_function.<locals>.one_step_on_data_distributed at 0x7f3204481940> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details. [repeated 3x across cluster]
(process_grid_cell pid=187034) 2024-08-29 07:04:38.792387: W tensorflow/core/data/root_dataset.cc:362] Optimization loop failed: CANCELLED: Operation was cancelled
(process_grid_cell pid=187075) 2024-08-29 07:07:09.375367: W tensorflow/core/data/root_dataset.cc:362] Optimization loop failed: CANCELLED: Operation was cancelled
(process_grid_cell pid=190332) 15 14
(process_grid_cell pid=190504) 13 14
(process_grid_cell pid=187082) 17 14
(process_grid_cell pid=186968) 15 10
(process_grid_cell pid=191814) 3 2
(process_grid_cell pid=187060) 17 12
(process_grid_cell pid=187044) 19 14
(process_grid_cell pid=191003) 9 16
(process_grid_cell pid=190477) 15 6
(process_grid_cell pid=187015) 19 16
(process_grid_cell pid=187080) 17 18
(process_grid_cell pid=191528) 7 0
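For reference, each process_grid_cell task trains a small Keras LSTM per (lat, lon) point in its chunk and prints the indices it is working on (the "15 14" style lines above). A simplified sketch of the task; the layer sizes and hyperparameters are illustrative, and I have omitted the other preprocessing parameters (alpha, tau, K, DC, init, tol, l, spi), so this is not my exact code:

import numpy as np
import ray
from tensorflow import keras

@ray.remote
def process_grid_cell(lat_start, lat_end, lon_start, lon_end,
                      sst_train, sst_test, look_back, lead):
    results = []
    for lat in range(lat_start, lat_end):
        for lon in range(lon_start, lon_end):
            print(lat, lon)  # the "15 14" style lines in the output above
            series = sst_train[:, lat, lon]
            # build supervised samples: look_back past steps -> value lead steps ahead
            X, y = [], []
            for t in range(len(series) - look_back - lead):
                X.append(series[t:t + look_back])
                y.append(series[t + look_back + lead])
            X = np.asarray(X)[..., None]
            y = np.asarray(y)
            model = keras.Sequential([
                keras.layers.Input(shape=(look_back, 1)),
                keras.layers.LSTM(16),
                keras.layers.Dense(1),
            ])
            model.compile(optimizer="adam", loss="mse")
            model.fit(X, y, epochs=5, batch_size=32, verbose=0)
            # calling model.predict on a freshly built model inside this nested loop
            # is the likely source of the retracing warning above
            results.append(((lat, lon), float(model.predict(X[-1:], verbose=0)[0, 0])))
    return results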