High: It blocks me from completing my task.
I am trying to train a TensorFlow model using Ray Train. Refer to the code below.
# Imports for the Ray 2.x AIR API used below
import ray
import tensorflow as tf
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.tensorflow import TensorflowTrainer, prepare_dataset_shard


def train_func(config):
    CONF, lstm_params = config.values()
    start_epoch = 0

    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = lstm_model()  # returns a Sequential model

        checkpoint = session.get_checkpoint()
        if checkpoint:
            # assume that we have run the session.report() example
            # and successfully saved some model weights
            checkpoint_dict = checkpoint.to_dict()
            model.set_weights(checkpoint_dict.get("model_weights"))
            start_epoch = checkpoint_dict.get("epoch", -1) + 1

        model.compile(loss=CONF['lstm_params']['loss'],
                      optimizer=CONF['lstm_params']['optimizer'],
                      metrics=list(CONF['lstm_params']['metrics']))

    dataset_shard = session.get_dataset_shard("train")

    def to_tf_dataset(dataset, batch_size):
        def to_tensor_iterator():
            for batch in dataset.iter_tf_batches(batch_size=batch_size, dtypes=tf.float32):
                yield batch["x"], batch["y"]

        output_signature = (
            tf.TensorSpec(shape=(None), dtype=tf.float32),
            tf.TensorSpec(shape=(None), dtype=tf.float32),
        )
        tf_dataset = tf.data.Dataset.from_generator(
            to_tensor_iterator, output_signature=output_signature
        )
        return prepare_dataset_shard(tf_dataset)

    for epoch in range(start_epoch, lstm_params['epochs']):
        tf_dataset = to_tf_dataset(dataset=dataset_shard, batch_size=lstm_params['batch_size'])
        model.fit(tf_dataset, shuffle=lstm_params['shuffle'])

        checkpoint = Checkpoint.from_dict(dict(epoch=epoch, model_weights=model.get_weights()))
        session.report({}, checkpoint=checkpoint)
config = {...}  # hyperparameters for the LSTM model
checkpoint_config = CheckpointConfig(checkpoint_score_attribute="accuracy",
                                     checkpoint_score_order="max")

X_train, y_train = read_input_data()  # returns the training data as numpy arrays
dataset = ray.data.from_items(
    [{"x": X_train[index, :, :], "y": y_train[index]} for index in range(y_train.shape[0])]
)

trainer = TensorflowTrainer(train_func,
                            train_loop_config=config,
                            scaling_config=ScalingConfig(num_workers=2),
                            run_config=RunConfig(checkpoint_config=checkpoint_config),
                            datasets={"train": dataset})
result = trainer.fit()
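For clarity, based on how train_func unpacks and indexes config, its structure is roughly as follows. The values shown here are placeholders for illustration only, not my actual hyperparameters (which are elided above):

# Hypothetical shape of `config`, inferred from how train_func uses it;
# every value below is a placeholder.
config = {
    "CONF": {
        "lstm_params": {
            "loss": "mse",            # placeholder
            "optimizer": "adam",      # placeholder
            "metrics": ["accuracy"],  # placeholder
        },
    },
    "lstm_params": {
        "epochs": 10,        # placeholder
        "batch_size": 32,    # placeholder
        "shuffle": True,     # placeholder
    },
}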
I am running it on Google Colab, where I have 2 CPUs available, and I am initializing the TensorflowTrainer with only 2 workers.
Still, the Ray trainer asks for 3 CPUs and throws the warning below when trainer.fit() is executed.
WARNING insufficient_resources_manager.py:128 -- Ignore this message if the cluster is autoscaling. You asked for 3.0 cpu and 0 gpu per trial, but the cluster only has 2.0 cpu and 0 gpu. Stop the tuning job and adjust the resources requested per trial (possibly via `resources_per_trial` or via `num_workers` for rllib) and/or add more resources to your Ray runtime.
It throws the above warning multiple times, and after a few retries the execution fails.
Can you please help me understand and resolve the issue here, and let me know if I am missing something?
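For reference, the resources that Ray actually sees on the Colab runtime can be checked with something like this (a minimal sketch; ray.init() may need ignore_reinit_error=True if a Ray instance is already running in the session):

import ray

# Connect to (or start) the local Ray runtime and print what it sees.
ray.init(ignore_reinit_error=True)
print(ray.cluster_resources())    # total resources, e.g. {'CPU': 2.0, ...}
print(ray.available_resources())  # resources not currently reserved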
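My guess is that the extra CPU is reserved for the Trainer's coordinator actor on top of 1 CPU per worker (2 workers + 1 coordinator = 3 CPUs), but I am not sure about that. If that is the case, would something like the sketch below be the right way to fit the job into 2 CPUs? This is only a guess at the API usage, not something I have verified:

# Sketch: free the coordinator's CPU reservation so only the 2 workers
# need CPUs. trainer_resources and resources_per_worker are ScalingConfig
# fields; whether {"CPU": 0} is appropriate here is what I am asking about.
scaling_config = ScalingConfig(
    num_workers=2,
    trainer_resources={"CPU": 0},
    resources_per_worker={"CPU": 1},
)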