Ray actors with TF logical GPUs

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to combine TensorFlow's logical GPU partitioning with Ray Core actors for parallel GPU computation on a single GPU with 32 GB of VRAM:

import math

import ray
import tensorflow as tf

# client_model, client_data, and NUM_CLIENTS are defined elsewhere in my project.

PARALLEL_CALLS = 10


@ray.remote(num_gpus=1 / PARALLEL_CALLS)
class AsyncClients:
    def __init__(self, cid):
        # Split the single physical GPU into PARALLEL_CALLS logical devices
        # of 512 MB each, one per actor.
        gpus = tf.config.list_physical_devices("GPU")
        if gpus:
            try:
                tf.config.set_logical_device_configuration(
                    gpus[0],
                    [
                        tf.config.LogicalDeviceConfiguration(memory_limit=512)
                        for _ in range(PARALLEL_CALLS)
                    ],
                )
            except RuntimeError as e:
                # Virtual devices must be set before GPUs have been initialized
                print(e)
        logical_gpus = tf.config.list_logical_devices("GPU")

        self.cid = cid
        self.device = logical_gpus[self.cid].name
        self.model = client_model(self.cid)
        x_train, y_train = client_data(self.cid, NUM_CLIENTS)
        split_idx = math.floor(len(x_train) * 0.9)  # use 10% of x_train for validation
        self.x_train, self.y_train = x_train[:split_idx], y_train[:split_idx]
        self.x_val, self.y_val = x_train[split_idx:], y_train[split_idx:]

    def get_parameters(self, config):
        return self.model.get_weights()

    def fit_and_evaluate(self, parameters, config):
        with tf.device(self.device):
            # train on this actor's logical GPU slice
            self.model.set_weights(parameters)
            self.model.fit(
                self.x_train, self.y_train, epochs=config["epochs"], verbose=2
            )
            # evaluate on the held-out 10% split
            loss, acc = self.model.evaluate(self.x_val, self.y_val, verbose=2)
        return loss, acc

Here I split the entire GPU memory across the Ray actors, but unfortunately I ran into memory-related errors.
Could you kindly help me resolve this issue so that the training processes can run in parallel on a single GPU?
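For completeness, the actors are created and trained roughly like this (a simplified sketch rather than my exact driver code; the epoch count is just a placeholder):

import ray

ray.init()

# One actor per GPU slice, all training in parallel on the same physical GPU.
clients = [AsyncClients.remote(cid) for cid in range(PARALLEL_CALLS)]
initial_weights = ray.get(clients[0].get_parameters.remote({}))
results = ray.get(
    [c.fit_and_evaluate.remote(initial_weights, {"epochs": 1}) for c in clients]
)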
Thanks a lot! 🙂

@MakGulati mind sharing the exact error you are getting? Looking at the code, it seems that:

  • the __init__ code will be called PARALLEL_CALLS times on the same GPU, so every actor tries to partition the whole card (see the sketch after this list);
  • it’s also possible that virtual devices are not well supported by Ray;
  • or you are simply running out of GPU memory.
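For the first point, a minimal, untested sketch of what I would try (LimitedClient is just a placeholder name; the model and data setup are omitted): give each actor process a single 512 MB logical device, or enable memory growth, instead of partitioning the card into PARALLEL_CALLS virtual devices inside every actor.

import ray
import tensorflow as tf

PARALLEL_CALLS = 10


@ray.remote(num_gpus=1 / PARALLEL_CALLS)
class LimitedClient:
    def __init__(self, cid):
        self.cid = cid
        gpus = tf.config.list_physical_devices("GPU")
        if gpus:
            try:
                # Cap this actor's TensorFlow process at 512 MB on the shared GPU.
                tf.config.set_logical_device_configuration(
                    gpus[0],
                    [tf.config.LogicalDeviceConfiguration(memory_limit=512)],
                )
                # Alternative: tf.config.experimental.set_memory_growth(gpus[0], True)
            except RuntimeError as e:
                # Must run before TensorFlow initializes the GPU in this process.
                print(e)
        # Each actor now sees exactly one memory-capped logical device.
        self.device = tf.config.list_logical_devices("GPU")[0].name

Since each Ray actor runs in its own process with its own TensorFlow runtime, it is this per-process memory cap that keeps the ten actors inside the 32 GB card; the fractional num_gpus value is only scheduling bookkeeping and does not limit GPU memory on its own.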

Thanks for reporting the issue, @MakGulati!

I’m going to mark this as resolved since @Chen_Shen provided directions for troubleshooting. Once you have a chance to try these, we can reopen the discussion.