Why does increasing the number of parallel GPU tasks make it faster?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, I am wondering what the actual difference is when using different values of num_gpus. If I set num_gpus=0.1, there will be 10 processes on the GPU (observed in nvidia-smi), but according to Running more than one CUDA applications on one GPU - Stack Overflow, different processes cannot run their kernels in parallel.

I use the following function with num_gpus=0.1 and num_gpus=1, and the former takes less time: 2.5 s vs. 8 s. How can there be a performance improvement if the functions cannot run in parallel on the GPU? My guess is that with num_gpus=0.1 the CUDA context creation time is overlapped across tasks. What do you think?

@ray.remote(num_gpus=0.1)
def f(datatype):
    # Do some work
    x = np.random.random(1000000,datatype)
    for i in range(1):
        x += 1

    return x

All CPU and GPU numbers are logical resources. If you set num_gpus=0.1, Ray will run up to 10 of these tasks concurrently, each in its own worker process.

Your code doesn’t use the GPU, so the tasks will run in parallel.
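
To illustrate what "logical" means: num_gpus is only used for scheduling and for setting CUDA_VISIBLE_DEVICES in the worker; Ray does not partition the GPU's memory or compute. A minimal sketch (the environment-variable check is just one way to see this):

import os
import ray

ray.init(num_gpus=1)  # declare one logical GPU to Ray

@ray.remote(num_gpus=0.1)
def which_gpu():
    # Ray assigns GPUs by setting CUDA_VISIBLE_DEVICES for the worker;
    # it does not enforce any memory or compute limit on the device itself
    return os.environ.get("CUDA_VISIBLE_DEVICES")

# up to 10 of these tasks can be scheduled at once, all on the same physical GPU
print(ray.get([which_gpu.remote() for _ in range(10)]))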

Sorry, I missed the line import cupy as np here; it is actually x = cupy.random.random(1000000, datatype), so this code does use the GPU (confirmed with Nsight Compute).
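
For completeness, the corrected script I am timing looks roughly like this (the count of 10 tasks and the use of time.perf_counter are approximations of my actual script, not copied from it):

import time
import ray
import cupy as np  # the import I had omitted above

ray.init()

@ray.remote(num_gpus=0.1)  # change to num_gpus=1 for the slow case
def f(datatype):
    # allocate a 1M-element array on the GPU and do a trivial update
    x = np.random.random(1000000, dtype=datatype)
    for i in range(1):
        x += 1
    return x

start = time.perf_counter()
ray.get([f.remote(np.float32) for _ in range(10)])  # 10 tasks in flight
print(f"elapsed: {time.perf_counter() - start:.2f} s")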

I know they are logical resources. My thought is that when I set num_gpus=0.1, ten CUDA contexts are created immediately, whereas with num_gpus=1 each task only starts creating its CUDA context after the previous task has finished. Is that the reason for the performance difference?
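
One way to check this guess (a sketch; the np.zeros(1) warm-up to force context creation and the split timing are my own additions) is to measure context-creation time and compute time separately inside each task, under both num_gpus settings:

import time
import ray
import cupy as np

ray.init()

@ray.remote(num_gpus=1)  # compare against num_gpus=0.1
def f(datatype):
    # the first CUDA call in the process triggers context creation
    ctx_start = time.perf_counter()
    np.zeros(1)
    np.cuda.Device().synchronize()
    ctx_time = time.perf_counter() - ctx_start

    # the actual work, timed separately
    work_start = time.perf_counter()
    x = np.random.random(1000000, dtype=datatype)
    x += 1
    np.cuda.Device().synchronize()
    work_time = time.perf_counter() - work_start
    return ctx_time, work_time

print(ray.get([f.remote(np.float32) for _ in range(10)]))

If the per-task context-creation time dominates and the compute time is small, that would support the overlap explanation.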