Is the `--exclusive` option necessary when deploying on SLURM?

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a simple Python script that runs a computation on millions of images and saves the results to disk. I have access to a cluster that uses SLURM. The problem is that if I ask for whole nodes, the wait time is significant, but if I ask for only 4 CPUs per node, I can get 10 nodes (4 CPUs each) almost instantly. I’m not an expert in SLURM, but I assume the `--exclusive` flag reserves the whole node (?). I wonder if I can get away without it (i.e., using a fraction of each node).

Thank you so much!

@tupui might have advice here.
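
Not a definitive answer, but for the "4 CPUs per node" case described above, one thing that may help is telling Ray explicitly how many CPUs it may use on each node instead of letting it detect them. The sketch below builds the `ray start` command from `SLURM_CPUS_PER_TASK`, which SLURM sets when `--cpus-per-task` is requested. The helper name `ray_start_command`, the fallback of 1 CPU, the default port, and the example host name are assumptions made for illustration, not something taken from the Ray docs or from the original script.

```python
import os
import shlex


def ray_start_command(env=None, head_address=None, port=6379):
    """Build a `ray start` command limited to the CPUs SLURM allocated.

    With head_address=None this builds the head-node command; otherwise it
    builds the worker command that joins the head at `head_address`
    (a hypothetical "host:port" string).
    """
    env = os.environ if env is None else env
    # SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested.
    # Falling back to 1 CPU is an assumption made for this sketch.
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))

    cmd = ["ray", "start"]
    if head_address is None:
        cmd += ["--head", f"--port={port}"]
    else:
        cmd += [f"--address={head_address}"]
    # --num-cpus caps what Ray schedules on this node; --block keeps the
    # process in the foreground so the SLURM step stays alive.
    cmd += [f"--num-cpus={cpus}", "--block"]
    return shlex.join(cmd)


if __name__ == "__main__":
    # Inside an sbatch job this prints the command for the head node.
    print(ray_start_command())
```

Inside a batch script, the printed command would typically be launched with `srun` on each node. Whether dropping `--exclusive` is safe on a given cluster still depends on how that cluster enforces CPU and memory limits for shared nodes, so treat this as a starting point only.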