Approaches to managing multi-user ray clusters

Severity

  • None

Hello Ray community. I have been using Ray mostly for parallelizing linear algebra code on small clusters (a handful of VMs).

I am now exploring how to make ray available to the rest of my research team, and am wondering how other folks have approached this. Currently we manage several other ‘Job’ based cluster compute services like AWS Batch, Slurm, K8s either for production services or to allow researchers to schedule work, with various amounts of support for Fair scheduling, priority, preemption, etc… I am wondering what patterns do people use to make Ray available to a team of users (multi-tenant) with considerations for limiting the resources each user has, scheduling with priority (super long training runs use resources when they are available, but short term jobs can take resources), one persons job doesn’t balloon and halt the cluster, etc…?

Seems like there are quite a few layers at which one could try to implement these things like:

  • A single ray cluster where users interact via client, using mechanisms like placement groups / resources
  • A single ray cluster where users interact via Jobs API
  • A ray cluster where users interact with a 3rd party scheduler that then submits their work to an exisiting ray cluster
  • KubeRay Jobs or MCAD, where resource gating / scheduling happens at the K8s level and each user has their own cluster spun up
  • Ray on Slurm / Batch / etc - Each user schedules a job to another batch / scheduler which allocates resources for the entire cluster to them. User manages initializing Ray, but is the sole tenant of the ray cluster.

Which patterns are people generally using for managing shared access to Ray resources for a team?

1 Like

Hey @jacksonwb, thanks a bunch for posting here!

You’re right that there’s a variety of different options here. Generally I would say:

  1. Single Ray cluster via jobs API would be sufficient if you just wanted jobs to run at some point. That being said, it seems like you probably have more sophisticated needs here.
  2. Generally speaking, I would personally be a bit hesitant to use 1 single Ray cluster across multiple tenants running applications at the same time - most libraries and integrations that I’ve worked on tend to assume single tenancy
  3. If you’re on the cloud, one thing that we do internally is we let everyone just launch their own clusters as needed (using previously the cluster launcher and now Anyscale). You wouldn’t need to deal with prioritization this way.
  4. If you’re on prem, something like Kubernetes Jobs or Slurm (an external scheduler that does resource gating and prioritization) would be best if you need that sort of more advanced scheduling.

Hope that helps! Can I ask if you’re on prem or on the cloud?

1 Like

I’m not super experienced on the cloud and I’m relatively new to ray however I’ll share my experiences on how I’ve used ray on a fairly large (university facilities) cluster.

We use slurm as mentioned by yourself and @rliaw . Each user launches an instance of ray for a job and can allocate resources. It will then schedule the jobs so that the highest priority jobs that can use the available resources will run.

To tackle your concern of one job “balooning” and blocking other jobs, you can set a maximum time limit for a job after which it will be aborted. Combining this with checkpointing which some aspects (if not all) of ray supports will let you pick up the longer job where you left off.

For example, if you have a job that will take 2 days and requires the full resources of the cluster you can set it to run 12 times with a time limit of 4 hours per task. During the first 4 hours, if other jobs are queued, they will be scheduled to run after that first iteration and when those finish the longer job will keep running. If there are no other jobs queued then it will just continue straight away.

Again, as with most things on the internet, take this with a grain of salt but hopefully I helped to shed some light on the issue :slight_smile:

Thanks @rliaw and @theo.

I think it makes the most sense to have users schedule resources and instantiate a ray cluster with another scheduler, with the intention that each user has sole tenancy on their ray cluster.

Currently we are on the cloud, but very possible we will have some on prem resources as well in the future.

A few questions:

  • Given that most workloads expect single tenancy what is the typical use case of the Jobs API? Is it simply to allow a user to ‘submit’ a job to their cluster, instead of having a continuously active driver script?
  • What checkpointing capability does Ray provide? @theo Are you referring to the AIR Checkpoint object?

To be honest I don’t know the full extent of checkpointing in ray, perhaps someone more qualified can provide more information. I do know that ray tune can provide checkpointing to restore a run.