Can Ray Dataset facilitate training on heterogeneous clusters?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hi,

To avoid a CPU bottleneck consuming GPU budget, a common idea is to offload data loading to a fleet of CPU instances and send clean batches to the fleet of GPU instances. This is, for example, documented here:

TF has the TF Data Service for this. PyTorch doesn’t seem to have something built-in yet (though it’s feasible manually: the SageMaker PT sample does it by creating a custom Dataset class that pulls records from a gRPC service).

Does Ray simplify this pattern? Using Ray (Core, Train, or Data), how would one run data-parallel training on a fleet of GPU instances, and dataset reading + batching on a fleet of CPU instances?

Hi @Lacruche, thanks for your question. This is indeed a pattern that’s becoming more common, and Ray excels at supporting this architecture.

At a high level, this is how it could work (a rough sketch follows the list):

  • Start a Ray cluster with heterogeneous resources: a fleet of CPU nodes and a fleet of GPU nodes
  • Use Ray Data for large-scale data ingest on CPU nodes, and Ray Train for distributed training on GPU nodes. In particular:
    – Data is exchanged between ingest and training via Ray’s distributed memory, without having to be materialized to an external disk-based storage system, which can yield significantly better performance.
    – The ingest + training can be written in a single unified Python program and run on a single system (i.e. a Ray cluster), as opposed to writing multiple programs and gluing them together across multiple systems.
  • For using Ray Data for scalable ingest: Example: Large-scale ML Ingest — Ray 2.1.0
  • For using Ray Train for scalable training: Ray Train: Scalable Model Training — Ray 3.0.0.dev0
  • Also notably, Ray is framework-agnostic, so the above works for both TF and PyTorch.
  • Finally, there are real success stories from industry-leading companies you can take a look at, e.g. Uber’s heterogeneous Ray cluster from Ray Summit 2022: RaySummit'22--Large Scale Deep Learning Training and Tuning with Ray at Uber.pdf - Google Drive
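
Here is a rough, hedged sketch of that flow, assuming the Ray 2.x AIR APIs (ray.data, TorchTrainer, ScalingConfig, session.get_dataset_shard); the S3 path, preprocessing, and training-loop body are placeholders rather than a definitive implementation:

```python
import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

ray.init(address="auto")  # connect to the heterogeneous cluster

# 1) Ingest + preprocessing run as CPU tasks, mostly on the CPU fleet.
train_ds = ray.data.read_parquet("s3://my-bucket/train/")  # placeholder path
train_ds = train_ds.map_batches(lambda batch: batch)       # your preprocessing here

# 2) Data-parallel training runs on the GPU fleet; each worker streams
#    its shard of the dataset out of Ray's distributed object store.
def train_loop_per_worker(config):
    shard = session.get_dataset_shard("train")
    for batch in shard.iter_batches(batch_size=256):
        pass  # forward/backward pass with your model goes here

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": train_ds},
)
trainer.fit()
```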

If you want to dive into more details, here is an end-to-end example which is running as a release test for Ray:


Exciting, thanks! I’ll take a look.

Hi,

Can this be achieved with Ray AIR?

I’ve reviewed the docs and related source code:

And it’s unclear to me how to configure/force the Dataset preprocessing to run on a CPU worker, and the Trainer to run on a GPU worker.

Please note that in the cluster config example provided above, the GPU node is of type i3.8xlarge, which has no GPU:

[ I’m a new user, and thus not allowed to add more links :\ ]

Thanks in advance,
Harel


TL;DR: You can try assigning a tiny num_cpus to your trainer workers in AIR to reduce the number of data tasks scheduled on them.

Longer response:

We’ve been pushing for a “node tag” / “resource tag” concept in Ray scheduling, but sadly it’s not a priority yet. What I’ve found most productive is simply reducing the number of CPUs assigned to your GPU hosts.

Behind the scenes, a Ray “resource” is simply a logical tag for the Ray scheduler, not hardware- or cgroup-level isolation. Thus giving num_cpus=0 to the head node disables scheduling on it, and reducing the number of CPUs allocated to the GPU hosts reduces their CPU task responsibility.
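
To illustrate that these resources are purely logical scheduling hints, here is a toy, single-node sketch (the task bodies and numbers are placeholders):

```python
import ray

# Logical resources only: Ray will schedule at most 2 concurrent
# num_cpus=1 tasks on this node, regardless of its physical core count.
ray.init(num_cpus=2)

@ray.remote(num_cpus=1)
def cpu_task(i):
    return i

# A num_cpus=0 task is not counted against the CPU budget, so it can
# still be scheduled on a node that advertises 0 logical CPUs.
@ray.remote(num_cpus=0)
def zero_cpu_task(i):
    return i

print(ray.get([cpu_task.remote(i) for i in range(4)]))
print(ray.get([zero_cpu_task.remote(i) for i in range(4)]))
```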

Internally, since Datasets has a task-based implementation, it starts by determining parallelism based on the number of CPUs available and assigns read + preprocess tasks accordingly; those tasks produce dataset blocks as the unit of data.
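
A toy illustration of that relationship, assuming the Ray 2.x Datasets API (the numbers are arbitrary):

```python
import ray

# The parallelism argument controls how many read tasks -- and therefore
# how many blocks -- are created for this dataset.
ds = ray.data.range(10_000, parallelism=16)
print(ds.num_blocks())  # 16

# Preprocessing such as map_batches also runs as CPU tasks over those blocks.
ds = ds.map_batches(lambda batch: batch)
```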

Meanwhile, Ray AIR’s trainer doesn’t use many logical CPUs either (I think the trainer only uses 1), so assigning fewer CPUs to your GPU hosts ~= fewer blocks and data tasks on the GPU hosts ~= data processing tasks shifted more onto the heterogeneous CPU hosts.
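
In code, the TL;DR might look like this hedged sketch (the worker count is hypothetical; otherwise the same trainer setup as above applies):

```python
from ray.air.config import ScalingConfig

# Pin each GPU trainer worker to a single logical CPU (plus its GPU) so
# that most of the cluster's logical CPUs -- and therefore Ray Data's
# read/preprocess tasks -- stay on the CPU-only nodes.
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
    resources_per_worker={"CPU": 1, "GPU": 1},
)
```

The other lever is starting the GPU nodes themselves with a smaller logical CPU count (i.e. a small num-cpus value when starting Ray on those nodes).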


Thank you very much @Jiao_Dong. We’ll play with it and see how the resources are utilized.

Regarding -

We’ve been pushing for a “node tag” / “resource tag” concept in Ray scheduling but sadly it’s not priority yet

You mean internally or publicly?

Internally :slight_smile: As you can see, the trick above likely suffices for most cases, even though it’s not intuitive or well documented. That may or may not lead to an API change soon, but we’ll see.
