How severe does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello,
To avoid a CPU bottleneck eating into the GPU budget, a common approach is to offload data loading to a fleet of CPU instances and send clean batches to the fleet of GPU instances. This is documented, for example, here:
TF has the TF Data Service for this. PyTorch doesn't seem to have anything built-in yet (it is feasible manually though: the SageMaker PT sample does it by creating a custom Dataset class that pulls records from a gRPC service).
Does Ray simplify this pattern? Using Ray (Core, Train, or Data), how would one run data-parallel training on a fleet of GPU instances, and dataset reading + batching on a fleet of CPU instances?
Hi @Lacruche, thanks for your question. This is indeed a pattern that's becoming increasingly popular, and Ray excels at supporting this architecture.
At a high level, this is how it could work:
Start a Ray cluster with heterogeneous resources: a fleet of CPU nodes and a fleet of GPU nodes
Use Ray Data for large-scale data ingest on CPU nodes, and Ray Train for distributed training on GPU nodes. In particular:
– The data is exchanged/transferred from ingest to training via Ray's distributed memory, without having to be materialized onto an external disk-based storage system, so it can achieve significantly better performance.
– The ingest + training can be written in a single unified Python program and run on a single system (i.e. a Ray cluster), as opposed to writing multiple programs and gluing them together across multiple systems. A sketch of this pattern follows below.
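Here is a minimal sketch of what that could look like with the Ray AIR-era APIs (TorchTrainer, ScalingConfig, session); the S3 path, worker count, and training-loop body are placeholders rather than anything from this thread:

```python
import ray
from ray.air import ScalingConfig, session
from ray.train.torch import TorchTrainer

ray.init(address="auto")  # connect to the existing heterogeneous cluster

# Reading + preprocessing run as Ray Data tasks, which the scheduler places on
# nodes with free logical CPUs -- i.e., mostly the CPU fleet.
ds = ray.data.read_parquet("s3://my-bucket/training-data/")  # placeholder path


def train_loop_per_worker(config):
    # Each GPU worker pulls its shard of preprocessed blocks straight out of
    # Ray's distributed object store; nothing is written to external storage.
    shard = session.get_dataset_shard("train")
    for _ in range(config["num_epochs"]):
        for batch in shard.iter_torch_batches(batch_size=config["batch_size"]):
            ...  # forward/backward pass on the GPU goes here


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 2, "batch_size": 256},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ds},
)
result = trainer.fit()
```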
TL;DR: You can try assigning a tiny num_cpus to your trainer workers in AIR to reduce the number of data tasks scheduled on them.
Longer response:
We’ve been pushing for a “node tag” / “resource tag” concept in Ray scheduling, but sadly it’s not a priority yet. What I’ve found most productive is simply reducing the number of CPUs assigned to your GPU hosts.
Behind the scenes, Ray’s “resource” is just a logical tag for the Ray scheduler, not hardware- or cgroup-level isolation. Thus, giving num_cpus=0 to the head node disables task scheduling on it, and reducing the number of CPUs allocated to the GPU hosts reduces their share of CPU tasks.
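On a manually started cluster, that could look like the following (the address and resource counts are just placeholders):

```bash
# Head node: advertise zero logical CPUs so no data tasks land here.
ray start --head --num-cpus=0

# GPU nodes: advertise only a couple of logical CPUs alongside the GPUs, so
# most read/preprocess tasks get scheduled onto the CPU fleet instead.
ray start --address='<head-ip>:6379' --num-cpus=2 --num-gpus=8

# CPU nodes: advertise their full CPU count (the default), e.g.:
ray start --address='<head-ip>:6379' --num-cpus=64
```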
Internally, since Dataset is a task-based implementation, it starts by determining parallelism based on the number of available CPUs and assigns read + preprocess tasks accordingly; those tasks produce dataset blocks as the unit of data.
Meanwhile, Ray AIR’s trainer doesn’t use many logical CPUs either (I think the trainer only uses 1), so assigning fewer CPUs on your GPU hosts ≈ fewer blocks and data tasks on the GPU hosts ≈ more data-processing tasks shifted onto the heterogeneous CPU hosts.
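Tying this back to the TL;DR, a sketch of shrinking the per-worker CPU reservation might look like this (the exact numbers are assumptions you would tune for your cluster):

```python
from ray.air import ScalingConfig

# Reserve only a sliver of logical CPU per GPU trainer worker, so most of the
# GPU hosts' advertised CPUs stay free -- which, combined with the reduced
# --num-cpus above, pushes Ray Data's read/preprocess tasks onto the CPU fleet.
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
    resources_per_worker={"CPU": 0.5, "GPU": 1},
)
```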
As you can see, the trick above likely suffices for most cases, even though it’s not intuitive or well documented. It may or may not lead to an API change soon, but we’ll see.