How severe does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello,
To avoid a CPU bottleneck eating into the GPU budget, a common approach is to offload data loading to a fleet of CPU instances and send clean batches to the fleet of GPU instances. This is documented, for example, here:
TF has the TF Data Service for this. PyTorch doesn't seem to have anything built-in yet (it is feasible manually though: the SageMaker PT sample does it by creating a custom Dataset class that pulls records from a gRPC service).
Does Ray simplify this pattern? Using Ray (Core, Train, or Data), how would one run data-parallel training on a fleet of GPU instances, and dataset reading + batching on a fleet of CPU instances?
Hi @Lacruche, thanks for your question. This is indeed a pattern that's becoming increasingly popular, and Ray excels at supporting this architecture.
At a high level, this is how it could work:
Start a Ray cluster with heterogeneous resources: a fleet of CPU nodes and a fleet of GPU nodes
Use Ray Data for large-scale data ingest on CPU nodes, and Ray Train for distributed training on GPU nodes. In particular:
– The data is exchanged/transferred from ingest to training via Ray's distributed memory, without having to be materialized onto an external disk-based storage system, so it can achieve significantly better performance.
– The ingest + training can be written in a single unified Python program and run on a single system (i.e. a Ray cluster), as opposed to writing multiple programs and gluing them together across multiple systems. A sketch of this pattern follows below.
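Here is a minimal sketch of what that could look like with the Ray AIR-era APIs (TorchTrainer, ScalingConfig, session); the S3 path, worker count, and training-loop body are placeholders rather than anything from this thread:

```python
import ray
from ray.air import ScalingConfig, session
from ray.train.torch import TorchTrainer

ray.init(address="auto")  # connect to the existing heterogeneous cluster

# Reading + preprocessing run as Ray Data tasks, which the scheduler places on
# nodes with free logical CPUs -- i.e., mostly the CPU fleet.
ds = ray.data.read_parquet("s3://my-bucket/training-data/")  # placeholder path


def train_loop_per_worker(config):
    # Each GPU worker pulls its shard of preprocessed blocks straight out of
    # Ray's distributed object store; nothing is written to external storage.
    shard = session.get_dataset_shard("train")
    for _ in range(config["num_epochs"]):
        for batch in shard.iter_torch_batches(batch_size=config["batch_size"]):
            ...  # forward/backward pass on the GPU goes here


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 2, "batch_size": 256},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ds},
)
result = trainer.fit()
```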
TL;DR: You can try assigning a tiny num_cpus to your trainer workers in AIR to reduce the number of data tasks scheduled on them.
Longer response:
We’ve been pushing for a “node tag” / “resource tag” concept in Ray scheduling, but sadly it’s not a priority yet. What I’ve found most productive is simply reducing the number of CPUs assigned to your GPU hosts.
Behind the scenes, Ray’s “resource” is just a logical tag for the Ray scheduler, not hardware- or cgroup-level isolation. Thus, giving num_cpus=0 to the head node disables task scheduling on it, and reducing the number of CPUs allocated to the GPU hosts reduces their share of CPU tasks.
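On a manually started cluster, that could look like the following (the address and resource counts are just placeholders):

```bash
# Head node: advertise zero logical CPUs so no data tasks land here.
ray start --head --num-cpus=0

# GPU nodes: advertise only a couple of logical CPUs alongside the GPUs, so
# most read/preprocess tasks get scheduled onto the CPU fleet instead.
ray start --address='<head-ip>:6379' --num-cpus=2 --num-gpus=8

# CPU nodes: advertise their full CPU count (the default), e.g.:
ray start --address='<head-ip>:6379' --num-cpus=64
```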
Internally, since Dataset is a task-based implementation, it starts by determining parallelism based on the number of available CPUs and assigns read + preprocess tasks accordingly; those tasks produce dataset blocks as the unit of data.
Meanwhile, Ray AIR’s trainer doesn’t use many logical CPUs either (I think the trainer only uses 1), so assigning fewer CPUs on your GPU hosts ≈ fewer blocks and data tasks on the GPU hosts ≈ more data-processing tasks shifted onto the heterogeneous CPU hosts.
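Tying this back to the TL;DR, a sketch of shrinking the per-worker CPU reservation might look like this (the exact numbers are assumptions you would tune for your cluster):

```python
from ray.air import ScalingConfig

# Reserve only a sliver of logical CPU per GPU trainer worker, so most of the
# GPU hosts' advertised CPUs stay free -- which, combined with the reduced
# --num-cpus above, pushes Ray Data's read/preprocess tasks onto the CPU fleet.
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
    resources_per_worker={"CPU": 0.5, "GPU": 1},
)
```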
As you can see, the trick above likely suffices for most cases, even though it’s not intuitive or well documented. It may or may not lead to an API change soon, but we’ll see.