Dataset support concurrency in one block when using map_batches

Basasuya · September 5, 2022, 12:09pm

Hi, I have been using Ray Datasets for large-scale inference.
I find that the number of blocks is the maximum parallelism of map_batches, even if we assume resources are unlimited.

The batch_size parameter in map_batches only determines the batch size processed inside one BlockWorker (the transform within a block runs serially), which means tuning batch_size cannot speed up the code.

I would like the maximum parallelism of map_batches to be determined by batch_size, so that each batch can be processed concurrently. Is this reasonable?
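
To make the observation concrete, here is a minimal toy model in plain Python. It is not Ray's implementation: `toy_map_batches`, `process_block`, and `ConcurrencyTracker` are hypothetical names invented for this sketch, and a thread pool stands in for Ray tasks. It assumes one task per block with batches handled serially inside that task, which is the behavior described above.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(block, batch_size):
    """Yield consecutive slices of `block`, each with at most `batch_size` rows."""
    for start in range(0, len(block), batch_size):
        yield block[start:start + batch_size]


class ConcurrencyTracker:
    """Records the highest number of batches being processed at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def process_block(block, batch_size, fn, tracker):
    """One task per block: batches inside the block run one after another."""
    out = []
    for batch in split_into_batches(block, batch_size):
        with tracker:
            time.sleep(0.01)  # stand-in for real work
            out.extend(fn(batch))
    return out


def toy_map_batches(blocks, fn, batch_size):
    """Run one task per block; return (output blocks, peak batch concurrency)."""
    tracker = ConcurrencyTracker()
    with ThreadPoolExecutor(max_workers=64) as pool:
        futures = [pool.submit(process_block, b, batch_size, fn, tracker) for b in blocks]
        results = [f.result() for f in futures]
    return results, tracker.peak


def double(batch):
    return [x * 2 for x in batch]


blocks = [list(range(0, 100)), list(range(100, 200))]  # 2 blocks of 100 rows
_, peak_small = toy_map_batches(blocks, double, batch_size=10)
_, peak_large = toy_map_batches(blocks, double, batch_size=50)
print(peak_small, peak_large)  # neither value exceeds the number of blocks (2)
```

In this model, changing batch_size changes how many batches each task loops over, but the peak concurrency stays bounded by the number of blocks even though the pool has 64 workers.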

reference from:

github.com/ray-project/ray

python/ray/data/dataset.py

master


      
              **ray_remote_args,
          ) -> "Dataset":
              """Apply the given function to batches of data.
          
              This method is useful for preprocessing data and performing inference. To learn
              more, see :ref:`Transforming batches <transforming_batches>`.
          
              You can use either a function or a callable class to perform the transformation.
              For functions, Ray Data uses stateless Ray tasks. For classes, Ray Data uses
              stateful Ray actors. For more information, see
              :ref:`Stateful Transforms <stateful_transforms>`.
          
              .. tip::
                  To understand the format of the input to ``fn``, call :meth:`~Dataset.take_batch`
                  on the dataset to get a batch in the same format as will be passed to ``fn``.
          
              .. tip::
                  If ``fn`` doesn't mutate its input, set ``zero_copy_batch=True`` to improve
                  performance and decrease memory utilization.
          
              .. warning::

github.com/ray-project/ray

python/ray/data/_internal/compute.py

master


      
          import logging
          from typing import Any, Callable, Iterable, Optional, TypeVar, Union
          
          from ray.data._internal.execution.interfaces import TaskContext
          from ray.data.block import Block, UserDefinedFunction
          from ray.util.annotations import DeveloperAPI
          
          logger = logging.getLogger(__name__)
          
          T = TypeVar("T")
          U = TypeVar("U")
          
          
          # Block transform function applied by task and actor pools.
          BlockTransform = Union[
              # TODO(Clark): Once Ray only supports Python 3.8+, use protocol to constrain block
              # transform type.
              # Callable[[Block, ...], Iterable[Block]]
              # Callable[[Block, UserDefinedFunction, ...], Iterable[Block]],
              Callable[[Iterable[Block], TaskContext], Iterable[Block]],
              Callable[[Iterable[Block], TaskContext, UserDefinedFunction], Iterable[Block]],
              Callable[..., Iterable[Block]],
          ]
          
          
          @DeveloperAPI
          class ComputeStrategy:
              pass
          
          
          @DeveloperAPI
          class TaskPoolStrategy(ComputeStrategy):
              def __init__(
                  self,
                  size: Optional[int] = None,
              ):
                  """Construct TaskPoolStrategy for a Dataset transform.
          
                  Args:
                      size: Specify the maximum size of the task pool.
                  """
          
                  if size is not None and size < 1:
                      raise ValueError("`size` must be >= 1", size)
                  self.size = size
          
              def __eq__(self, other: Any) -> bool:
                  return (isinstance(other, TaskPoolStrategy) and self.size == other.size) or (
                      other == "tasks" and self.size is None
                  )
          
          
          class ActorPoolStrategy(ComputeStrategy):
              """Specify the compute strategy for a Dataset transform.
          
              ActorPoolStrategy specifies that an autoscaling pool of actors should be used
              for a given Dataset transform. This is useful for stateful setup of callable
              classes.
          
              For a fixed-sized pool of size ``n``, specify ``compute=ActorPoolStrategy(size=n)``.
              To autoscale from ``m`` to ``n`` actors, specify
              ``ActorPoolStrategy(min_size=m, max_size=n)``.
          
              To increase opportunities for pipelining task dependency prefetching with
              computation and avoiding actor startup delays, set max_tasks_in_flight_per_actor
              to 2 or greater; to try to decrease the delay due to queueing of tasks on the worker
              actors, set max_tasks_in_flight_per_actor to 1.
              """
          
              def __init__(
                  self,
                  *,
                  size: Optional[int] = None,
                  min_size: Optional[int] = None,
                  max_size: Optional[int] = None,
                  max_tasks_in_flight_per_actor: Optional[int] = None,
              ):
                  """Construct ActorPoolStrategy for a Dataset transform.
          
                  Args:
                      size: Specify a fixed size actor pool of this size. It is an error to
                          specify both `size` and `min_size` or `max_size`.
                      min_size: The minimize size of the actor pool.
                      max_size: The maximum size of the actor pool.
                      max_tasks_in_flight_per_actor: The maximum number of tasks to concurrently
                          send to a single actor worker. Increasing this will increase
                          opportunities for pipelining task dependency prefetching with
                          computation and avoiding actor startup delays, but will also increase
                          queueing delay.
                  """
                  if size is not None:
                      if size < 1:
                          raise ValueError("size must be >= 1", size)
                      if max_size is not None or min_size is not None:
                          raise ValueError(
                              "min_size and max_size cannot be set at the same time as `size`"
                          )
                      min_size = size
                      max_size = size
                  if min_size is not None and min_size < 1:
                      raise ValueError("min_size must be >= 1", min_size)
                  if max_size is not None:
                      if min_size is None:
                          min_size = 1  # Legacy default.
                      if min_size > max_size:
                          raise ValueError("min_size must be <= max_size", min_size, max_size)
                  if (
                      max_tasks_in_flight_per_actor is not None
                      and max_tasks_in_flight_per_actor < 1
                  ):
                      raise ValueError(
                          "max_tasks_in_flight_per_actor must be >= 1, got: ",
                          max_tasks_in_flight_per_actor,
                      )
                  self.min_size = min_size or 1
                  self.max_size = max_size or float("inf")
                  self.max_tasks_in_flight_per_actor = max_tasks_in_flight_per_actor
                  self.num_workers = 0
                  self.ready_to_total_workers_ratio = 0.8
          
              def __eq__(self, other: Any) -> bool:
                  return isinstance(other, ActorPoolStrategy) and (
                      self.min_size == other.min_size
                      and self.max_size == other.max_size
                      and self.max_tasks_in_flight_per_actor
                      == other.max_tasks_in_flight_per_actor
                  )
          
          
          def get_compute(compute_spec: Union[str, ComputeStrategy]) -> ComputeStrategy:
              if not isinstance(compute_spec, (TaskPoolStrategy, ActorPoolStrategy)):
                  raise ValueError(
                      "In Ray 2.5, the compute spec must be either "
                      f"TaskPoolStrategy or ActorPoolStrategy, was: {compute_spec}."
                  )
              elif not compute_spec or compute_spec == "tasks":
                  return TaskPoolStrategy()
              elif compute_spec == "actors":
                  return ActorPoolStrategy()
              elif isinstance(compute_spec, ComputeStrategy):
                  return compute_spec
              else:
                  raise ValueError("compute must be one of [`tasks`, `actors`, ComputeStrategy]")


matthewdeng · September 11, 2022, 1:11am

Hey @Basasuya, can you share more about your use-case and if you’re running into some performance issues because of this?

If you have a few large blocks and as a result your parallelization is limited by the number of blocks as opposed to resources, one option is to repartition your dataset first so that you have more blocks that are each smaller.

Basasuya · September 15, 2022, 1:09pm

I have hundreds of files, and each file can be large (several GB). My code looks like this:

    dataset = (
        ray.data.read_binary_files(paths=input_path)
        .window(blocks_per_window=5)
        .map_batches(parse_db_line, batch_size=None, compute=ActorPoolStrategy(5, 5, 1))
        .map_batches(predictor, batch_size=None, compute=ActorPoolStrategy(5, 5, 1), num_gpus=1, num_cpus=1)
        .write_json(output_path)
    )

I think one solution is to use repartition_each_window, which is an all-to-all stage?
Maybe a one-to-one repartition stage could be supported in the future, which would run faster.
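
As a rough illustration of what a one-to-one repartition could look like, the sketch below splits each block independently into contiguous sub-blocks, so no rows move between blocks. This is plain Python, not a Ray API: `split_block` and `one_to_one_repartition` are hypothetical names, and blocks are modeled as Python lists.

```python
def split_block(block, num_splits):
    """Split one block into up to `num_splits` contiguous, near-equal sub-blocks."""
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    base, extra = divmod(len(block), num_splits)
    out, start = [], 0
    for i in range(num_splits):
        # The first `extra` sub-blocks take one additional row each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer rows than requested splits
        out.append(block[start:start + size])
        start += size
    return out


def one_to_one_repartition(blocks, splits_per_block):
    """Split each block on its own; row order across the output is preserved."""
    return [sub for block in blocks for sub in split_block(block, splits_per_block)]


window = [list(range(10)), list(range(10, 14))]  # one window with 2 blocks
smaller = one_to_one_repartition(window, splits_per_block=3)
print(len(smaller))  # 6
print(smaller)
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11], [12], [13]]
```

Because each input block only produces its own sub-blocks, every split can run without waiting for any other block. The trade-off is that output block sizes follow the input sizes rather than being balanced across the whole window.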

jianxiao · September 19, 2022, 11:28pm

@Basasuya As you understand, the parallelism is determined by the number of blocks in a Dataset. There is no plan to support parallelization across batches within a block. However, there are a few things to call out:

- When processing a batch, the execution is vectorized (e.g. by leveraging Arrow's compute kernels), so there is already parallelization at the hardware level.
- As mentioned above by @matthewdeng, you can use repartition() to increase the number of blocks. You're right that repartition() is an all-to-all operation, but if you don't request a shuffle during repartition (it's disabled by default), it's actually quite efficient, likely only involving block splitting (large blocks split into smaller ones).
- If possible, you may shard your input files into smaller ones (e.g. if you are using Spark upstream, it should be doable to repartition before writing out the files).
- If that is not possible, the good news is that we plan to build a feature called Dynamic Block Splitting, which means you can produce multiple (small) blocks from a single input file as you read it into a Dataset. It's estimated to arrive in the Ray 2.1 release, stay tuned!
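
The idea behind producing several small blocks from one large input file can be sketched in plain Python as follows. This is only an illustration of the concept, not how Ray implements Dynamic Block Splitting: `read_as_blocks` and `target_block_bytes` are hypothetical names, and a list of byte strings stands in for the records of one file.

```python
def read_as_blocks(records, target_block_bytes):
    """Yield lists of records, closing a block once it reaches the byte budget."""
    if target_block_bytes < 1:
        raise ValueError("target_block_bytes must be >= 1")
    block, block_bytes = [], 0
    for record in records:
        block.append(record)
        block_bytes += len(record)
        if block_bytes >= target_block_bytes:
            yield block
            block, block_bytes = [], 0
    if block:
        yield block  # leftover records form a final, smaller block


records = [b"x" * 40 for _ in range(10)]  # one 400-byte "file" of 10 records
blocks = list(read_as_blocks(records, target_block_bytes=100))
print([len(b) for b in blocks])  # [3, 3, 3, 1]
```

With a reader like this, a single multi-GB file would yield many blocks instead of one, so downstream transforms get more units of work to run in parallel without a separate repartition step.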

Basasuya · October 1, 2022, 1:35pm

@jianxiao @matthewdeng
Thank you for the replies.
I think repartition_each_window is useful for me.
Looking forward to using Dynamic Block Splitting in a future release.
