Ray Scheduler vs Spark Scheduler - Architecture understanding?

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hi guys, I hope all of you are doing well.

I am a Databricks user, and I recently started using Ray Core to simplify parallelization for my team's solutions. Ray has also saved me a ton of time when I got stuck on Spark optimization.

I am curious how the Ray scheduler works when it schedules tasks onto CPUs (on both single-node and multi-node clusters).

  • How does Ray handle user functions that have their own parallel implementation?

  • Does Ray lock, reserve, and isolate CPUs for the task to be run?

  • How is the Ray scheduler different from the Spark scheduler, from an architecture perspective?

I ask because my impression is that Ray isolates CPUs for each task and does not consider the user function's internal parallelism, which would mean there is no conflict over CPU acquisition between Ray and the user function.

I have an experimental use case of running LightGBM on thousands of data files, where Ray gives me a speed-up of more than 2000x versus Spark's applyInPandas.

From my experiments:

  • With Spark, if the user function has its own parallel implementation (e.g. LightGBM, XGBoost, …), it can create CPU-acquisition conflicts between Spark and itself, which makes the Spark solution extremely slow.

  • With Spark, removing the user function's internal parallelism gives a dramatic speed-up, but it is still 3x-4x slower than Ray (see the sketch after this list).
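
For reference, here is a minimal sketch of the single-threaded Spark variant I mean, assuming a DataFrame with a `file_id` grouping column and a `label` column (both hypothetical stand-ins); the key line is pinning LightGBM to one thread so it does not fight Spark for cores:

```python
import pandas as pd
import lightgbm as lgb
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def train_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Pin LightGBM to one thread: Spark schedules each UDF invocation
    # as if it needs exactly one core, so internal parallelism here
    # would fight Spark for the same CPUs.
    model = lgb.LGBMRegressor(n_jobs=1)
    model.fit(pdf.drop(columns=["file_id", "label"]), pdf["label"])
    return pd.DataFrame({"file_id": [pdf["file_id"].iloc[0]],
                         "n_rows": [len(pdf)]})

df = spark.read.parquet("data/")  # hypothetical input path
result = (df.groupBy("file_id")
            .applyInPandas(train_one_group,
                           schema="file_id string, n_rows long"))
```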

Hope we can have a good discussion together,
Duc Lai.


Glad to hear Ray has been working out for you!

How does Ray handle user functions that have their own parallel implementation?

Great question: it's done via @ray.remote(num_cpus=x). If you specify @ray.remote(num_cpus=8), Ray will allocate 8 CPUs for your task/actor, and you then manually tell your function to use 8 CPUs. Note that to simplify this process, Ray sets the environment variable OMP_NUM_THREADS, which most numerical libraries (numpy, TensorFlow, PyTorch, etc.) respect as a way of specifying parallelism, so if you're using those, no action is required.
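
A minimal sketch of that pattern (the parquet paths and the `label` column are hypothetical stand-ins):

```python
import ray

ray.init()

# Reserve 8 logical CPUs per invocation of this task.
@ray.remote(num_cpus=8)
def train(file_path: str) -> float:
    import lightgbm as lgb
    import pandas as pd

    df = pd.read_parquet(file_path)
    # Match the library's thread count to the reservation; OpenMP-based
    # libraries would also pick this up from OMP_NUM_THREADS.
    model = lgb.LGBMRegressor(n_jobs=8)
    model.fit(df.drop(columns=["label"]), df["label"])
    return float(model.score(df.drop(columns=["label"]), df["label"]))

# Hypothetical file list; Ray runs as many of these concurrently as
# the cluster's CPU count divided by 8 allows.
futures = [train.remote(p) for p in ["part-0.parquet", "part-1.parquet"]]
print(ray.get(futures))
```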

Does Ray lock, reserve, and isolate CPUs for the task to be run?

It does not. Ray only guarantees that a certain number of tasks will be running at once; num_cpus is a logical accounting resource for the scheduler, not an enforced limit. Note that you can set cgroup information within your task to achieve this. There are also some proposals for tighter integration with cgroups, but we'd love to hear more about your use case if you need this.
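
A toy sketch on a local cluster to illustrate: the task below reserves one logical CPU yet spawns eight busy threads without complaint, because the reservation is scheduler bookkeeping rather than OS-level isolation:

```python
import threading
import time
import ray

ray.init(num_cpus=2)

# The scheduler reserves 1 logical CPU for this task, but nothing
# stops the function body from spawning more threads than that.
@ray.remote(num_cpus=1)
def oversubscribe(n_threads: int = 8) -> int:
    def spin():
        end = time.time() + 0.5
        while time.time() < end:
            pass  # busy-wait to occupy a core

    threads = [threading.Thread(target=spin) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return n_threads

# Runs fine despite using 8x the reserved CPUs.
print(ray.get(oversubscribe.remote()))
```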

I have an experimental use case of running LightGBM on thousands of data files, where Ray gives me a speed-up of more than 2000x versus Spark's applyInPandas.

I suspect what you're seeing here is thrashing, since Spark assumes each of your UDFs requires 1 CPU. In general, you want the same number of compute threads as CPU cores: any fewer and you aren't fully utilizing your machine; any more and you waste time on context switches. Lower-level frameworks (like TensorFlow or numpy) tend to have more optimized single-machine parallel routines, so Ray lets you specify num_cpus=8 and then stays out of the way.
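
To make the arithmetic concrete: on a 16-core executor, 16 UDF slots each launching LightGBM with 16 internal threads yields up to 16 x 16 = 256 runnable threads competing for 16 cores. For many small files, the opposite extreme usually works best on Ray: one single-threaded trainer per core, as in this sketch (paths and the `label` column are hypothetical):

```python
import ray

ray.init()

# One logical CPU and one LightGBM thread per task: with thousands
# of small files, Ray packs exactly one single-threaded trainer onto
# each core, so nothing oversubscribes.
@ray.remote(num_cpus=1)
def train_file(path: str) -> str:
    import lightgbm as lgb
    import pandas as pd

    df = pd.read_parquet(path)
    model = lgb.LGBMRegressor(n_jobs=1)
    model.fit(df.drop(columns=["label"]), df["label"])
    return path

paths = [f"data/part-{i}.parquet" for i in range(1000)]
done = ray.get([train_file.remote(p) for p in paths])
```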
