Are there any hacks to use nsys in Ray?

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

It is very common to use NVIDIA Nsight Systems/Compute to profile GPU workloads. However, AFAIK, nsys profile can only be used as a launcher and must be started before the real workload; attaching to an existing process with nsys is not feasible either. The most difficult obstacle lies in Ray's lightweight abstraction, which exposes classes/functions rather than processes.

Are there any tricky hacks or suggestions for using nsys with Ray? For example, customizing the way Ray launches new worker processes?

@cade Can you advise here?

Hi @yzs! What kind of workload do you want to profile? I think @sangcho has been thinking about ways to integrate nvidia nsight into Ray. Feel free to +1 this issue [Feature] NSight GPU profiler support · Issue #19631 · ray-project/ray · GitHub

I have a hack that enables using nsys with Ray – you can run a Ray Python script under nsys, and if you specify RAY_ADDRESS=local, then nsys will track all of the raylet processes too. I had to make sure to call the nvtx APIs from the driver process so that nsys correctly attaches at the “root” (driver) process (instead of a raylet), but other than that, it should work. This forgoes multi-node clusters, but I’m not sure one can even combine nsys runs from multiple nodes without Ray.

Let me know if you want help setting this up, I can share my code.

Thanks to both @cade and @xwjiang2010 for the replies!

What kind of workload do you want to profile

I want to profile deep learning training workloads that use GPUs on Ray, which may be submitted to Ray via a customized Actor. I have only tried a Ray cluster on bare metal, which is slightly different from the original GitHub issue (Ray on k8s).

I would really appreciate it if you could share more details! I guess there are some modifications to the driver process, such as calling the nvtx APIs?

This forgoes multi-node clusters but I’m not sure if one can even combine nsys runs from multiple nodes without Ray.

Yes, nsys can only be used at the process level. However, for profiling distributed deep learning training, it is usually enough to run nsys on just one of the processes across the nodes. So it would be useful if this hack could be applied to a Ray cluster, so that the actors/functions scheduled onto the node with the modified driver process could be profiled. Are there any obstacles between local mode and a Ray cluster?

A question similar to the original issue: could this be implemented as an on-demand plugin in Ray? Profiling tools like nsys introduce significant overhead due to the collection of hardware performance counters, so it would be a great benefit to be able to enable nsys on demand.

The way I got it to work was running nsys on the driver script with RAY_ADDRESS=local. This allows nsys to trace subprocesses as well (such as the Ray workers, where tasks/actors run).
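Roughly, the setup looks like the sketch below. This is just a minimal illustration, not my exact code – the script name, report name, and remote function are placeholders, and the exact nsys flags depend on your version:

```python
# Launch the driver under nsys with a local (single-node) Ray cluster, e.g.:
#   RAY_ADDRESS=local nsys profile -o ray_report python driver.py
# (driver.py and ray_report are placeholder names)
import ray

@ray.remote(num_gpus=1)
def train_step():
    # ... GPU work to be profiled; kernels show up in the nsys timeline ...
    return "done"

if __name__ == "__main__":
    ray.init()  # with RAY_ADDRESS=local this starts a fresh local cluster
    print(ray.get(train_step.remote()))
    ray.shutdown()
```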

I then encountered issues in how nsys was aggregating events into the final report – I fixed this by invoking nvtx from the driver process before starting any Ray actors or tasks, e.g. cupy.cuda.nvtx.RangePush('outer_range'). I'm not sure exactly why this fixed it; I think it's because the process which first invokes nsys injection into the CUDA runtime is responsible for aggregating events. Thus, if a worker process is the first one to instrument the CUDA runtime, you'll lose events after it dies; the driver process outlives all worker processes, so it is a good place to do aggregation. But it's just a hypothesis.
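In code, that ordering looks roughly like this (a minimal sketch; the 'outer_range' label and the Trainer actor are just illustrative):

```python
import ray
import cupy

@ray.remote(num_gpus=1)
class Trainer:
    def step(self):
        # ... GPU kernels launched here are captured by nsys ...
        return "ok"

if __name__ == "__main__":
    ray.init()
    # Push an NVTX range from the driver BEFORE any actors/tasks run,
    # so the driver is the first process to touch the CUDA/NVTX injection.
    cupy.cuda.nvtx.RangePush("outer_range")
    trainer = Trainer.remote()
    print(ray.get(trainer.step.remote()))
    cupy.cuda.nvtx.RangePop()
    ray.shutdown()
```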

Hope this helps get you started!
