Are there any hacks to use nsys in Ray?

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

It is very common to use NVIDIA Nsight Systems/Compute to profile GPU workloads. However, AFAIK, nsys profile can only be used as a launcher, so it has to be started before the real workload. Attaching nsys to an existing process is not feasible either. The main obstacle is Ray's lightweight abstraction, which exposes classes/functions rather than processes.

Are there any hacks or suggestions for using nsys in Ray? For example, customizing the way Ray launches new worker processes?

@cade Can you advise here?

Hi @yzs! What kind of workload do you want to profile? I think @sangcho has been thinking about ways to integrate NVIDIA Nsight into Ray. Feel free to +1 this issue: [Feature] NSight GPU profiler support · Issue #19631 · ray-project/ray

I have a hack that enables using nsys with Ray – you can run a Ray python script using nsys and if you specify RAY_ADDRESS=local, then nsys will track all of the raylet processes too. I had to make sure I call the nvtx APIs from the driver process so that nsys correctly attaches at the “root” (driver) process (instead of a raylet), but other than that, it should work. This forgoes multi-node clusters but I’m not sure if one can even combine nsys runs from multiple nodes without Ray.

Let me know if you want help setting this up, I can share my code.

Thanks to both @cade and @xwjiang2010 for the replies!

What kind of workload do you want to profile

I want to profile deep learning training workloads that use GPUs on Ray and that may be submitted to Ray through a customized Actor. I have only tried a Ray cluster on bare metal, which is slightly different from the original GitHub issue (Ray on k8s).

It would be really appreciated if you could share more details! I guess there are some modifications to the driver process, such as calling nvtx APIs?

This forgoes multi-node clusters but I’m not sure if one can even combine nsys runs from multiple nodes without Ray.

Yes, nsys can only be used at the process level. However, for profiling distributed deep learning training, it is usually enough to run nsys on just one of the processes across the nodes. So it would be useful if this could be applied to a Ray cluster, so that the actors/functions scheduled onto the node with the modified driver process could be profiled. Are there any obstacles to going from local mode to a Ray cluster?

Another question, similar to the original issue: could this be implemented as an on-demand plugin in Ray? Profiling tools like nsys introduce a large overhead from collecting hardware performance counters, so being able to enable nsys on demand would be a great benefit.

The way I got it to work was running nsys on the driver script with RAY_ADDRESS=local. This allows nsys to trace subprocesses as well (such as the Ray workers, where tasks/actors run).
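
For reference, the launch described above could be scripted roughly as follows. This is only a sketch: build_nsys_command, build_env and profile are hypothetical helper names, and the nsys flags are simply the ones quoted later in this thread, not a recommended set.

```python
import os
import subprocess


def build_nsys_command(script, output, traces=("cuda", "nvtx", "osrt")):
    """Return the argv list for `nsys profile ... python <script>`."""
    return [
        "nsys", "profile",
        "-o", output,
        "--force-overwrite=true",
        "--trace=" + ",".join(traces),
        "python", script,
    ]


def build_env(base=None):
    """Copy the environment and set RAY_ADDRESS=local, as described above."""
    env = dict(os.environ if base is None else base)
    env["RAY_ADDRESS"] = "local"
    return env


def profile(script, output):
    """Launch the driver script under nsys (requires nsys on PATH)."""
    return subprocess.run(
        build_nsys_command(script, output), env=build_env(), check=True
    )
```

Calling profile("run.py", "report") would then be the scripted equivalent of exporting RAY_ADDRESS=local and running nsys profile by hand.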

I then encountered issues in how nsys was aggregating events into the final report. I fixed this by invoking nvtx from the driver process before starting any Ray actors or tasks, e.g. cupy.cuda.nvtx.RangePush('outer_range'). I'm not sure exactly why this fixed it; I think it's because the process that first triggers nsys injection into the CUDA runtime is responsible for aggregating events. Thus, if a worker process is the first one to instrument the CUDA runtime, you'll lose events after it dies; the driver process outlives all worker processes, so it is a good place to do aggregation. But it's just a hypothesis.
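
A minimal sketch of that driver-side NVTX call, assuming CuPy provides the nvtx bindings. The names nvtx_range, run_driver and submit_work are hypothetical, and keeping the range open for the whole run is my own choice rather than something the workaround is known to require.

```python
from contextlib import contextmanager

try:
    # Real NVTX bindings when CuPy and CUDA are available.
    from cupy.cuda import nvtx
except Exception:  # CuPy missing or CUDA not usable on this machine
    nvtx = None


@contextmanager
def nvtx_range(name):
    """Push an NVTX range on entry and pop it on exit (no-op without CuPy)."""
    if nvtx is not None:
        nvtx.RangePush(name)
    try:
        yield
    finally:
        if nvtx is not None:
            nvtx.RangePop()


def run_driver(submit_work):
    """Open the outer range in the driver before any Ray task or actor starts.

    `submit_work` is a hypothetical callable that creates the Ray tasks or
    actors and waits for their results.
    """
    import ray  # imported lazily so nvtx_range works without Ray installed

    with nvtx_range("outer_range"):
        ray.init()
        return submit_work()
```

The driver would call run_driver(...) with a function that submits the remote tasks and calls ray.get on them, so the first NVTX call comes from the driver rather than from a worker.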

Hope this helps get you started!

Hi @cade! I am trying to profile some CuPy code running in Ray with nsys. I have set RAY_ADDRESS=local and called cupy.cuda.nvtx.RangePush(), but nsys outputs an error:

Creating final output files...
Processing [===============================================================100%]

**** Analysis failed with:
Status: TargetProfilingFailed
Props {
  Items {
    Type: DeviceId
    Value: "Local (CLI)"
  }
}
Error {
  Type: RuntimeError
  SubError {
    Type: ProcessEventsError
    Props {
      Items {
        Type: ErrorText
        Value: "/build/agent/work/20a3cfcd1c25021d/QuadD/Host/Analysis/EventHandler/PerfEventHandler.cpp(501): Throw in function void QuadDAnalysis::EventHandler::PerfEventHandler::PutCpuEvent(QuadDCommon::CpuId, QuadDAnalysis::EventHandler::PerfEventHandler::EventPtr)\nDynamic exception type: boost::exception_detail::clone_impl<QuadDAnalysis::ChronologicalOrderError>\nstd::exception::what: ChronologicalOrderError\n[QuadDCommon::tag_message*] = Cpu event chronological order was broken.\n"
      }
    }
  }
}


**** Errors occurred while processing the raw events. ****
**** Please see the Diagnostics Summary page after opening the qdrep file in GUI. ****

Saved report file to "/tmp/nsys-report-194f-17dc-c90f-440d.qdrep"
Importation succeeded with non-fatal errors.
Report file moved to "/mnt/sda/2022-0526/home/hlh/npbench-ray/gemm-tf32-fp32-ray.qdstrm"
Report file moved to "/mnt/sda/2022-0526/home/hlh/npbench-ray/gemm-tf32-fp32-ray.qdrep"

In the nsys GUI, the error message is: Analysis 00:11.898 Some events (18,868) were lost. Certain charts (including CPU utilization) on the timeline may display incorrect data. Try to decrease sampling rate and start a new profiling session. Apart from that, everything seems to be OK in the GUI.

Can this error be ignored? Or could you please share an example of how you call cupy.cuda.nvtx.RangePush('outer_range')? I am not sure whether I am doing it right.

Can you share your code? I have seen nsys drop events before, not really sure what causes it to do that.

Here is my code; it's a gemm function modified from npbench. I run nsys with nsys profile -o gemm-tf32-fp32-ray --force-overwrite=true --trace=cuda,cudnn,cublas,osrt,nvtx python run.py --preset L. I also found that the error message does not always appear; sometimes it disappears after I modify a few lines of code.

#file name: run.py
import ray
import time
import argparse
import cupy as np
from jacobi_2d_cupy import jacobi_2d
from cavity_flow_cupy import cavity_flow
from conv2d_cupy import conv2d_bias
from cholesky2_cupy import cholesky2
from gemm_cupy import gemm_fp32,gemm_tf32,gemm_fp64

from ray.experimental.state.api import summarize_tasks

from cupy.cuda.nvtx import RangePush, RangePop

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    
    parser.add_argument(
        "--preset",
        type=str,
        default="M",
        help="Set size of the problems.",
    )
    
    args, _ = parser.parse_known_args()
    datatype = np.float32
    ray.init()
    print("sending task...")

    results = []
    start = time.time()
    RangePush("Nested Powers of A")

    for i in range(1):
        results.append(gemm_fp32.remote(args.preset))
        results.append(gemm_tf32.remote(args.preset))

    RangePop()
    
    
    result = ray.get(results)
    print("total time1 =", time.time() - start)

    for i in range(len(result)):
        print(result[i])

    
    ray.timeline(filename="./timeline/timeline-gemm-tf32-fp32.json")

#file name: gemm_cupy.py
import cupy as np
import ray
import time
import os
from cupy.cuda.nvtx import RangePush, RangePop

# "S": { "NI": 1000, "NJ": 1100, "NK": 1200 },
#             "M": { "NI": 2500, "NJ": 2750, "NK": 3000 },
#             "L": { "NI": 7000, "NJ": 7500, "NK": 8000 },
#             "paper": { "NI": 2000, "NJ": 2300, "NK": 2600 }

def initialize(NI, NJ, NK, datatype=np.float64):
    alpha = datatype(1.5)
    beta = datatype(1.2)
    C = np.fromfunction(lambda i, j: ((i * j + 1) % NI) / NI, (NI, NJ),
                        dtype=datatype)
    A = np.fromfunction(lambda i, k: (i * (k + 1) % NK) / NK, (NI, NK),
                        dtype=datatype)
    B = np.fromfunction(lambda k, j: (k * (j + 2) % NJ) / NJ, (NK, NJ),
                        dtype=datatype)

    return alpha, beta, C, A, B
@ray.remote(num_gpus=0.1)
def gemm_fp32(preset = "S"):
    os.environ['CUPY_TF32']="0"
    datatype = np.float32
    
    if(preset =="S"):
        NI = 1000
        NJ = 1100
        NK = 1200
    elif(preset == "M"):
        NI = 2500
        NJ = 2750
        NK = 3000
    elif(preset == "L"):
        NI = 7000
        NJ = 7500
        NK = 8000
    elif(preset == "U"):
        NI = 7000*2
        NJ = 7500*2
        NK = 8000
    elif(preset == "paper"):
        NI = 2000
        NJ = 2300
        NK = 2600

    # alpha,beta,C,A,B = initialize(NI,NJ,NK,datatype)
    alpha = datatype(1.5)
    beta = datatype(1.2)
    C = np.fromfunction(lambda i, j: ((i * j + 1) % NI) / NI, (NI, NJ),
                        dtype=datatype)
    A = np.fromfunction(lambda i, k: (i * (k + 1) % NK) / NK, (NI, NK),
                        dtype=datatype)
    B = np.fromfunction(lambda k, j: (k * (j + 2) % NJ) / NJ, (NK, NJ),
                        dtype=datatype)
    
   

    
    RangePush("FP32")
    
    stream = np.cuda.Stream(non_blocking=True)
    
    with stream:
        startEvent = stream.record()
        for _ in range(100):
            C[:] = alpha * A @ B + beta * C
        endEvent = stream.record()
    stream.synchronize()
    RangePop()
    total = np.cuda.get_elapsed_time(startEvent,endEvent)

    result = ["fp32",total/1000]
    return result

@ray.remote(num_gpus=0.1)
def gemm_tf32(preset = "S"):
    datatype = np.float32
    os.environ['CUPY_TF32']="1"

    if(preset =="S"):
        NI = 1000
        NJ = 1100
        NK = 1200
    elif(preset == "M"):
        NI = 2500
        NJ = 2750
        NK = 3000
    elif(preset == "L"):
        NI = 7000
        NJ = 7500
        NK = 8000
    elif(preset == "U"):
        NI = 7000*2
        NJ = 7500*2
        NK = 8000
    elif(preset == "paper"):
        NI = 2000
        NJ = 2300
        NK = 2600

    # alpha,beta,C,A,B = initialize(NI,NJ,NK,datatype)
    alpha = datatype(1.5)
    beta = datatype(1.2)
    C = np.fromfunction(lambda i, j: ((i * j + 1) % NI) / NI, (NI, NJ),
                        dtype=datatype)
    A = np.fromfunction(lambda i, k: (i * (k + 1) % NK) / NK, (NI, NK),
                        dtype=datatype)
    B = np.fromfunction(lambda k, j: (k * (j + 2) % NJ) / NJ, (NK, NJ),
                        dtype=datatype)
    
   

    RangePush("TF32")
    stream = np.cuda.Stream(non_blocking=True)
    
    # print("tf stream: ",np.cuda.get_current_stream())
    start = time.time()
    with stream:
        startEvent = stream.record()
        for _ in range(100):
            C[:] = alpha * A @ B + beta * C
        endEvent = stream.record()
    stream.synchronize()
    RangePop()
    total = np.cuda.get_elapsed_time(startEvent,endEvent)
    # total = time.time()-start
    # with cp.cuda.Device(0):

    #     g_A = cp.asarray(A)
    #     g_B = cp.asarray(B)
    #     g_C = cp.asarray(C)

    #     g_C[:] = g_alpha * g_A @ g_B + g_beta * g_C
    result = ["tf32",total/1000]
    return result

@ray.remote(num_gpus=0.5)
def gemm_fp64(preset = "S"):

    datatype = np.float64
    os.environ['CUPY_TF32']="0"

    if(preset =="S"):
        NI = 1000
        NJ = 1100
        NK = 1200
    elif(preset == "M"):
        NI = 2500
        NJ = 2750
        NK = 3000
    elif(preset == "L"):
        NI = 7000
        NJ = 7500
        NK = 8000
    elif(preset == "U"):
        NI = 7000*2
        NJ = 7500*2
        NK = 8000
    elif(preset == "paper"):
        NI = 2000
        NJ = 2300
        NK = 2600

    # alpha,beta,C,A,B = initialize(NI,NJ,NK,datatype)
    alpha = datatype(1.5)
    beta = datatype(1.2)
    C = np.fromfunction(lambda i, j: ((i * j + 1) % NI) / NI, (NI, NJ),
                        dtype=datatype)
    A = np.fromfunction(lambda i, k: (i * (k + 1) % NK) / NK, (NI, NK),
                        dtype=datatype)
    B = np.fromfunction(lambda k, j: (k * (j + 2) % NJ) / NJ, (NK, NJ),
                        dtype=datatype)
    
   

    start = time.time()
    for _ in range(20):
        C[:] = alpha * A @ B + beta * C
    total = time.time()-start
    # with cp.cuda.Device(0):

    #     g_A = cp.asarray(A)
    #     g_B = cp.asarray(B)
    #     g_C = cp.asarray(C)

    #     g_C[:] = g_alpha * g_A @ g_B + g_beta * g_C
    result = ["fp64",total]
    return result

I think nsys is working correctly unless events that you expect to see are missing in the GUI. CPU stack sampling in particular can drop many events (as it should) when it is oversaturated.

Your code looks correct. As long as you run Ray in local mode (RAY_ADDRESS=local), this will interact correctly with nsys.

(I also didn't know you could do C[:] = alpha * A @ B + beta * C. Is this compiled to a single CUDA kernel, or does it dispatch one for every operation?)

Has anyone tried to use nsys with an actor class? I can capture GPU information in a function task, but I can't capture anything with an actor. Can anyone help me? The following is my test code:

import torch
import ray
import cupy

ray.init()

@ray.remote(
    num_gpus=1,
)
class RayActor:
    def run(self):
        a = torch.randint(0, 2, [128, 2, 2048, 2048]).cuda()
        b = torch.randint(0, 2, [128, 2, 2048, 2048]).cuda()
        for _ in range(100):
            c = a * b
        print("Result on GPU:", c)

ray_actor = RayActor.remote()
ray.get(ray_actor.run.remote())

I run the code with the following command:

export RAY_ADDRESS=local
nsys profile -o ns5 python test.py

nsight support is here: Profiling — Ray 2.9.1
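
As far as I can tell from the linked documentation, recent Ray releases let you request Nsight profiling per task or actor through the runtime_env "nsight" field, without wrapping the driver in nsys. Below is a rough sketch applied to the actor from the question above. Treat the option spelling as an assumption to verify against the docs for your Ray version; make_profiled_actor is a hypothetical helper name, and the synchronize/return lines are my own additions.

```python
# Assumption: Ray's runtime_env "nsight" field from the linked docs;
# "default" is expected to select Ray's default nsys options.
NSIGHT_RUNTIME_ENV = {"nsight": "default"}


def make_profiled_actor():
    """Build the actor class from the question with Nsight profiling enabled."""
    import ray    # imported lazily; needs a Ray version with nsight support
    import torch  # needs a CUDA-enabled PyTorch build

    @ray.remote(num_gpus=1, runtime_env=NSIGHT_RUNTIME_ENV)
    class RayActor:
        def run(self):
            a = torch.randint(0, 2, [128, 2, 2048, 2048]).cuda()
            b = torch.randint(0, 2, [128, 2, 2048, 2048]).cuda()
            for _ in range(100):
                c = a * b
            # Wait for the queued GPU work before the method returns.
            torch.cuda.synchronize()
            return c.sum().item()

    return RayActor
```

The driver would then call ray.init(), create the actor with make_profiled_actor().remote(), and ray.get(actor.run.remote()) as before. The linked docs describe where Ray writes the resulting report files.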