Previously there were ways to invoke the Torch profiler through Ray Train, as in this PR:
[ray-project/ray] matthewdeng:pytorch-profiler → ray-project:master, opened 13 Feb 2022
## Why are these changes needed?
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.
```
File "ray_sgd_training.py", line 18, in <module>
  from ray import train
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
  from ray.train.callbacks import TrainingCallback
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
  from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
  from torch.profiler import profile
ModuleNotFoundError: No module named 'torch.profiler'
```
A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:
```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
"Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
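The deferred-import-validation pattern described in point 2 can be sketched as follows. This is a minimal illustration using `importlib`; the class body and error message mirror the traceback above, but the actual `ray.train.torch` implementation may differ:

```python
# Sketch: defer the hard dependency check from import time to __init__,
# so `from ray.train.torch import TorchWorkerProfiler` never fails --
# only instantiating the profiler without a suitable torch does.
import importlib.util


class TorchWorkerProfiler:
    def __init__(self):
        # Check for torch first; find_spec on a submodule would raise
        # ModuleNotFoundError if the parent package is missing entirely.
        if (importlib.util.find_spec("torch") is None
                or importlib.util.find_spec("torch.profiler") is None):
            raise ImportError(
                "Torch Profiler requires torch>=1.8.1. "
                "Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler."
            )
```

The key design point is that the `torch` import lives inside the class rather than at module top level, so `ray.train` stays importable on minimal installations.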
## Related issue number
Follow-up to https://github.com/ray-project/ray/pull/21864
However, the docs say that Ray 2.5.0 does not natively support any GPU profiling. Was there additional context around removing this capability? I’d be interested in contributing a feature that enables tracing for code using Ray.
cc @matthewdeng Could you share some context here? Thank you!
Hey! You can still utilize Torch profiling directly in your Torch training loop, and access the created file(s) after training.
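A minimal sketch of that approach, calling `torch.profiler` directly inside the training loop (assumes `torch>=1.8.1`; the function name, model, step counts, and trace directory are illustrative and not part of any Ray API):

```python
# Sketch: run torch.profiler inside a per-worker training loop and
# collect the trace files it writes after training finishes.
import os


def train_loop_per_worker(num_steps=6, trace_dir="/tmp/profiler_traces"):
    import torch
    from torch.profiler import profile, schedule, tensorboard_trace_handler

    model = torch.nn.Linear(8, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    with profile(
        # Skip 1 step, warm up for 1, then record 2 active steps.
        schedule=schedule(wait=1, warmup=1, active=2),
        # Write Chrome-trace files that TensorBoard can load.
        on_trace_ready=tensorboard_trace_handler(trace_dir),
    ) as prof:
        for _ in range(num_steps):
            x = torch.randn(4, 8)
            loss = model(x).sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            prof.step()  # advance the profiler schedule each iteration

    # After training, the trace file(s) are on disk for inspection.
    return sorted(os.listdir(trace_dir)) if os.path.isdir(trace_dir) else []
```

Since the profiler writes to a local directory on each worker, you would collect or upload those files after the training loop returns, e.g. with whatever artifact-syncing mechanism your setup already uses.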
Could you share more about the type of tracing you would be interested in adding for Ray?
cc @Huaiwei_Sun
I’m trying to use Ray for metrics while running a vanilla Torch DDP training script launched with torchrun, but I’m running into NCCL errors when calling `ray.init()` and `torch.distributed.init_process_group()` in the same program.
For context, I’m trying to write an HTTP Server (actor) to be able to trigger PyTorch traces remotely.
According to @matthewdeng, PyTorch Profiler should work out of the box when you use Ray Train with it.
Are you using Ray Train?