Previously there were ways to invoke the Torch profiler through Ray Train, as in this PR:
[ray-project/ray] matthewdeng:pytorch-profiler → ray-project:master, opened 13 Feb 2022
## Why are these changes needed?
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.
```
File "ray_sgd_training.py", line 18, in <module>
  from ray import train
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
  from ray.train.callbacks import TrainingCallback
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
  from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
  from torch.profiler import profile
ModuleNotFoundError: No module named 'torch.profiler'
```
A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:
```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
"Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
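The deferred-import-validation pattern described in point 2 can be sketched as follows. This is a minimal illustration using `importlib`; the class body and error message mirror the traceback above, but the actual `ray.train.torch` implementation may differ:

```python
# Sketch: defer the hard dependency check from import time to __init__,
# so `from ray.train.torch import TorchWorkerProfiler` never fails --
# only instantiating the profiler without a suitable torch does.
import importlib.util


class TorchWorkerProfiler:
    def __init__(self):
        # Check for torch first; find_spec on a submodule would raise
        # ModuleNotFoundError if the parent package is missing entirely.
        if (importlib.util.find_spec("torch") is None
                or importlib.util.find_spec("torch.profiler") is None):
            raise ImportError(
                "Torch Profiler requires torch>=1.8.1. "
                "Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler."
            )
```

The key design point is that the `torch` import lives inside the class rather than at module top level, so `ray.train` stays importable on minimal installations.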
## Related issue number
Follow-up to https://github.com/ray-project/ray/pull/21864
However, the docs say that Ray 2.5.0 does not natively support any GPU profiling. Was there additional context around removing this capability? I’d be interested in contributing a feature that enables tracing for code using Ray.
cc @matthewdeng Could you share some context here? Thank you!
Hey! You can still utilize Torch profiling directly in your Torch training loop, and access the created file(s) after training.
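A minimal sketch of that approach, calling `torch.profiler` directly inside the training loop (assumes `torch>=1.8.1`; the function name, model, step counts, and trace directory are illustrative and not part of any Ray API):

```python
# Sketch: run torch.profiler inside a per-worker training loop and
# collect the trace files it writes after training finishes.
import os


def train_loop_per_worker(num_steps=6, trace_dir="/tmp/profiler_traces"):
    import torch
    from torch.profiler import profile, schedule, tensorboard_trace_handler

    model = torch.nn.Linear(8, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    with profile(
        # Skip 1 step, warm up for 1, then record 2 active steps.
        schedule=schedule(wait=1, warmup=1, active=2),
        # Write Chrome-trace files that TensorBoard can load.
        on_trace_ready=tensorboard_trace_handler(trace_dir),
    ) as prof:
        for _ in range(num_steps):
            x = torch.randn(4, 8)
            loss = model(x).sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            prof.step()  # advance the profiler schedule each iteration

    # After training, the trace file(s) are on disk for inspection.
    return sorted(os.listdir(trace_dir)) if os.path.isdir(trace_dir) else []
```

Since the profiler writes to a local directory on each worker, you would collect or upload those files after the training loop returns, e.g. with whatever artifact-syncing mechanism your setup already uses.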
Could you share more about the type of tracing you would be interested in adding for Ray?
cc @Huaiwei_Sun
I’m trying to use Ray for metrics while running a vanilla Torch DDP training script launched with torchrun, but I’m running into NCCL errors when calling `ray.init()` and `torch.distributed.init_process_group()` in the same program.
For context, I’m trying to write an HTTP Server (actor) to be able to trigger PyTorch traces remotely.
According to @matthewdeng, PyTorch Profiler should work out of the box when you use Ray Train with it.
Are you using Ray Train?