How to set up distributed training with alpa on AWS?

(Medium: It contributes to significant difficulty to complete my task, but I can work around it).

Has anyone had success setting up a Ray cluster on AWS for training distributed deep learning models with alpa?

Based on the documentation, it seems like the thing to do is simply to spin up a cluster using the rayproject/ray-ml:latest-gpu Docker image and install alpa. However, I haven't been able to install cupy-cuda successfully on the head node, which is step one of installing alpa.

I have been able to install alpa without the Docker image by simply specifying an appropriate Deep Learning AMI, but I have the feeling that I might be missing subtle setup steps this way. I also find it confusing that we specify both an AMI and a Docker image in the cluster config file.

What does the AMI provide if we have a Docker image? What do we miss if we only specify an AMI?

Any thoughts or advice welcome :slight_smile:

Hi @cswaney, we're actively working with the Alpa team to make it more accessible in the Ray OSS ecosystem, including on AWS. Feel free to tag me in the Ray / Alpa Slack channels with future questions :slight_smile:

For your question: what are the CUDA / cuDNN versions on your host? With the ray-ml Docker image, assuming you have CUDA 11.2 / cuDNN 8.1.1, you should be able to install the Python dependencies with pip:

pip install alpa tensorflow-gpu cupy-cuda112

Then ensure that

python -c "from cupy.cuda import nccl"

returns without any error.

Finally, install the jaxlib wheel built for those CUDA / cuDNN versions from Alpa's wheel index:

pip install jaxlib==0.3.15+cuda112.cudnn810 -f https://alpa-projects.github.io/wheels.html

Feel free to share the error messages and symptoms you're seeing if there's any context I've missed.

Hi @cswaney, regarding your questions about Docker and AMIs: the AMI is a required part of the VM configuration, since it is the OS image the EC2 instance actually boots. If you're using Docker, you don't need to be too picky about the AMI; it mainly needs to supply the NVIDIA driver and a Docker runtime that can expose the GPUs to the container, which the Deep Learning AMIs already do. Without Docker, the AMI is where your whole environment comes from, so you can prebake your own images, install drivers, or pick an AMI with your favorite driver versions preinstalled.
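To make that concrete, here is a minimal sketch of how the two pieces fit together in a Ray cluster launcher YAML. Everything except the Docker image name and the pip commands from the earlier reply is a placeholder (cluster name, region, ssh user, instance types, and especially the ImageId, which you'd replace with a GPU Deep Learning AMI for your region); treat it as an illustration of where each setting lives rather than a tested config.

cluster_name: alpa-example        # placeholder
max_workers: 2

provider:
  type: aws
  region: us-east-1               # placeholder

auth:
  ssh_user: ubuntu                # matches Ubuntu-based Deep Learning AMIs

# The container Ray starts on every node; your training code and all of the
# pip installs above live inside this image.
docker:
  image: rayproject/ray-ml:latest-gpu
  container_name: ray_container

# The AMI is what the EC2 instance itself boots; it supplies the host OS,
# the NVIDIA driver, and the Docker runtime that the container relies on.
available_node_types:
  head_node:
    node_config:
      InstanceType: p3.2xlarge                # placeholder GPU instance type
      ImageId: ami-0123456789abcdef0          # placeholder Deep Learning AMI ID
  worker_node:
    min_workers: 2
    max_workers: 2
    node_config:
      InstanceType: p3.2xlarge
      ImageId: ami-0123456789abcdef0

head_node_type: head_node

# These run inside the container on every node and mirror the steps above.
setup_commands:
  - pip install alpa tensorflow-gpu cupy-cuda112
  - pip install jaxlib==0.3.15+cuda112.cudnn810 -f https://alpa-projects.github.io/wheels.html

The split is the same as above: the AMI only needs to boot the instance and expose the GPUs to Docker, while the container determines the CUDA / cuDNN / Python stack your training code actually sees.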