How to set up distributed training with alpa on AWS?

(Medium: It contributes to significant difficulty to complete my task, but I can work around it).

Has anyone had success setting up a Ray cluster on AWS for training distributed deep learning models with alpa?

Based on the documentation, it seems like the thing to do is simply to spin up a cluster using the rayproject/ray-ml:latest-gpu Docker image and install alpa. However, I haven't been able to install cupy-cuda successfully on the head node, which is step one of installing alpa.

I have been able to install alpa without the Docker image by simply specifying an appropriate Deep Learning AMI, but I have the feeling that I might be missing subtle setup steps this way. I also find it confusing that we specify both an AMI and a Docker image in the cluster config file.

What does the AMI provide if we have a Docker image? What do we miss if we only specify an AMI?

Any thoughts or advice welcome :slight_smile:

Hi @cswaney, we're actively working with the Alpa team to make it more accessible in the Ray OSS ecosystem, including on AWS. Feel free to tag me in the Ray / Alpa Slack channels with future questions :slight_smile:

For your question: what are the CUDA / cuDNN versions on your host? With the ray-ml Docker image, assuming you have CUDA 11.2 / cuDNN 8.1.1, you should be able to install the Python dependencies with pip:

pip install alpa tensorflow-gpu cupy-cuda112

Then ensure that

python -c "from cupy.cuda import nccl"

returns without any error.

Finally, install the jaxlib wheel built for those CUDA / cuDNN versions from Alpa's wheel index:

pip install jaxlib==0.3.15+cuda112.cudnn810 -f https://alpa-projects.github.io/wheels.html

Feel free to share the error messages and symptoms you're seeing if there's any context I've missed.

Hi @cswaney, regarding your questions about Docker and AMIs: the AMI is a required part of the VM configuration, since it is the OS image the EC2 instance actually boots. If you're using Docker, you don't need to be too picky about the AMI; it mainly needs to supply the NVIDIA driver and a Docker runtime that can expose the GPUs to the container, which the Deep Learning AMIs already do. Without Docker, the AMI is where your whole environment comes from, so you can prebake your own images, install drivers, or pick an AMI with your favorite driver versions preinstalled.
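To make that concrete, here is a minimal sketch of how the two pieces fit together in a Ray cluster launcher YAML. Everything except the Docker image name and the pip commands from the earlier reply is a placeholder (cluster name, region, ssh user, instance types, and especially the ImageId, which you'd replace with a GPU Deep Learning AMI for your region); treat it as an illustration of where each setting lives rather than a tested config.

cluster_name: alpa-example        # placeholder
max_workers: 2

provider:
  type: aws
  region: us-east-1               # placeholder

auth:
  ssh_user: ubuntu                # matches Ubuntu-based Deep Learning AMIs

# The container Ray starts on every node; your training code and all of the
# pip installs above live inside this image.
docker:
  image: rayproject/ray-ml:latest-gpu
  container_name: ray_container

# The AMI is what the EC2 instance itself boots; it supplies the host OS,
# the NVIDIA driver, and the Docker runtime that the container relies on.
available_node_types:
  head_node:
    node_config:
      InstanceType: p3.2xlarge                # placeholder GPU instance type
      ImageId: ami-0123456789abcdef0          # placeholder Deep Learning AMI ID
  worker_node:
    min_workers: 2
    max_workers: 2
    node_config:
      InstanceType: p3.2xlarge
      ImageId: ami-0123456789abcdef0

head_node_type: head_node

# These run inside the container on every node and mirror the steps above.
setup_commands:
  - pip install alpa tensorflow-gpu cupy-cuda112
  - pip install jaxlib==0.3.15+cuda112.cudnn810 -f https://alpa-projects.github.io/wheels.html

The split is the same as above: the AMI only needs to boot the instance and expose the GPUs to Docker, while the container determines the CUDA / cuDNN / Python stack your training code actually sees.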