(Medium: It adds significant difficulty to completing my task, but I can work around it).
Has anyone had success setting up a Ray cluster on AWS for training distributed deep learning models with alpa?
Based on the documentation it seems like the thing to do is simply spin up a cluster using the rayproject/ray-ml:latest-gpu Docker image and install alpa. However, I haven’t been able to successfully install cupy-cuda on the head node, which is step one of installing alpa.
I have been able to install alpa without the Docker image by simply specifying an appropriate Deep Learning AMI, but I have the feeling that I might be missing subtle setup steps this way. I find it confusing that we should specify both an AMI and a Docker image in the config file.
What does the AMI provide if we have a Docker image? What do we miss if we only specify an AMI?
Hi @cswaney, we’re actively working with the Alpa team to make it more accessible in the Ray OSS ecosystem, including on AWS. Feel free to tag me in the Ray / Alpa Slack channels for future questions.
For your question: what are the CUDA / cuDNN versions on your host? With the ray-ml Docker image, assuming you have CUDA 11.2 / cuDNN 8.1.1, you should be able to pip install the following:

pip install alpa tensorflow-gpu cupy-cuda112

Then verify that NCCL is importable:

python -c "from cupy.cuda import nccl"
Hi @cswaney, regarding your questions about Docker/AMIs: the AMI is a required part of the VM configuration. If you’re using Docker, you don’t need to be too picky with the AMI. Without Docker, the AMI is where you can prebake images, install drivers, or pick an image with your favorite driver versions preinstalled.
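To make the split concrete, here is a minimal sketch of a cluster config that specifies both. The field names follow Ray's AWS autoscaler schema; the region, instance type, and AMI ID are placeholders I picked for illustration, not recommendations:

```yaml
# The containers actually run your Ray processes and dependencies.
docker:
    image: rayproject/ray-ml:latest-gpu
    container_name: ray_container

provider:
    type: aws
    region: us-west-2  # placeholder region

available_node_types:
    ray.head.default:
        node_config:
            InstanceType: p3.2xlarge        # placeholder instance type
            ImageId: ami-0123456789abcdef0  # placeholder; any AMI with a
                                            # working NVIDIA driver will do
                                            # when Docker is used
```

With this setup the AMI mainly supplies the host OS and GPU driver, while everything your training code sees (CUDA toolkit, Python packages, Alpa) comes from the Docker image.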