resume=True fails without a useful error message

Alright, would you need just the code for the custom environment? Or all of it, in which case I could try to package the Docker environment I am using so you can run it? Please advise.

I have converted the docker image into a .tar file, which can be loaded with docker load --input *file.tar*. I can also send the custom environment, along with instructions to run and reproduce the issue.

The docker image is 10.7 GB, though, so I am not sure how I would send it. Does this platform allow attachments that large?

Please advise.

The least amount of code that can reproduce the issue, please. So, at best: a reproduction script!

I did not read everything, but I did not notice any mention of fail_fast, so I would recommend adding:
fail_fast="raise", # for debugging!
to tune.run() - it should return a more "pythonic" error :slight_smile:
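A hedged sketch of that suggestion (fail_fast="raise" is a valid Ray Tune argument; the trainable and config names in the comment are placeholders, not the thread's actual experiment):

```python
# fail_fast="raise" tells Tune to re-raise the first trial error as a normal
# Python exception instead of logging "Trial ... FAILED" and continuing,
# which usually yields a far more readable traceback while debugging.
debug_kwargs = {"fail_fast": "raise"}

# In the actual experiment this would be merged into the tune.run() call,
# e.g. (placeholder names; requires `pip install "ray[tune]"`):
#   from ray import tune
#   tune.run(my_trainable, config=my_config, **debug_kwargs)
print(debug_kwargs["fail_fast"])
```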


Would the Dockerfile + code for the custom environment + training config be enough? Should I zip them all and attach them here, or…?

If you can reproduce the issue with a Dockerfile and a custom environment, a single script would be best; that is the norm.

Hello, you can clone from the repository: GitHub - lcodeca/rllibsumodocker: Docker environment for RLLIB+SUMO Utils python library.

Add the contents as per the directory structure found in GitHub - hridayns/Ray-repro-scenario: Reproduction script (ease of access)

Replace Dockerfile with:

FROM tensorflow/tensorflow:latest-gpu-py3

ARG USER_ID
ARG GROUP_ID

RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list

RUN apt-key del 7fa2af80
RUN apt-get update && apt-get install -y --no-install-recommends wget
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb
RUN dpkg -i cuda-keyring_1.0-1_all.deb
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub

# RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# Install system dependencies.
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y install \
    cmake \
    gdb \
    git \
    htop \
    ipython3 \
    libavcodec-dev \
    libavformat-dev \
    libavutil-dev \
    libfox-1.6-0 \
    libfox-1.6-dev \
    libgdal-dev \
    libopenmpi-dev \
    librsvg2-bin \
    libspatialindex-dev \
    libswscale-dev \
    libtool \
    libxerces-c-dev \
    nano \
    psmisc \
    python3.8-dev \
    python3-tk \
    python3-virtualenv \
    python3-packaging \
    rsync \
    screen \
    sudo \
    swig \
    tmux \
    tree \
    vim \
    x11-apps \
    zlib1g-dev && apt-get autoremove -y

# Install Python 3 dependencies for SUMO and scripts

# RUN pip install --upgrade pip
RUN python -m pip install --upgrade pip

# RUN pip install --upgrade keyrings.alt 

RUN python -m pip install --default-timeout=100 --upgrade \
    aiohttp \
    deepdiff \
    dill \
    folium \
    gputil \
    grpcio \
    lxml \
    lz4 \
    matplotlib \
    numpy \
    numpyencoder \
    opencv-python \
    h5py \
    pandas \
    psutil \
    pyproj \
    rtree \
    setproctitle \
    shapely \
    tqdm \
    torchvision

# Install Python 3 dependencies for MARL
RUN python -m pip install gym && \
    python -m pip install ray==2.0.0 && \
    python -m pip install ray[debug]==2.0.0 && \
    python -m pip install ray[rllib]==2.0.0 && \
    python -m pip install ray[tune]==2.0.0   

# Working user
RUN groupadd --gid ${GROUP_ID} alice && \
    useradd -m -s /bin/bash -u ${USER_ID} -g ${GROUP_ID} alice && \
    echo "alice:alice" | chpasswd && adduser alice sudo
USER alice

# Download and install SUMO Version main
WORKDIR /home/alice
RUN git clone --depth 1 --branch main https://github.com/eclipse/sumo.git sumo && \
mkdir -p /home/alice/sumo/build/cmake-build-release

WORKDIR /home/alice/sumo/build/cmake-build-release
RUN cmake -D CHECK_OPTIONAL_LIBS=true -D CMAKE_BUILD_TYPE:STRING=Release /home/alice/sumo && \
    make -j$(nproc)

RUN echo "# SUMO" >> /home/alice/.bashrc && \
    echo "export SUMO_HOME=\"/home/alice/sumo\"" >> /home/alice/.bashrc && \
    echo "export PATH=\"\$SUMO_HOME/bin:\$PATH\"" >> /home/alice/.bashrc

# Directory structure
RUN mkdir -p /home/alice/devel  && \
    mkdir -p /home/alice/learning && \
    mkdir -p /home/alice/libraries

# Download & install RLLIB+SUMO Utils
WORKDIR /home/alice/libraries
RUN git clone --depth 1 https://github.com/hridayns/rllibsumoutils.git rllibsumoutils
WORKDIR /home/alice/libraries/rllibsumoutils
USER root
RUN python -m pip install -e .

# Learning Environment
USER alice
WORKDIR /home/alice/learning
COPY --chown=alice tf-gpu-test.py /home/alice/learning/tf-gpu-test.py
COPY --chown=alice training.sh /home/alice/learning/training.sh

USER alice
WORKDIR /home/alice/learning
CMD ["./training.sh"]

Replace docker-cmd-linux.sh code with:

#!/bin/bash

set -e
set -u

IMAGE_NAME="tf-gpu-sumo-$(date +%Y-%m-%d)"
IMAGE_FOLDER="docker-image-linux"
GPU=true
GPU_OPT="--gpus all"
OPTIRUN=false
OPTIRUN_OPT=""
BUILD=false
CACHE=false
RUN=false
SCREEN=false
EXEC=false
CONTAINER=""
DEVEL_DIR=""
LEARN_DIR=""
COMMAND=""
EXP=""
DETACH=false
SHM_SIZE="10g"

function print_help {
    echo "Parameters:"
    echo "  IMAGE name \"$IMAGE_NAME\" [-n, --image-name]"
    echo "  IMAGE folder \"$IMAGE_FOLDER\" [-f, --image-folder]"
    echo "  GPU enabled ($GPU) [--no-gpu]"
    echo "  OPTIRUN disabled ($OPTIRUN) [--with-optirun]"
    echo "  BUILD: $BUILD [-b, --build], with CACHE: $CACHE [-c, --cache]"
    echo "  RUN: $RUN [-r, --run], with SCREEN: $SCREEN [-s, --screen]"
    echo "  EXEC: $EXEC [-e, --exec], CONTAINER: \"$CONTAINER\" (use docker ps for the id)"
    echo "  COMMAND: \"$COMMAND\" [--cmd]"
    echo "  EXP: \"$EXP\" [--exp]"
    echo "  DETACH: ($DETACH) [--detach]"
    echo "  DEVELOPMENT dir \"$DEVEL_DIR\" [-d, --devel]"
    echo "  LEARNING dir \"$LEARN_DIR\" [-l, --learn]"
    echo "  SHM_SIZE \"$SHM_SIZE\" [--shm-size]"
}

for arg in "$@"
do
    case $arg in ## -l=*|--lib=*) DIR="${i#*=}" is the way to retrieve the parameter
        -n=*|--image-name=*)
        IMAGE_NAME="${arg#*=}"
        ;;
        -f=*|--image-folder=*)
        IMAGE_FOLDER="${arg#*=}"
        ;;
        --no-gpu)
        GPU=false
        GPU_OPT=""
        ;;
        --with-optirun)
        OPTIRUN=true
        OPTIRUN_OPT="optirun"
        ;;
        --detach)
        DETACH=true
        ;;
        -b|--build)
        BUILD=true
        ;;
        -c|--cache) # it does nothing without BUILD=true
        CACHE=true
        ;;
        -r|--run)
        RUN=true
        ;;
        -s|--screen) # it does nothing without RUN=true
        SCREEN=true
        ;;
        -e=*|--exec=*) # it works only with RUN=false
        EXEC=true
        CONTAINER="${arg#*=}"
        ;;
        --cmd=*)
        COMMAND="${arg#*=}"
        ;;
        --exp=*)
        EXP="${arg#*=}"
        ;;
        -d=*|--devel=*)
        DEVEL_DIR="${arg#*=}"
        ;;
        -l=*|--learn=*)
        LEARN_DIR="${arg#*=}"
        ;;
        --shm-size=*)
        SHM_SIZE="${arg#*=}"
        ;;
        *)
        # unknown option
        echo "Unknown option \"$arg\""
        print_help
        exit
        ;;
    esac
done

print_help

# Tensorflow original image
# docker run -u $(id -u):$(id -g) --gpus all -it --rm tensorflow/tensorflow:latest-gpu-py3 bash

## Building the docker image
if [[ "$BUILD" = true ]]; then
    if [[ "$CACHE" = true ]]; then
        echo "Building the docker container using the cache, if present."
        $OPTIRUN_OPT docker build \
            --build-arg USER_ID=$(id -u ${USER}) \
            --build-arg GROUP_ID=$(id -g ${USER}) \
            -t "$IMAGE_NAME" "$IMAGE_FOLDER"
    else
        echo "Building the docker container ignoring the cache, even if present."
        $OPTIRUN_OPT docker build \
            --build-arg USER_ID=$(id -u ${USER}) \
            --build-arg GROUP_ID=$(id -g ${USER}) \
            --no-cache -t "$IMAGE_NAME" "$IMAGE_FOLDER"
    fi
fi

if [[ "$RUN" = true ]]; then
    # My docker build
    MOUNT_DEVEL=""
    if [[ $DEVEL_DIR ]]; then
        MOUNT_DEVEL="--mount src=$DEVEL_DIR,target=/home/alice/devel,type=bind"
    fi
    MOUNT_LEARN=""
    if [[ $LEARN_DIR ]]; then
        MOUNT_LEARN="--mount src=$LEARN_DIR,target=/home/alice/learning,type=bind"
    fi
    CONT_NAME=""
    if [[ $EXP ]]; then
        CONT_NAME="--name $EXP"
    fi
    if [[ "$DETACH" = true ]]; then
        DETACH="-d"
    else
        DETACH=""
    fi
    CURR_UID=$(id -u)
    CURR_GID=$(id -g)

    RUN_OPT="-u $CURR_UID:$CURR_GID --net=host --env DISPLAY=$DISPLAY \
            --volume $XAUTHORITY:/home/alice/.Xauthority \
            --volume /tmp/.X11-unix:/tmp/.X11-unix \
            --privileged $MOUNT_DEVEL $MOUNT_LEARN \
            --shm-size $SHM_SIZE $GPU_OPT $CONT_NAME \
            -it $DETACH --rm $IMAGE_NAME:latest"
    echo "$OPTIRUN_OPT docker run $RUN_OPT $COMMAND"

    ## Running docker
    if [[ "$SCREEN" = true ]]; then
        echo "Running the docker in a screen session."
        screen -d -m \
            $OPTIRUN_OPT docker run $RUN_OPT $COMMAND
    else
        $OPTIRUN_OPT docker run $RUN_OPT $COMMAND
    fi
else
    if [[ "$EXEC" = true ]]; then
        echo "Attaching to a running docker (see container id using 'docker ps')."
        $OPTIRUN_OPT docker exec -it "$CONTAINER" /bin/bash
    fi
fi

and

Run in a terminal: bash ~/rllibsumodocker/docker-cmd-linux.sh --image-name=tf-gpu-sumo --build --cache --run --cmd="/bin/bash" --devel="/home/**youruser**/rllibsumodocker/docker-image/devel" --shm-size="10g"

After running this, you should be inside the Docker container. cd into /home/alice/devel/rllibsumoutils/pheromone-RL/pheromone-PPO

and run: ./training_script.sh -nv=1000 -nl=3 -nz=10 -ls=5000 -ec=0.3 -df=0.5 -nb=1 -bls=500 -blp=2750 -bll=0 -bdur='END' -puf=1 --fast=true --nopolicy=false -mbt

I have tried my best to provide a reproduction script. Let me know if there are any issues. Thank you again for your help and support.

Hello, I have managed to run py-spy for a short period on one of my runs. The speedscope flamegraph, which I am unable to interpret even with the instructions, looks like this:

However, it mentions get_observation() and _process_observations() within the step() function?

I have the profile.json file but am unable to attach it here.

Hello, is the repro script good enough? Is any other information needed?

Sorry, I had no time to look at this the past week.
I’m on to reproducing this now. In the meantime: I’d generally advise using the official Ray Docker images from Docker Hub.

Hello,

sorry, I have never run this outside Docker, in order to stay as close as possible to the environment my peers use. When I first installed Docker, I believe there were some post-installation steps for Linux that involved creating a docker group. Maybe this is the source of that error? Post-installation steps for Linux | Docker Documentation

Thank you.

Sincerely,
Hriday.

Please provide only a repro script, or at most the Docker image you are using. There is just too much user code in the repro environment you provided for me to debug this. It also uses environment variables such as my user ID, which is very much off limits for a repro script.
The repro you provided goes well beyond what I would consider a minimal reproduction. If you can reproduce the issue without a container, or within an official Ray container with a script, I’ll have another look. Sorry.
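For reference, a hedged sketch of what such a minimal script could look like. This is an assumption, not the author's code: CartPole-v1 stands in for the custom SUMO environment so that it would run inside an official Ray image with no user code.

```python
# Hypothetical minimal repro sketch for the resume issue: builds the
# tune.run() arguments with CartPole-v1 standing in for the custom SUMO
# environment. Run once with resume=False to create the experiment
# directory, then again with resume=True to exercise the failing path.
repro_kwargs = dict(
    run_or_experiment="PPO",
    name="resume-repro",
    config={"env": "CartPole-v1", "framework": "tf", "num_workers": 0},
    stop={"training_iteration": 2},
    resume=True,        # the setting from the thread title
    fail_fast="raise",  # re-raise the real exception for a readable traceback
)

# With Ray installed (pip install "ray[rllib]==2.0.0") this would be run as:
#   from ray import tune
#   tune.run(**repro_kwargs)
print(sorted(repro_kwargs))
```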