How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi all, I’m new to Ray, and I would like to use Ray Tune to find the best parameters for my neural network.
Things that work well before adding Ray Tune
I made a Python script nn.py featuring a train_and_test(args)
function, which accepts parameters like {"par1": 45, "par2": 64, ...}
and returns scores like this: {"metric0": 0.65468, "metric1": 35.1345, "metric2": 0.6125}.
Let’s call this part my inner loop; it just trains one model. I’m using the Slurm scheduler and DistributedDataParallel
to run it over several GPUs and nodes at once. This part works when I submit my batch script with the following details:
sbatch.sh
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 8
#SBATCH --gpus-per-node 8
#SBATCH --cpus-per-task 7
[...]
srun singularity exec [... bindings] $CONTAINER run-me.sh
run_me.sh
[...]
export MASTER_ADDR=$(python /workdir/get-master.py "$SLURM_NODELIST")
export MASTER_PORT=29500
export WORLD_SIZE=$SLURM_NPROCS
export RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID
python -u nn.py --hidden_dim 3 [... args]
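The /workdir/get-master.py helper itself isn’t shown in the post, but its job is presumably to turn the compressed $SLURM_NODELIST into a single master hostname. A minimal sketch of what such a helper might look like (the function name and the exact nodelist formats handled are assumptions, not the real script):

```python
# Hypothetical sketch of a get-master.py-style helper: pick the first
# hostname out of a Slurm nodelist such as "nid[0012-0013]" or
# "nodeA,nodeB". The real /workdir/get-master.py may differ.
import re
import sys

def first_node(nodelist: str) -> str:
    """Expand the first entry of a compressed Slurm nodelist."""
    # "nid[0012-0013]" -> prefix "nid" + first index "0012" -> "nid0012"
    m = re.match(r"([^\[,]+)\[(\d+)", nodelist)
    if m:
        return m.group(1) + m.group(2)
    # Plain comma-separated list: "nodeA,nodeB" -> "nodeA"
    return nodelist.split(",")[0]

if __name__ == "__main__":
    print(first_node(sys.argv[1]))
```

Note this only handles the two common nodelist shapes; real Slurm nodelists can also contain mixed ranges, which `scontrol show hostnames` expands more robustly.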
nn.py
[...]
def train_and_test(args):
    [...]
    return {"a": score_a, "b": score_b, "c": score_c}

if __name__ == "__main__":
    args = get_args()
    train_and_test(args)
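For context, here is a minimal sketch (an assumption, not the post’s actual code) of how the variables exported in run_me.sh typically get consumed on the Python side before DistributedDataParallel is set up:

```python
# Sketch: collect the rank/world-size environment variables that
# run_me.sh exports. The helper name is made up for illustration.
import os

def slurm_dist_env():
    """Read the distributed-training variables exported by run_me.sh."""
    return {
        "rank": int(os.environ.get("RANK", 0)),              # from $SLURM_PROCID
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # from $SLURM_LOCALID
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # from $SLURM_NPROCS
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", 29500)),
    }

# torch.distributed.init_process_group("nccl", rank=env["rank"],
# world_size=env["world_size"]) would normally consume these values
# before wrapping the model in DistributedDataParallel.
```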
Things that I imagine could work well with Ray Tune
Now, I would like to implement an outer loop to find the best parameters. Notice how I launch 8 tasks per node for the inner loop, each of which just calls my script with python? That works well as long as I set the right rank in DistributedDataParallel
(for that I use the $SLURM_PROCID
variable).
So my idea was something like this: let’s do it just the same way, except that in one of the processes I prepare the arguments with Ray Tune and distribute them with mpi4py; then everything will be nice and I’ll live happily ever after.
I tried to implement it the following way, but I’m still getting errors.
sbatch.sh - same as before
run_me.sh - same as before (except I don’t need to specify so many args anymore)
nn_raytune.py - the new nn.py, except with Ray Tune
[...]
from mpi4py import MPI
from ray import train, tune

def train_and_test(args):
    [...]
    return {"a": score_a, "b": score_b, "c": score_c}

def trainable(config):
    args = [...]  # parse config into args, typically file paths and a --reset option
    result = train_and_test(args)
    train.report(result)

if __name__ == "__main__":
    args = get_args()
    # new part here:
    comm = MPI.COMM_WORLD
    rank = comm.rank
    if rank == 0:
        config = {"hidden_dim": tune.randint(2, 4),
                  # [...]
                  }
        tuner = tune.Tuner(trainable,
                           param_space=config,
                           tune_config=tune.TuneConfig(num_samples=-1,
                                                       time_budget_s=600))
    else:
        tuner = None
    tuner = comm.bcast(tuner, root=0)
    analysis = tuner.fit()
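The "parse config into args" step inside trainable() is elided above; one way it could look is the sketch below, which flattens a Tune-sampled config dict into argv-style flags. Everything here (the function name, the flag convention) is a made-up illustration, not the real parsing code:

```python
# Hypothetical helper: turn a Ray Tune config dict like
# {"hidden_dim": 3, "reset": True} into CLI-style arguments
# that an argparse-based get_args() could consume.
def config_to_argv(config):
    """Flatten a config dict into a list of --key value tokens."""
    argv = []
    for key, value in config.items():
        if isinstance(value, bool):
            if value:
                argv.append(f"--{key}")  # boolean flags like --reset
        else:
            argv.extend([f"--{key}", str(value)])
    return argv
```

Inside trainable(), the result could then be fed to something like get_args(config_to_argv(config)), assuming get_args accepts an explicit argv list.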
What now?
Is this doable? I’m not sure whether I’m now dealing with just a technical issue (a wrong version of some tool), or whether the idea itself is doable at all.
I’m trying different versions of Ray Tune (pip-installed), because that’s what people on the internet often suggest when someone encounters this type of error:
Traceback (most recent call last):
File "/workdir/scripts/nn_raytune.py", line 337, in <module>
analysis = tuner.fit()
[...]
File "/users/username/.local/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 432, in converted_trainable
return self._converted_trainable
AttributeError: 'TunerInternal' object has no attribute '_converted_trainable'. Did you mean: 'converted_trainable'?
So, might it be the version-thingy, or am I missing something?