[I put this topic under “Ray Tune” as many Ray users that I know of started their Ray journey with that library. Feel free to change the category.]
People from various communities use Ray for different purposes, ranging from deep learning model training/tuning [rllib, tune, raysgd] and model serving [serve] to building customized distributed applications from scratch [core]. There is also a noticeable trend of applying Ray to large-scale data processing, which is drawing more data scientists to the project. Data scientists typically start their work in a Jupyter Notebook, but the onboarding experience there can unfortunately be unpleasant, partly due to a lack of documentation. I'm starting this thread to summarize the best practices I've gathered so far. Any comments or additions are welcome.
The discussion assumes the AWS ecosystem is being used, but the arguments themselves should be generic.
Ensure your instance has enough EBS volume – without it, on a Deep Learning AMI the pre-installed libraries and environment setup already consume ~76% of the disk before any Ray work, and a full disk will make your notebook fail. A kernel restart then loses the in-progress cell output, which hurts especially if you rely on it to track experiment progress (a quick check is sketched below). Related issue: Autoscaler should allow configuration of disk space and should use a larger default. · Issue #1376 · ray-project/ray · GitHub
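As a quick sanity check from inside the notebook (plain Python, nothing Ray-specific), you can verify how much of the disk is already consumed before kicking off any experiments:

```python
import shutil

# Check usage of the root volume before starting long-running work.
total, used, free = shutil.disk_usage("/")
print(f"Disk usage: {used / total:.0%} used, {free / 2**30:.1f} GiB free")
```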
Consider enabling the LRU fallback – IPython stores the output of every cell in a local Python variable indefinitely, which causes Ray to pin the corresponding objects even though your application may not actually be using them anymore. With the LRU fallback enabled, unused objects referenced only by IPython are LRU-evicted when the object store is full instead of raising an error. Relying on this is not recommended, though. Instead, if possible, remove references as soon as they are no longer needed so the space in the object store can be freed (see the sketch below).
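For that last point, a minimal sketch of explicitly dropping references so Ray's reference counting can free the objects (the array here is just a stand-in for real data):

```python
import numpy as np
import ray

ray.init()

# ray.put pins the array in the object store for as long as a reference exists.
big_ref = ray.put(np.zeros((1000, 1000)))
result = ray.get(big_ref)  # zero-copy view, also backed by the object store

# Once the data is no longer needed, drop the references so the space can be
# reclaimed, instead of relying on LRU eviction.
del big_ref, result
```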
Understand your node's responsibility – are you starting a Ray runtime locally on this EC2 instance, or are you going to use this EC2 instance as a cluster launcher? A Jupyter notebook is more suitable for the first scenario, whereas I found CLIs such as ray exec and ray submit handier for the second (a sketch of the first scenario follows).
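A sketch of the first scenario, where the notebook drives a Ray runtime on the same instance; the second scenario instead goes through the cluster launcher CLI (ray up / ray exec / ray submit) and needs no notebook on the launching machine:

```python
import ray

# First scenario (a): start a Ray runtime from this notebook's process.
ray.init()

# First scenario (b): attach to a runtime already started on this instance
# with `ray start --head` (uncomment instead of the call above).
# ray.init(address="auto")

print(ray.cluster_resources())
```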
Forward your ports – at minimum, two ports need forwarding imo: the Jupyter notebook port and the dashboard port. This may sound too obvious to existing Ray users, but it's actually not trivial for many data scientists. It'd be helpful to document the steps and include them in docs.ray.io; a typical command is sketched below.
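For reference, one way to forward both ports over SSH (the key path, user, and host are placeholders; 8888 is Jupyter's default port and 8265 is the Ray dashboard's default):

```bash
# Forward the Jupyter (8888) and Ray dashboard (8265) ports from the EC2 instance.
ssh -i ~/.ssh/my-key.pem \
    -L 8888:localhost:8888 \
    -L 8265:localhost:8265 \
    ubuntu@<ec2-public-dns>
```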
Understand cell output – separate the application-specific part from the core-specific part. Ray has improved A LOT on this over the past year, but if anything goes wrong, some core-level warnings/errors are likely to be printed due to Ray's distributed nature. This gets more intimidating if you see C++ logs that are not demangled. I would suggest a section in docs.ray.io that helps users quickly identify which layer an error comes from (information on this already exists but is more or less spread around).
Hi @annaluo676, thanks for starting this thread, this is awesome! We should definitely include this in docs.ray.io - I’ll circulate this with some people to see if they have anything to add!
If I could expand on point 2: explicitly calling print or repr is better than letting the notebook automatically generate the output, because the objects are not pinned if printing is their only use. For example, take a cell using Modin:
df.groupby("col1").max()
This stores the entire result in a dataframe, and neither Modin nor Ray has any way of knowing whether it is referenced by the Out variable or by some other external variable. Calling repr instead stores only a string (or HTML object) in the Out variable, which reduces the memory footprint of your workflow. Another option is to disable IPython output caching altogether from bash/zsh; one possible command is sketched below.
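A sketch of that (assuming the default IPython profile location; setting InteractiveShell.cache_size to 0 turns off the output cache):

```bash
# Make sure the default profile directory exists, then disable output caching.
ipython profile create
printf 'c = get_config()\nc.InteractiveShell.cache_size = 0\n' \
    >> ~/.ipython/profile_default/ipython_config.py
```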
Thank you both for the great explanation. Could you please confirm whether these guidelines/concerns/limitations still apply to the latest versions of Ray and Jupyter? I am very new to both, hence the question.
I also found this helpful, but I have another question. I’m running a jupyter notebook that submits jobs to a local ray cluster (ray start --head --port=6379).
However, each cell in the notebook gets submitted under the same 'JOB' in Ray. E.g., if I have two cells, they both get submitted to Ray successfully, but they show up under the same JOB ID and stay in the 'running' state. Ideally I'd like them to show up as different jobs, with each job finishing once its work completes. Any idea how to do that?
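Roughly what I'm doing, as a simplified sketch (the task is hypothetical, just for illustration):

```python
import ray

# Cell 1: the notebook kernel is a single driver; connecting it to the local
# cluster started with `ray start --head --port=6379` creates one job.
ray.init(address="auto")

@ray.remote
def work(x):  # hypothetical task
    return x * x

ray.get([work.remote(i) for i in range(10)])
```

```python
# Cell 2: more tasks from the same kernel end up under the same JOB ID, and the
# job stays "running" until the driver disconnects (e.g. ray.shutdown()).
ray.get([work.remote(i) for i in range(10, 20)])
```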