Running RayTune on Slurm Cluster in PyTorch Lightning

I have requested 32 CPU cores and 1 GPU on a Slurm-based cluster, and I am trying to run Ray Tune with the ASHA scheduler and PyTorch Lightning. What should the value of cpus_per_trial be? My assumption is that with 32 cores available I can dedicate 4 CPU cores to each trial, so the number of samples/trials should be 8. Can someone confirm this?
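
For reference, this is a minimal sketch of the resource setup I have in mind; `train_fn` and the search space below are placeholders rather than my actual Lightning training code, and I am assuming the Ray 2.x Tuner API:

```python
# Minimal sketch of the resource setup I have in mind (train_fn and the
# search space are placeholders, not my actual Lightning training code).
from ray import tune
from ray.air import session
from ray.tune.schedulers import ASHAScheduler


def train_fn(config):
    # Stand-in for the PyTorch Lightning training loop; just reports a metric.
    session.report({"loss": config["lr"]})


scheduler = ASHAScheduler(metric="loss", mode="min", max_t=10, grace_period=1)

tuner = tune.Tuner(
    # 4 CPUs per trial; 1/8 of the single GPU so all 8 trials can run at once.
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 0.125}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=8, scheduler=scheduler),
)
results = tuner.fit()
```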

I am getting the following errors:

2023-02-10 06:29:27,220 INFO worker.py:1538 -- Started a local Ray instance.                                                                                                                                                                                            
(raylet) [2023-02-10 06:29:33,884 E 44690 44699] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable                                                                                                                         
(raylet) [2023-02-10 06:29:33,924 E 44690 44699] logging.cc:104: Stack trace:                                                                                                                                                                                           
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(+0xcf00ea) [0x1555539920ea] ray::operator<<()                                                                                                                      
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(+0xcf28a8) [0x1555539948a8] ray::TerminateHandler()                                                                                                                 
(raylet) conda/envs/raytune/bin/../lib/libstdc++.so.6(+0xb0524) [0x155552b9e524] __cxxabiv1::__terminate()                                                                                                                                    
(raylet) conda/envs/raytune/bin/../lib/libstdc++.so.6(+0xb0576) [0x155552b9e576] __cxxabiv1::__unexpected()                                                                                                                                   
(raylet) conda/envs/raytune/bin/../lib/libstdc++.so.6(__cxa_current_exception_type+0) [0x155552b9e7b4] __cxa_current_exception_type                                                                                                           
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(_ZNSt6vectorISt6threadSaIS0_EE17_M_realloc_insertIJMN3ray3rpc17ClientCallManagerEFviEPS6_RiEEEvN9__gnu_cxx17__normal_iteratorIPS0_S2_EEDpOT_+0x200) [0x15555324fde0] std::vector<>::_M_realloc_insert<>()
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray3rpc17ClientCallManagerC2ER23instrumented_io_contextil+0x222) [0x1555532576a2] ray::rpc::ClientCallManager::ClientCallManager()                              
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(+0x6408f3) [0x1555532e28f3] ray::core::CoreWorkerProcessImpl::InitializeSystemConfig()::{lambda()#1}::operator()()                                                  
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(+0xe35fd0) [0x155553ad7fd0] execute_native_thread_routine                                                                                                           
(raylet) /lib64/libpthread.so.0(+0x8539) [0x155555117539] start_thread                                                                                                                                                                                                  
(raylet) /lib64/libc.so.6(clone+0x3f) [0x155554508cff] clone                                                                                                                                                                                                            
(raylet)                                                                                                                                                                                                                                                                
(raylet) *** SIGABRT received at time=1676039373 on cpu 52 ***
...

My Python script is adapted from the Using PyTorch Lightning with Tune tutorial; when run on a local CPU machine it works perfectly, although in that case I have to scale down my requirements (fewer trials, fewer epochs, etc.). On the cluster with a GPU it fails.
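
One thing I am not sure about is whether Ray is seeing the whole node rather than just my Slurm allocation. This is roughly how I initialize Ray inside the job; reading `SLURM_CPUS_PER_TASK` here is my own guess at the right approach, not something taken from the tutorial:

```python
# Rough sketch: start Ray restricted to my Slurm allocation
# (my own guess, not from the tutorial).
import os

import ray

# SLURM_CPUS_PER_TASK should reflect the 32 cores requested via --cpus-per-task.
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

ray.init(num_cpus=num_cpus, num_gpus=1)
```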

Copying over a response from another thread:

I’ve never run Ray on a Slurm-based cluster, but looking at the error message, it doesn’t look like a Tune issue. Could you maybe run a basic Ray script on the cluster just to make sure the basics work in that environment?
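
Something along these lines is what I mean by a basic script; the task body is arbitrary, it only checks that Ray can start and schedule work on the node:

```python
# Minimal Ray sanity check: start Ray, run a few remote tasks, print results.
import ray

ray.init()


@ray.remote
def square(x):
    return x * x


print(ray.get([square.remote(i) for i in range(5)]))
ray.shutdown()
```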

Have you given these a try?