I have requested 32 CPU cores and 1 GPU on a SLURM-based cluster, and I am trying to run Ray Tune with the ASHA scheduler and PyTorch Lightning. What should the value of `cpus_per_trial` be? My assumption is that with 32 cores available, I want to dedicate 4 CPU cores to each trial, so the number of samples/trials should be 8. Can someone confirm this?
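For reference, this is roughly how I am configuring the run (simplified; `train_fn` stands in for my actual PyTorch Lightning training function):

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

cpus_per_trial = 4  # 32 cores / 8 trials -- is this the right value?

analysis = tune.run(
    train_fn,  # my PyTorch Lightning training function (sketched at the end)
    resources_per_trial={"cpu": cpus_per_trial, "gpu": 1},  # not sure how the single GPU should be shared across trials
    num_samples=8,
    scheduler=ASHAScheduler(metric="loss", mode="min"),
)
```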
I am getting the following errors:
2023-02-10 06:29:27,220 INFO worker.py:1538 -- Started a local Ray instance.
(raylet) [2023-02-10 06:29:33,884 E 44690 44699] logging.cc:97: Unhandled exception: St12system_error. what(): Resource temporarily unavailable
(raylet) [2023-02-10 06:29:33,924 E 44690 44699] logging.cc:104: Stack trace:
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(+0xcf00ea) [0x1555539920ea] ray::operator<<()
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(+0xcf28a8) [0x1555539948a8] ray::TerminateHandler()
(raylet) conda/envs/raytune/bin/../lib/libstdc++.so.6(+0xb0524) [0x155552b9e524] __cxxabiv1::__terminate()
(raylet) conda/envs/raytune/bin/../lib/libstdc++.so.6(+0xb0576) [0x155552b9e576] __cxxabiv1::__unexpected()
(raylet) conda/envs/raytune/bin/../lib/libstdc++.so.6(__cxa_current_exception_type+0) [0x155552b9e7b4] __cxa_current_exception_type
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(_ZNSt6vectorISt6threadSaIS0_EE17_M_realloc_insertIJMN3ray3rpc17ClientCallManagerEFviEPS6_RiEEEvN9__gnu_cxx17__normal_iteratorIPS0_S2_EEDpOT_+0x200) [0x15555324fde0] std::vector<>::_M_realloc_insert<>()
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray3rpc17ClientCallManagerC2ER23instrumented_io_contextil+0x222) [0x1555532576a2] ray::rpc::ClientCallManager::ClientCallManager()
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(+0x6408f3) [0x1555532e28f3] ray::core::CoreWorkerProcessImpl::InitializeSystemConfig()::{lambda()#1}::operator()()
(raylet) conda/envs/raytune/lib/python3.8/site-packages/ray/_raylet.so(+0xe35fd0) [0x155553ad7fd0] execute_native_thread_routine
(raylet) /lib64/libpthread.so.0(+0x8539) [0x155555117539] start_thread
(raylet) /lib64/libc.so.6(clone+0x3f) [0x155554508cff] clone
(raylet)
(raylet) *** SIGABRT received at time=1676039373 on cpu 52 ***
...
My Python script is adapted from the "Using PyTorch Lightning with Tune" tutorial. When run on a local CPU-only machine, it works perfectly, although in that case I have to scale down my requirements (fewer trials, fewer epochs, etc.). On the cluster with a GPU, it fails with the error above.
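For completeness, my trainable follows the tutorial's structure (simplified here; `MyLightningModule` is a placeholder for my actual model):

```python
import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCallback

def train_fn(config):
    # Build the model from the trial's hyperparameter config
    model = MyLightningModule(config)  # placeholder for my actual LightningModule
    trainer = pl.Trainer(
        max_epochs=10,
        accelerator="gpu",
        devices=1,
        enable_progress_bar=False,
        # Report the validation loss back to Tune after each validation epoch
        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
    )
    trainer.fit(model)
```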