I have a very simple Ray Tune program (Python API) that I cannot get to run consistently on a shared server. The program runs fine until I set num_cpus greater than about 10-15, at which point it starts failing intermittently. The server is interactive and allocates CPUs dynamically as needed (no SLURM or job submission), which might be part of the issue. The failure mode is inconsistent: different errors kill the program, and sometimes it runs without any error.
Any recommendation to fix what is going wrong here, or is this entirely on the administrator side? Perhaps there are different ways that I could initialize Ray? Thank you.
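For example, would something like this be a safer way to initialize, capping num_cpus at whatever the server actually lets this process run on? This is just a sketch of what I mean; I'm only guessing that cpu_affinity is the right way to detect the actual allocation.
import psutil
import ray

# Instead of hard-coding num_cpus = 15, cap it at the number of CPUs this
# process is actually allowed to run on (cpu_affinity is Linux-specific).
available_cpus = len(psutil.Process().cpu_affinity())
ray.init(num_cpus=min(15, available_cpus))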
More details below.
The basic code is this:
import ray
from ray import tune
import psutil
# `num_cpus` is the number of CPUs that the program tries to allocate in advance.
# This can limit the resources Ray will use, e.g. when running inside a SLURM partition.
num_cpus = 15
ray.init(num_cpus=num_cpus, _temp_dir=None)
trainable = lambda x: print("hello from cpu:", psutil.Process().cpu_num())
# `max_concurrent_trials` caps how many trials run at once (each trial uses one of `num_cpus` by default)
tune_config = tune.TuneConfig(
    num_samples=40,
    max_concurrent_trials=40,
    trial_dirname_creator=lambda x: "scratchwork",
)
config = {"a": tune.uniform(0, 1)}
tuner = tune.Tuner(
    trainable,
    param_space=config,
    tune_config=tune_config,
)
tuner.fit()
and it results in an error like the following:
2025-01-23 15:35:30,063 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
╭───────────────────────────────────────────────────────────────╮
│ Configuration for experiment lambda_2025-01-23_15-35-30 │
├───────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 4 │
╰───────────────────────────────────────────────────────────────╯
View detailed results here: /u/e6peters/ray_results/lambda_2025-01-23_15-35-30
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2025-01-23_15-35-26_816602_1860785/artifacts/2025-01-23_15-35-30/lambda_2025-01-23_15-35-30/driver_artifacts`
2025-01-23 15:35:33,013 INFO trial.py:182 -- Creating a new dirname scratchwork_1239 because trial dirname 'scratchwork' already exists.
2025-01-23 15:35:33,019 INFO trial.py:182 -- Creating a new dirname scratchwork_dd35 because trial dirname 'scratchwork' already exists.
2025-01-23 15:35:33,024 INFO trial.py:182 -- Creating a new dirname scratchwork_5275 because trial dirname 'scratchwork' already exists.
Trial status: 4 PENDING
Current time: 2025-01-23 15:35:33. Total running time: 0s
Logical resource usage: 0/15 CPUs, 0/0 GPUs
╭──────────────────────────────────────────╮
│ Trial name status a │
├──────────────────────────────────────────┤
│ lambda_967db_00000 PENDING 0.208699 │
│ lambda_967db_00001 PENDING 0.450595 │
│ lambda_967db_00002 PENDING 0.528995 │
│ lambda_967db_00003 PENDING 0.399411 │
╰──────────────────────────────────────────╯
[2025-01-23 15:35:33,897 E 1860785 1861242] logging.cc:108: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
[2025-01-23 15:35:34,009 E 1860785 1861242] logging.cc:115: Stack trace:
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x10b1bca) [0x7f897fb8ebca] ray::operator<<()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x10b4e52) [0x7f897fb91e52] ray::TerminateHandler()
/u/e6peters/.conda/envs/autobots/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f897e95135a] __cxxabiv1::__terminate()
/u/e6peters/.conda/envs/autobots/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7f897e9513c5]
/u/e6peters/.conda/envs/autobots/bin/../lib/libstdc++.so.6(+0xb1658) [0x7f897e951658]
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x5f25f0) [0x7f897f0cf5f0] boost::throw_exception<>()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x112a55b) [0x7f897fc0755b] boost::asio::detail::do_throw_error()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x112af7b) [0x7f897fc07f7b] boost::asio::detail::posix_thread::start_thread()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x112b3dc) [0x7f897fc083dc] boost::asio::thread_pool::thread_pool()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xb4a524) [0x7f897f627524] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7f897f6275b9] ray::rpc::GetServerCallExecutor()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ_+0xe9) [0x7f897f31f999] std::_Function_handler<>::_M_invoke()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x899) [0x7f897f359929] ray::core::CoreWorker::HandleGetCoreWorkerStats()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ServerCallImplINS0_24CoreWorkerServiceHandlerENS0_25GetCoreWorkerStatsRequestENS0_23GetCoreWorkerStatsReplyELNS0_8AuthTypeE0EE17HandleRequestImplEb+0x104) [0x7f897f3403e4] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xb5986c) [0x7f897f63686c] EventTracker::RecordExecution()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xb5544e) [0x7f897f63244e] std::_Function_handler<>::_M_invoke()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xb558c6) [0x7f897f6328c6] boost::asio::detail::completion_handler<>::do_complete()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x1127beb) [0x7f897fc04beb] boost::asio::detail::scheduler::do_run_one()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x1129569) [0x7f897fc06569] boost::asio::detail::scheduler::run()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x1129c72) [0x7f897fc06c72] boost::asio::io_context::run()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xcd) [0x7f897f29627d] ray::core::CoreWorker::RunIOService()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xc0a8b0) [0x7f897f6e78b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8980b43ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f8980bd5850]
*** SIGABRT received at time=1737664534 on cpu 89 ***
PC: @ 0x7f8980b459fc (unknown) pthread_kill
@ 0x7f8980af1520 (unknown) (unknown)
[2025-01-23 15:35:34,010 E 1860785 1861242] logging.cc:440: *** SIGABRT received at time=1737664534 on cpu 89 ***
[2025-01-23 15:35:34,010 E 1860785 1861242] logging.cc:440: PC: @ 0x7f8980b459fc (unknown) pthread_kill
[2025-01-23 15:35:34,010 E 1860785 1861242] logging.cc:440: @ 0x7f8980af1520 (unknown) (unknown)
Fatal Python error: Aborted