Problem using more than 10-15 cpus on a server

I have a very simple Ray Tune program (Python API) that I cannot get to run consistently on a shared server. The program usually runs fine until I set num_cpus greater than about 10-15. The server is interactive and allocates CPUs dynamically as needed (no SLURM or job submission), which might be part of the issue. The failure mode is inconsistent: different errors kill the program, and sometimes it completes without error.

Any recommendations to fix what is going wrong here, or is this entirely on the administrator side? Perhaps there are different ways that I could initialize Ray? Thank you.

More details below.

The basic code is this:

import ray
from ray import tune
import psutil

# `num_cpus` is the number of CPUs that the program tries to allocate in advance.
# This can limit the resources available, e.g. when using a SLURM partition
num_cpus = 15
ray.init(num_cpus=num_cpus, _temp_dir=None)

# Trivial trainable: report which CPU core the trial ran on.
trainable = lambda x: print("hello from cpu:", psutil.Process().cpu_num())

# `max_concurrent_trials` caps how many trials run at once; each trial requests
# 1 CPU by default, so concurrency is also bounded by `num_cpus`
tune_config = tune.TuneConfig(
    num_samples=40,
    max_concurrent_trials=40,
    trial_dirname_creator=lambda x: "scratchwork",
)
config = {"a": tune.uniform(0, 1)}

tuner = tune.Tuner(
    trainable,
    param_space=config,
    tune_config=tune_config,
)
tuner.fit()

and results in an error:

2025-01-23 15:35:30,063 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
╭───────────────────────────────────────────────────────────────╮
│ Configuration for experiment     lambda_2025-01-23_15-35-30   │
├───────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator        │
│ Scheduler                        FIFOScheduler                │
│ Number of trials                 4                            │
╰───────────────────────────────────────────────────────────────╯

View detailed results here: /u/e6peters/ray_results/lambda_2025-01-23_15-35-30
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2025-01-23_15-35-26_816602_1860785/artifacts/2025-01-23_15-35-30/lambda_2025-01-23_15-35-30/driver_artifacts`
2025-01-23 15:35:33,013 INFO trial.py:182 -- Creating a new dirname scratchwork_1239 because trial dirname 'scratchwork' already exists.
2025-01-23 15:35:33,019 INFO trial.py:182 -- Creating a new dirname scratchwork_dd35 because trial dirname 'scratchwork' already exists.
2025-01-23 15:35:33,024 INFO trial.py:182 -- Creating a new dirname scratchwork_5275 because trial dirname 'scratchwork' already exists.

Trial status: 4 PENDING
Current time: 2025-01-23 15:35:33. Total running time: 0s
Logical resource usage: 0/15 CPUs, 0/0 GPUs
╭──────────────────────────────────────────╮
│ Trial name           status            a │
├──────────────────────────────────────────┤
│ lambda_967db_00000   PENDING    0.208699 │
│ lambda_967db_00001   PENDING    0.450595 │
│ lambda_967db_00002   PENDING    0.528995 │
│ lambda_967db_00003   PENDING    0.399411 │
╰──────────────────────────────────────────╯
[2025-01-23 15:35:33,897 E 1860785 1861242] logging.cc:108: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
[2025-01-23 15:35:34,009 E 1860785 1861242] logging.cc:115: Stack trace: 
 /u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x10b1bca) [0x7f897fb8ebca] ray::operator<<()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x10b4e52) [0x7f897fb91e52] ray::TerminateHandler()
/u/e6peters/.conda/envs/autobots/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f897e95135a] __cxxabiv1::__terminate()
/u/e6peters/.conda/envs/autobots/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7f897e9513c5]
/u/e6peters/.conda/envs/autobots/bin/../lib/libstdc++.so.6(+0xb1658) [0x7f897e951658]
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x5f25f0) [0x7f897f0cf5f0] boost::throw_exception<>()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x112a55b) [0x7f897fc0755b] boost::asio::detail::do_throw_error()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x112af7b) [0x7f897fc07f7b] boost::asio::detail::posix_thread::start_thread()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x112b3dc) [0x7f897fc083dc] boost::asio::thread_pool::thread_pool()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xb4a524) [0x7f897f627524] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7f897f6275b9] ray::rpc::GetServerCallExecutor()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ_+0xe9) [0x7f897f31f999] std::_Function_handler<>::_M_invoke()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x899) [0x7f897f359929] ray::core::CoreWorker::HandleGetCoreWorkerStats()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ServerCallImplINS0_24CoreWorkerServiceHandlerENS0_25GetCoreWorkerStatsRequestENS0_23GetCoreWorkerStatsReplyELNS0_8AuthTypeE0EE17HandleRequestImplEb+0x104) [0x7f897f3403e4] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xb5986c) [0x7f897f63686c] EventTracker::RecordExecution()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xb5544e) [0x7f897f63244e] std::_Function_handler<>::_M_invoke()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xb558c6) [0x7f897f6328c6] boost::asio::detail::completion_handler<>::do_complete()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x1127beb) [0x7f897fc04beb] boost::asio::detail::scheduler::do_run_one()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x1129569) [0x7f897fc06569] boost::asio::detail::scheduler::run()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0x1129c72) [0x7f897fc06c72] boost::asio::io_context::run()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xcd) [0x7f897f29627d] ray::core::CoreWorker::RunIOService()
/u/e6peters/.conda/envs/autobots/lib/python3.11/site-packages/ray/_raylet.so(+0xc0a8b0) [0x7f897f6e78b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8980b43ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f8980bd5850]

*** SIGABRT received at time=1737664534 on cpu 89 ***
PC: @     0x7f8980b459fc  (unknown)  pthread_kill
    @     0x7f8980af1520  (unknown)  (unknown)
[2025-01-23 15:35:34,010 E 1860785 1861242] logging.cc:440: *** SIGABRT received at time=1737664534 on cpu 89 ***
[2025-01-23 15:35:34,010 E 1860785 1861242] logging.cc:440: PC: @     0x7f8980b459fc  (unknown)  pthread_kill
[2025-01-23 15:35:34,010 E 1860785 1861242] logging.cc:440:     @     0x7f8980af1520  (unknown)  (unknown)
Fatal Python error: Aborted

Hello! Welcome to the Ray community, Evan! 🙂

There are a few things we can do when it comes to resource allocation, and the Ray docs on resources and Tune parallelism are worth a look.

That being said, I do have a few ideas that might help narrow down why this is happening:

  1. Check System Limits: Use `ulimit -a` to check the system’s resource limits. You might be hitting a limit on the number of threads or processes that can be created. Adjust these limits if possible.
  2. Set OMP_NUM_THREADS: If you encounter errors like "Resource temporarily unavailable", try setting `OMP_NUM_THREADS=1`. This can help if the issue is related to the number of threads being created by underlying libraries.
  3. Reduce Concurrency: Lower `max_concurrent_trials` to see if it helps. This can prevent overloading the system with too many concurrent tasks.
  4. Resource Allocation: Ensure that the resources requested by Ray Tune do not exceed what is actually available on the server. You might need to set `num_cpus` from the CPUs your process is actually allowed to use instead of hardcoding it to 15 (I'm not sure how else you'd allocate it dynamically there; see the sketch after this list for one option).
  5. Debugging Tools: Use `ray stack`, `ray timeline`, and `ray memory` to debug unexpected hangs or performance issues.
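
Putting items 1, 2, and 4 together, here is a minimal sketch of how a more defensive initialization might look. The two-CPU headroom and capping `max_concurrent_trials` at the CPU budget are my own assumptions rather than anything Ray requires, and `os.sched_getaffinity` is Linux-only:

import os
import resource

# Limit threads spawned by numerical libraries (set before heavy imports).
os.environ["OMP_NUM_THREADS"] = "1"

import ray
from ray import tune

# Inspect the per-user process/thread limit (same information as `ulimit -u`).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC soft/hard:", soft, hard)

# Use the CPUs this process is actually allowed to run on (Linux-only call)
# instead of a hardcoded count, and leave a little headroom.
available_cpus = len(os.sched_getaffinity(0))
num_cpus = max(1, available_cpus - 2)

ray.init(num_cpus=num_cpus)

# Cap concurrency at the CPU budget rather than at the number of samples.
tune_config = tune.TuneConfig(
    num_samples=40,
    max_concurrent_trials=num_cpus,
)

The idea is simply that Ray never tries to schedule more concurrent trials (and therefore worker processes and their threads) than the CPUs and process limits your session actually has.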

Do you have any other examples of the errors that occur when it dies? Are there common ones, or are they different every time?