I am once again trying to get Ray working on a cluster I use. I want to verify basic functionality first, and I can't even get this "getting started" tutorial to run.
I've dumped a copy of the script I'm running at the bottom of this post. The cluster uses Slurm to provision resources, and I want to walk before I run, so I'm starting on my MacBook.
First, I make a fresh conda environment:
conda create --name=raytune python=3.11
/path/to/conda/env/bin/pip install -U "ray[air]"
/path/to/conda/env/bin/pip install torch torchvision torchaudio
All of this goes off without a hitch. Python is version 3.11.5, ray is version 2.6.3, and torch is version 2.0.1.
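As an aside, an even smaller smoke test that exercises only the Ray runtime (no Tune, no Torch) might help isolate the problem; a minimal sketch:

# Minimal Ray smoke test: start a local instance and run one remote task,
# the same thing Tune does implicitly before any training starts.
import ray

ray.init()

@ray.remote
def ping():
    return "pong"

print(ray.get(ping.remote()))  # should print "pong" if the runtime is healthy
ray.shutdown()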
The first time I ran it (python getting-started.py), I saw this:
CUDA is available in pytorch: False
<messages about downloading the MNIST data>
2023-09-12 14:59:11,572 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
[2023-09-12 14:59:42,795 E 21473 4607747] core_worker.cc:201: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
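The error points at the raylet, and Ray writes per-component logs (raylet.out, gcs_server.out, etc.) under its session directory, which by default should be:

ls /tmp/ray/session_latest/logs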
The next time I ran it, I saw the following:
CUDA is available in pytorch: False
2023-09-12 15:16:12,792 ERROR node.py:605 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-09-12 15:16:19,767 ERROR node.py:605 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
^CTraceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2120, in ray._raylet._auto_reconnect.wrapper
  File "python/ray/_raylet.pyx", line 2185, in ray._raylet.GcsClient.internal_kv_get
  File "python/ray/_raylet.pyx", line 410, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/foshea/Documents/Projects/raytune/getting-started.py", line 130, in <module>
    results = tuner.fit()
              ^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/tuner.py", line 347, in fit
    return self._local_tuner.fit()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
    analysis = self._fit_internal(trainable, param_space)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 703, in _fit_internal
    analysis = run(
               ^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/tune.py", line 573, in run
    _ray_auto_init(entrypoint=error_message_map["entrypoint"])
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/tune.py", line 225, in _ray_auto_init
    ray.init()
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/worker.py", line 1514, in init
    _global_node = ray._private.node.Node(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/node.py", line 287, in __init__
    self.start_head_processes()
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/node.py", line 1160, in start_head_processes
    self.start_gcs_server()
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/node.py", line 992, in start_gcs_server
    self._init_gcs_client()
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/node.py", line 605, in _init_gcs_client
    client.internal_kv_get(b"dummy", None)
  File "python/ray/_raylet.pyx", line 2140, in ray._raylet._auto_reconnect.wrapper
KeyboardInterrupt
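One thing I plan to try, to separate GCS startup from the script itself: start the head node manually in a terminal and have the script attach to it (this is the same ray.init(address="auto") call the tutorial's commented-out line mentions):

ray start --head

and then in the script, before creating the Tuner:

import ray
ray.init(address="auto")  # attach to the already-running local head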
What happens if I run this on a cluster (with a GPU)? This is the third run; I got slightly different errors each time:
CUDA is available in pytorch: True
2023-09-12 14:34:40,342 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2023-09-12 14:34:49,564 INFO tune.py:226 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2023-09-12 14:34:49,579 INFO tune.py:666 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
(pid=44135) [2023-09-12 14:34:50,301 E 44135 44390] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
[2023-09-12 14:34:50,447 E 42180 44126] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment train_mnist_2023-09-12_14-34-28 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
It seems like the tutorial is missing some critical step for getting Ray to work. Any suggestions?
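One guess for the cluster run: the "Resource temporarily unavailable [system:11]" messages look like thread-creation failures, so explicitly capping what Ray grabs at init, instead of letting it autodetect the whole node (which under Slurm can exceed the job's actual allocation), might matter. A sketch of what I mean, with the num_cpus value being just a placeholder:

import ray
# Placeholder cap; pick a value within the Slurm job's actual CPU allocation
# rather than letting Ray autodetect all cores on the node.
ray.init(num_cpus=4)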
The script:
# https://docs.ray.io/en/latest/tune/getting-started.html
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F

from ray import air, tune
from ray.air import session, RunConfig
from ray.tune.search import ConcurrencyLimiter
from ray.tune.schedulers import ASHAScheduler

DATA_DIR = '/Users/foshea/Documents/Projects/raytune/data'
STORAGE_DIR = '/Users/foshea/Documents/Projects/raytune/ray_results'


class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # In this example, we don't change the model architecture
        # due to simplicity.
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)


# Change these values if you want the training to run quicker or slower.
EPOCH_SIZE = 512
TEST_SIZE = 256


def train(model, optimizer, train_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # We set this just for the example to run quickly.
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()


def test(model, data_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            # We set this just for the example to run quickly.
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
    return correct / total


def train_mnist(config):
    # Data Setup
    mnist_transforms = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.1307, ), (0.3081, ))])

    train_loader = DataLoader(
        datasets.MNIST(DATA_DIR, train=True, download=True, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)
    test_loader = DataLoader(
        datasets.MNIST(DATA_DIR, train=False, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = ConvNet()
    model.to(device)

    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])

    for i in range(10):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)

        # Send the current training result back to Tune
        session.report({"mean_accuracy": acc})

        if i % 5 == 0:
            # This saves the model to the trial directory
            torch.save(model.state_dict(), "./model.pth")


if __name__ == "__main__":
    search_space = {
        "lr": tune.sample_from(lambda spec: 10 ** (-10 * np.random.rand())),
        "momentum": tune.uniform(0.1, 0.9),
    }
    print('CUDA is available in pytorch:', torch.cuda.is_available())

    # Uncomment this to enable distributed execution
    # `ray.init(address="auto")`

    # Download the dataset first
    datasets.MNIST(DATA_DIR, train=True, download=True)

    tuner = tune.Tuner(
        train_mnist,
        param_space=search_space,
        run_config=RunConfig(storage_path=STORAGE_DIR),
        # tune_config=tune.TuneConfig(max_concurrent_trials=1)
    )
    results = tuner.fit()

    dfs = {result.log_dir: result.metrics_dataframe for result in results}
    [d.mean_accuracy.plot() for d in dfs.values()]