I am once again trying to get Ray working on a cluster I use. I want to verify basic functionality first, and I can't even get this "getting started" tutorial to run.
I've dumped a copy of the script I'm running at the bottom of this post. The cluster uses Slurm to provision resources, and I want to walk before I run, so I'm starting on my MacBook.
First, I make a fresh conda environment:
conda create --name=raytune python=3.11
/path/to/conda/env/bin/pip install -U "ray[air]"
/path/to/conda/env/bin/pip install torch torchvision torchaudio
All of this goes off without a hitch. Python is version 3.11.5, ray is version 2.6.3, and torch is version 2.0.1.
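As an aside, an even smaller smoke test that exercises only the Ray runtime (no Tune, no Torch) might help isolate the problem; a minimal sketch:

# Minimal Ray smoke test: start a local instance and run one remote task,
# the same thing Tune does implicitly before any training starts.
import ray

ray.init()

@ray.remote
def ping():
    return "pong"

print(ray.get(ping.remote()))  # should print "pong" if the runtime is healthy
ray.shutdown()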
The first time I ran it (python getting-started.py), I saw this:
CUDA is available in pytorch: False
<messages about downloading the MNIST data>
2023-09-12 14:59:11,572 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
[2023-09-12 14:59:42,795 E 21473 4607747] core_worker.cc:201: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
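The error points at the raylet, and Ray writes per-component logs (raylet.out, gcs_server.out, etc.) under its session directory, which by default should be:

ls /tmp/ray/session_latest/logs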
The next time I ran it, I saw the following:
CUDA is available in pytorch: False
2023-09-12 15:16:12,792 ERROR node.py:605 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-09-12 15:16:19,767 ERROR node.py:605 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
^CTraceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2120, in ray._raylet._auto_reconnect.wrapper
  File "python/ray/_raylet.pyx", line 2185, in ray._raylet.GcsClient.internal_kv_get
  File "python/ray/_raylet.pyx", line 410, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/foshea/Documents/Projects/raytune/getting-started.py", line 130, in <module>
    results = tuner.fit()
              ^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/tuner.py", line 347, in fit
    return self._local_tuner.fit()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
    analysis = self._fit_internal(trainable, param_space)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 703, in _fit_internal
    analysis = run(
               ^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/tune.py", line 573, in run
    _ray_auto_init(entrypoint=error_message_map["entrypoint"])
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/tune/tune.py", line 225, in _ray_auto_init
    ray.init()
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/worker.py", line 1514, in init
    _global_node = ray._private.node.Node(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/node.py", line 287, in __init__
    self.start_head_processes()
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/node.py", line 1160, in start_head_processes
    self.start_gcs_server()
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/node.py", line 992, in start_gcs_server
    self._init_gcs_client()
  File "/opt/anaconda3/envs/raytune/lib/python3.11/site-packages/ray/_private/node.py", line 605, in _init_gcs_client
    client.internal_kv_get(b"dummy", None)
  File "python/ray/_raylet.pyx", line 2140, in ray._raylet._auto_reconnect.wrapper
KeyboardInterrupt
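One thing I plan to try, to separate GCS startup from the script itself: start the head node manually in a terminal and have the script attach to it (this is the same ray.init(address="auto") call the tutorial's commented-out line mentions):

ray start --head

and then in the script, before creating the Tuner:

import ray
ray.init(address="auto")  # attach to the already-running local head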
What happens if I run this on a cluster (with a GPU)? This is the third run; I got slightly different errors each time:
CUDA is available in pytorch: True
2023-09-12 14:34:40,342 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2023-09-12 14:34:49,564 INFO tune.py:226 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2023-09-12 14:34:49,579 INFO tune.py:666 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
(pid=44135) [2023-09-12 14:34:50,301 E 44135 44390] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
[2023-09-12 14:34:50,447 E 42180 44126] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
╭────────────────────────────────────────────────────────────────────╮
│ Configuration for experiment train_mnist_2023-09-12_14-34-28 │
├────────────────────────────────────────────────────────────────────┤
│ Search algorithm BasicVariantGenerator │
│ Scheduler FIFOScheduler │
│ Number of trials 1 │
╰────────────────────────────────────────────────────────────────────╯
It seems like the tutorial is missing some critical step for getting Ray to work. Any suggestions?
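One guess for the cluster run: the "Resource temporarily unavailable [system:11]" messages look like thread-creation failures, so explicitly capping what Ray grabs at init, instead of letting it autodetect the whole node (which under Slurm can exceed the job's actual allocation), might matter. A sketch of what I mean, with the num_cpus value being just a placeholder:

import ray
# Placeholder cap; pick a value within the Slurm job's actual CPU allocation
# rather than letting Ray autodetect all cores on the node.
ray.init(num_cpus=4)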
The script:
# https://docs.ray.io/en/latest/tune/getting-started.html
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F

from ray import air, tune
from ray.air import session, RunConfig
from ray.tune.search import ConcurrencyLimiter
from ray.tune.schedulers import ASHAScheduler

DATA_DIR = '/Users/foshea/Documents/Projects/raytune/data'
STORAGE_DIR = '/Users/foshea/Documents/Projects/raytune/ray_results'


class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # In this example, we don't change the model architecture
        # due to simplicity.
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)


# Change these values if you want the training to run quicker or slower.
EPOCH_SIZE = 512
TEST_SIZE = 256


def train(model, optimizer, train_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # We set this just for the example to run quickly.
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()


def test(model, data_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            # We set this just for the example to run quickly.
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
    return correct / total


def train_mnist(config):
    # Data Setup
    mnist_transforms = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.1307, ), (0.3081, ))])

    train_loader = DataLoader(
        datasets.MNIST(DATA_DIR, train=True, download=True, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)
    test_loader = DataLoader(
        datasets.MNIST(DATA_DIR, train=False, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = ConvNet()
    model.to(device)

    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])

    for i in range(10):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)

        # Send the current training result back to Tune
        session.report({"mean_accuracy": acc})

        if i % 5 == 0:
            # This saves the model to the trial directory
            torch.save(model.state_dict(), "./model.pth")


if __name__ == "__main__":
    search_space = {
        "lr": tune.sample_from(lambda spec: 10 ** (-10 * np.random.rand())),
        "momentum": tune.uniform(0.1, 0.9),
    }
    print('CUDA is available in pytorch:', torch.cuda.is_available())

    # Uncomment this to enable distributed execution
    # `ray.init(address="auto")`

    # Download the dataset first
    datasets.MNIST(DATA_DIR, train=True, download=True)

    tuner = tune.Tuner(
        train_mnist,
        param_space=search_space,
        run_config=RunConfig(storage_path=STORAGE_DIR),
        # tune_config=tune.TuneConfig(max_concurrent_trials=1)
    )
    results = tuner.fit()

    dfs = {result.log_dir: result.metrics_dataframe for result in results}
    [d.mean_accuracy.plot() for d in dfs.values()]