The results are different on Windows and Ubuntu

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It causes significant difficulty in completing my task, but I can work around it.
  • High: It blocks me from completing my task.

I tried running the "getting started" example (https://docs.ray.io/en/latest/train/getting-started.html) on Linux, but I get different results than on Windows.
The output is as follows:
2023-04-09 12:00:51,004 WARNING worker.py:1829 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 655, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 696, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 662, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 666, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 613, in ray._raylet.execute_task.function_executor
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/ray/_private/function_manager.py", line 586, in temporary_actor_method
    raise RuntimeError(
RuntimeError: The actor with name TrainTrainable failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:

Traceback (most recent call last):
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/ray/_private/function_manager.py", line 625, in _load_actor_class_from_gcs
    actor_class = pickle.loads(pickled_class)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/storage.py", line 161, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I added the "_raylet.pyx" file manually, but the run still produces the same error as above.
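For reference, the fix the final error message suggests would look like the sketch below ("model.pt" is a placeholder path). Note that in this traceback the torch.load call happens inside Ray's pickle deserialization rather than in the example script, so it cannot simply be applied here:

import torch

# Remap CUDA-saved storages onto the CPU so they can be deserialized
# on a machine where torch.cuda.is_available() is False.
# "model.pt" is a placeholder checkpoint path.
state = torch.load("model.pt", map_location=torch.device("cpu"))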

What do you mean by different results? Did either of them run successfully? The traceback seems to be complaining about a dependency error.

It is running successfully on Windows.

It seems like you have CUDA available on Linux. Have you set use_gpu=True in scaling_config?
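It should look something like this (just a sketch; train_func is a placeholder for whatever training function you pass in):

from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

# use_gpu=True assigns each worker a GPU; num_workers=4 is illustrative.
trainer = TorchTrainer(
    train_func,  # placeholder for your training function
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)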

Yes, I have set it up.

I think there is some issue with the current "getting started" code.

Could you try the following script and let me know if it works?

import torch
import torch.nn as nn

num_samples = 20
input_size = 10
layer_size = 15
output_size = 5

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))

# In this example we use a randomly generated dataset.
input = torch.randn(num_samples, input_size)
labels = torch.randn(num_samples, output_size)

import torch.optim as optim

from ray import train

def train_func_distributed():
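    # Move the data onto the device Ray Train assigns to this worker.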
    input_on_device = input.to(train.torch.get_device())
    labels_on_device = labels.to(train.torch.get_device())
    num_epochs = 3
    model = NeuralNetwork()
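    # prepare_model moves the model to the worker's device and wraps it
    # for distributed training.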
    model = train.torch.prepare_model(model)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(num_epochs):
        output = model(input_on_device)
        loss = loss_fn(output, labels_on_device)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"epoch: {epoch}, loss: {loss.item()}")

from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig


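# Run the training function on 4 workers, each using a GPU.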
trainer = TorchTrainer(
    train_func_distributed,
    scaling_config=ScalingConfig(
        num_workers=4, use_gpu=True)
)

results = trainer.fit()

Oh, thank you very much! Now I can run it and get the results.

Thanks for confirming it. I will update the code.