The results are different on Windows and Ubuntu

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It causes significant difficulty in completing my task, but I can work around it.
  • High: It blocks me from completing my task.

I tried running the "getting started" example (https://docs.ray.io/en/latest/train/getting-started.html) on Linux, but I get different results than on Windows.
The output is as follows:
2023-04-09 12:00:51,004 WARNING worker.py:1829 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 655, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 696, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 662, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 666, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 613, in ray._raylet.execute_task.function_executor
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/ray/_private/function_manager.py", line 674, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/ray/_private/function_manager.py", line 586, in temporary_actor_method
    raise RuntimeError(
RuntimeError: The actor with name TrainTrainable failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:

Traceback (most recent call last):
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/ray/_private/function_manager.py", line 625, in _load_actor_class_from_gcs
    actor_class = pickle.loads(pickled_class)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/storage.py", line 161, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/root/miniconda3/envs/pytorch1/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I added the "_raylet.pyx" file manually, but the run still produces the same error as above.
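For reference, the fix the final error message suggests would look like the sketch below ("model.pt" is a placeholder path). Note that in this traceback the torch.load call happens inside Ray's pickle deserialization rather than in the example script, so it cannot simply be applied here:

import torch

# Remap CUDA-saved storages onto the CPU so they can be deserialized
# on a machine where torch.cuda.is_available() is False.
# "model.pt" is a placeholder checkpoint path.
state = torch.load("model.pt", map_location=torch.device("cpu"))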

What do you mean by different results? Did either of them run successfully? The traceback seems to be complaining about a dependency error.

It is running successfully on Windows.

It seems like you have CUDA available on Linux. Have you set use_gpu=True in scaling_config?
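It should look something like this (just a sketch; train_func is a placeholder for whatever training function you pass in):

from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

# use_gpu=True assigns each worker a GPU; num_workers=4 is illustrative.
trainer = TorchTrainer(
    train_func,  # placeholder for your training function
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)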

Yes, I have set it up.

I think there is some issue with the current "getting started" code.

Could you try the following script and let me know if it works?

import torch
import torch.nn as nn

num_samples = 20
input_size = 10
layer_size = 15
output_size = 5

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))

# In this example we use a randomly generated dataset.
input = torch.randn(num_samples, input_size)
labels = torch.randn(num_samples, output_size)

import torch.optim as optim

from ray import train

def train_func_distributed():
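    # Move the data onto the device Ray Train assigns to this worker.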
    input_on_device = input.to(train.torch.get_device())
    labels_on_device = labels.to(train.torch.get_device())
    num_epochs = 3
    model = NeuralNetwork()
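    # prepare_model moves the model to the worker's device and wraps it
    # for distributed training.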
    model = train.torch.prepare_model(model)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(num_epochs):
        output = model(input_on_device)
        loss = loss_fn(output, labels_on_device)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"epoch: {epoch}, loss: {loss.item()}")

from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig


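# Run the training function on 4 workers, each using a GPU.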
trainer = TorchTrainer(
    train_func_distributed,
    scaling_config=ScalingConfig(
        num_workers=4, use_gpu=True)
)

results = trainer.fit()

Oh, thank you very much! Now I can run it and get the results.

Thanks for confirming it. I will update the code.