Tune: Passing cluster configuration to Ray


I'm using Tune with PyTorch for hyperparameter tuning on a single node. I'm trying to set a few Ray cluster parameters, such as "address" to "localhost", before calling `tune.Tuner()`, which I suppose will set the `--node_ip_address=` option when the cluster management processes run.

I'm new to Ray and unsure whether `tune.Tuner()` calls `ray.init()` under the hood (I suppose it does). If so, how can I pass cluster parameters down to `ray.init()`? If not, how can I set which host address is used when running `tune.Tuner()`?

I’m using the ray/python/ray/tune/examples/mnist_pytorch.py as my starting point.

Thanks in advance


Tune does call ray.init() under the hood if you don’t initialize it yourself. You can add a ray.init() before calling Tuner.fit() to set some custom cluster parameters!
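For example, a minimal sketch of that ordering, assuming the single-node setup from the question. The trivial `trainable` function here is just a stand-in for the MNIST training function, and `_node_ip_address` is an underscore-prefixed (private) argument of `ray.init()`, so treat it as an assumption that may change between Ray versions rather than a stable API:

```python
import ray
from ray import tune
from ray.air import session


def trainable(config):
    # Minimal stand-in for the real MNIST training function.
    session.report({"loss": config["lr"]})


# Initialize Ray yourself *before* creating the Tuner; Tune will then
# reuse this instance instead of calling ray.init() internally.
# NOTE: `_node_ip_address` is a private argument (an assumption on my
# part that it applies here), not a documented public parameter.
ray.init(_node_ip_address="127.0.0.1")

tuner = tune.Tuner(trainable, param_space={"lr": tune.loguniform(1e-4, 1e-1)})
results = tuner.fit()
```

The key point is simply that any `ray.init()` call placed before `Tuner.fit()` wins; Tune only initializes Ray itself when no instance exists yet.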

By the way, if you're on Ray 2.4+, you should see this in your logs: [log message screenshot omitted]

Does this log message answer your question?


Hello @justinvyu

It does; however, I don't see that message in the logs. By the way, I'm using Ray 2.4.0.

Anyway, I added a `ray.init()` call before `Tuner.fit()` in the mnist_pytorch.py script, trying to make the Ray workers use 'localhost' or IP '', without success. The workers still use the IP from the workstation's network card.

Logs are like:
[2023-04-28 13:38:26,178 I 62180 62180] core_worker.cc:215: Initializing worker at address: AA.BB.CC.DD:45357, worker ID e88f0d5b81566120ca6de321c89b289d5ef19743db7be3fdd9f74b8a, raylet 8817e95f545781e4d2079ff9017de876b28c90b709bd5c827a4bc86a

Where AA.BB.CC.DD is the workstation’s NIC IP.

So, is there a way to force Ray Tune worker initialization to use localhost or address?

Thanks again,

What are you trying to achieve? Do you want other machines to be able to connect to this machine? If so, you should start Ray with `ray start --head` on the command line and then connect using e.g. `ray.init("")`
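Concretely, the flow could look something like the sketch below. The `--node-ip-address` flag is the CLI way to pin which IP the node advertises when starting the head node; the exact port and address values here are illustrative, not prescribed:

```python
# On the machine, start Ray from the command line first, e.g.:
#
#   ray start --head --node-ip-address=127.0.0.1 --port=6379
#
# Then attach to that already-running instance from Python:
import ray

ray.init(address="auto")  # connects to the existing local cluster

# Each node's dict includes the address it actually bound to
# (under the "NodeManagerAddress" key), which is a quick sanity check.
print(ray.nodes())
```

Starting the head via the CLI is also the only place the full set of networking flags is exposed, which is why this tends to be more reliable than passing addresses through `ray.init()` alone.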

Hello @kai
Let me describe the problem I'm facing in more detail. I'm trying to use Ray Tune on a single workstation (from NVIDIA), but the hyperparameter tuning does not start. As I said, I'm using the ray/python/ray/tune/examples/mnist_pytorch.py script as a starting point. Here is how I'm running it:

$ python mnist_pytorch.py
2023-05-01 11:12:47,096	INFO worker.py:1625 -- Started a local Ray instance.

And it does not progress from there. There are no other messages in the terminal. I see a lot of `ray::IDLE` processes running, along with `gcs_server`, `monitor.py`, `dashboard.py`, `raylet`, `log_monitor.py`, and `agent.py`.
My environment is:

$ python -V
Python 3.8.14
$ python -c "import ray; print(f'Ray version {ray.__version__}')"
Ray version 2.4.0
$ python -c "import torch; print(f'PyTorch version {torch.__version__}')"
PyTorch version 1.12.1+cu113
$ lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	RedHatEnterpriseServer
Description:	Red Hat Enterprise Linux Server release 7.9 (Maipo)
Release:	7.9
Codename:	Maipo

I suspect firewall rules are preventing the processes from communicating. I'm not a super-user on this workstation, nor can I add/remove rules to verify whether that's the case. So I'm trying to make all processes bind to the localhost/ and check whether the tuning starts (I'm positive there are no rules on the loopback/ interface). All processes other than dashboard.py bind to the workstation's real IP (from a network interface).

What would be the best approach to debug this problem?
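For reference, one quick check is to see which IP the OS would auto-detect for outbound traffic, which (to my understanding, as an assumption) is similar to how Ray picks a node IP. The UDP-socket trick below sends no actual packets; `connect()` on a datagram socket only selects a route:

```python
import socket


def detected_ip() -> str:
    """Return the IP of the interface the OS would route external traffic through.

    connect() on a UDP socket sends no packets; it only picks a route,
    so this works even without reachable internet.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))  # any external address works; nothing is sent
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route at all -> fall back to loopback
    finally:
        s.close()


print(detected_ip())
```

If this prints the NIC's IP rather than 127.0.0.1, that would explain why the workers bind to the workstation's address by default.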

The mnist_pytorch.py example worked on two other systems I use (a macOS laptop and another Linux workstation, where I am a super-user and there are no firewall rules).

Thanks in advance,

Hello @kai and @justinvyu

After deeper troubleshooting, the problem turned out to be related to the forum topic "System will be halted when tasks number is large" (Ray Core).

However, I am still curious about how to make Ray bind to a specific workstation IP. Any help is very much appreciated.