Tune: Passing cluster configuration to ray

Brasilino · April 27, 2023, 4:04pm

Hello,

I’m using Tune with PyTorch for hyperparameter tuning in a single node. I’m trying to setup a few ray cluster parameters such as “address” to “localhost” then calling “tune.Tuner()” , which I suppose will set the “–node_ip_address=” option when cluster management processes run.

I’m new to ray and in doubt if “tune.Tuner()” calls “ray.init()” under the hood (I suppose it does). If so, how can I pass cluster parameters down to ray.init(). If not, how can I set which host address to be used when running “tune.Tuner()” ?

I’m using the ray/python/ray/tune/examples/mnist_pytorch.py as my starting point.

Thanks in advance
Lucas

justinvyu · April 27, 2023, 6:17pm

@Brasilino

Tune does call ray.init() under the hood if you don’t initialize it yourself. You can add a ray.init() before calling Tuner.fit() to set some custom cluster parameters!

By the way, if you’re on Ray 2.4+, you should see this in your logs:

ray/tune.py at aeed2b3c58db4a505f2a5b5502e41898290effdb · ray-project/ray · GitHub

Does this log message answer your question?

Brasilino · April 28, 2023, 5:42pm

Hello @justinvyu

It does, however I don’t see that messages in logs. Btw, I’m using ray 2.4.0.

Anyway, I added a ray.init() call before turner.fit() in the mnist_pytorch.py script trying to make ray workers to use ‘localhost’ or IP ‘127.0.0.1’ without success. Workers still use the IP from a particular workstation’s network card.

Logs are like:
[2023-04-28 13:38:26,178 I 62180 62180] core_worker.cc:215: Initializing worker at address: AA.BB.CC.DD:45357, worker ID e88f0d5b81566120ca6de321c89b289d5ef19743db7be3fdd9f74b8a, raylet 8817e95f545781e4d2079ff9017de876b28c90b709bd5c827a4bc86a

Where AA.BB.CC.DD is the workstation’s NIC IP.

So, is there a way to force a Ray Tune worker initialization to use localhost or address 127.0.0.1 ?

Thanks again,
Lucas

kai · May 1, 2023, 10:12am

What are you trying to achieve? Do you want other machines to be able to connect to the machine? If so, you should start ray with ray start --head on the command line and the connect using e.g. ray.init("127.0.0.1:6379")

Brasilino · May 1, 2023, 3:44pm

Hello @kai
I’ll better detail the problem I’m facing. I’m trying to use Ray Tune in a single Workstation (from NVIDIA), however the hyperparameter tuning does not start . As I said, I’m using the ray/python/ray/tune/examples/mnist_pytorch.py script as starting point. Here how I’m running it:

$ python mnist_pytorch.py
2023-05-01 11:12:47,096	INFO worker.py:1625 -- Started a local Ray instance.

And it does not progress from there. No other message in the terminal. I see a lot of ray::IDLE processes along with gcs_server, monitor.py, dashboard.py,raylet,log_monitor.py and agent.py running.
My environment is:

$ python -V
Python 3.8.14
$ python -c "import ray; print(f'Ray version {ray.__version__}')"
Ray version 2.4.0
$ python -c "import torch; print(f'PyTorch version {torch.__version__}')"
PyTorch version 1.12.1+cu113
$ lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	RedHatEnterpriseServer
Description:	Red Hat Enterprise Linux Server release 7.9 (Maipo)
Release:	7.9
Codename:	Maipo

I suspect firewall rules are preventing processes to communicate. I’m not super-user on this workstation neither can add/remove rules to verify if that’s the case. So I’m trying to make all processes bind to the localhost/127.0.0.1 and check if it starts the tuning (I’m positive there’s no rules on the loopback/127.0.0.1 interface). All processes other than dashboard.py bind to the real workstation’s IP (from a network interface).

What would be the best approach to debug this problem ?

The mnist_pytorch.py example worked in two other systems I’m using (a MacOS laptop and another Linux workstation, where I am super-user and there’s no firewall rules).

Thanks in advance,
Lucas

Brasilino · May 3, 2023, 7:17pm

Hello @kai and @justinvyu

After deeper troubleshooting, the problem was related to System will be halted when tasks number is large - Ray Core - Ray.

However, I am still curious about how to make Ray bind to a specific workstation’s IP. Any help is very much appreciated.

Thanks
Lucas

Topic		Replies	Views
Running ray tune on an already existing cluster Ray Tune	1	331	February 19, 2022
Question about Ray Cluster/ Ray on prem Ray Clusters	6	746	June 15, 2021
Getting Started with Ray does not work on any computer I try it Ray Tune	4	2428	September 13, 2023
Ray status can connect but client code cannot Ray Clusters	10	1266	August 3, 2021
Unable to run example, returns error message	4	980	March 14, 2023

Tune: Passing cluster configuration to ray

Related topics