I’m using Tune with PyTorch for hyperparameter tuning on a single node. I’m trying to set a few Ray cluster parameters, such as “address” to “localhost”, and then call “tune.Tuner()”, which I suppose will set the “--node_ip_address=” option when the cluster management processes run.
I’m new to Ray and unsure whether “tune.Tuner()” calls “ray.init()” under the hood (I suppose it does). If so, how can I pass cluster parameters down to ray.init()? If not, how can I set which host address is used when running “tune.Tuner()”?
I’m using the ray/python/ray/tune/examples/mnist_pytorch.py as my starting point.
Tune does call ray.init() under the hood if you don’t initialize it yourself. You can add a ray.init() before calling Tuner.fit() to set some custom cluster parameters!
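For example, here’s a minimal sketch (with a dummy trainable standing in for the train_mnist function from mnist_pytorch.py; adapt the names, resources, and search space to your script):

import ray
from ray import tune
from ray.air import session

# Initialize Ray explicitly first; Tune will reuse this instance
# instead of starting its own with default settings.
ray.init(num_cpus=2, include_dashboard=False)

def trainable(config):
    # Dummy trainable standing in for train_mnist from mnist_pytorch.py.
    session.report({"mean_accuracy": config["lr"]})

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.grid_search([0.01, 0.1])},
)
results = tuner.fit()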
By the way, if you’re on Ray 2.4+, you should see a message about this in your logs.
It does; however, I don’t see that message in the logs. By the way, I’m using Ray 2.4.0.
Anyway, I added a ray.init() call before tuner.fit() in the mnist_pytorch.py script, trying to make the Ray workers use ‘localhost’ or the IP ‘127.0.0.1’, without success. Workers still use the IP from the workstation’s network card.
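Roughly, what I tried looked like this (a sketch; _node_ip_address is a private, undocumented ray.init() argument I found while searching, so it may not be the right knob):

import ray

# Try to pin this node to the loopback interface before Tune starts.
# _node_ip_address is private API and may change between Ray versions.
ray.init(_node_ip_address="127.0.0.1")

with the rest of the script (the Tuner setup and tuner.fit()) left unchanged.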
Logs are like: [2023-04-28 13:38:26,178 I 62180 62180] core_worker.cc:215: Initializing worker at address: AA.BB.CC.DD:45357, worker ID e88f0d5b81566120ca6de321c89b289d5ef19743db7be3fdd9f74b8a, raylet 8817e95f545781e4d2079ff9017de876b28c90b709bd5c827a4bc86a
Where AA.BB.CC.DD is the workstation’s NIC IP.
So, is there a way to force the Ray Tune worker initialization to use localhost or the address 127.0.0.1?
What are you trying to achieve? Do you want other machines to be able to connect to this machine? If so, you should start Ray with ray start --head on the command line and then connect using e.g. ray.init("127.0.0.1:6379")
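For example (a sketch; adjust the port if you don’t use the default 6379):

$ ray start --head --node-ip-address=127.0.0.1 --port=6379

and then, in your script:

import ray

# Connect to the already-running head node on the loopback interface
# instead of letting Tune start a fresh local instance.
ray.init(address="127.0.0.1:6379")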
Hello @kai
Let me describe the problem I’m facing in more detail. I’m trying to use Ray Tune on a single workstation (from NVIDIA), but the hyperparameter tuning does not start. As I said, I’m using the ray/python/ray/tune/examples/mnist_pytorch.py script as my starting point. Here is how I’m running it:
$ python mnist_pytorch.py
2023-05-01 11:12:47,096 INFO worker.py:1625 -- Started a local Ray instance.
And it does not progress from there. No other messages appear in the terminal. I see a lot of ray::IDLE processes, along with gcs_server, monitor.py, dashboard.py, raylet, log_monitor.py, and agent.py running.
My environment is:
$ python -V
Python 3.8.14
$ python -c "import ray; print(f'Ray version {ray.__version__}')"
Ray version 2.4.0
$ python -c "import torch; print(f'PyTorch version {torch.__version__}')"
PyTorch version 1.12.1+cu113
$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.9 (Maipo)
Release: 7.9
Codename: Maipo
I suspect firewall rules are preventing the processes from communicating. I’m not a superuser on this workstation, nor can I add/remove rules to verify whether that’s the case. So I’m trying to make all processes bind to localhost/127.0.0.1 and check whether the tuning starts (I’m positive there are no rules on the loopback/127.0.0.1 interface). All processes other than dashboard.py bind to the workstation’s real IP (from a network interface); I checked the bound addresses as shown below.
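(I inspected the listening sockets with something along these lines; the exact process names in the filter may vary:

$ ss -ltnp | grep -E 'gcs_server|raylet|python'

which lists the listening TCP sockets together with the processes that own them.)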
What would be the best approach to debug this problem?
The mnist_pytorch.py example worked on two other systems I use (a macOS laptop and another Linux workstation, where I am a superuser and there are no firewall rules).