Run multiple tune jobs on one server

richardrl · February 11, 2021, 3:18am

If I do:

ray start --head

Then I run my tune job.

It works one time.

But while that job is running, if I run another job it says:

OSError: [Errno 98] Address already in use

How can I run multiple Tune jobs at the same time? My machine has plenty of spare resources.

sangcho · February 11, 2021, 3:24am

What do you mean by multiple jobs in this case? Did you run ray start --head again?

richardrl · February 12, 2021, 12:18am

No, I did not. I mean: run ray start once, then run my training script (which corresponds to one set of hyperparameters), then run the training script a second time with another set of hyperparameters in a new terminal window. @sangcho

sangcho · February 12, 2021, 7:02am

Do you have a full stacktrace for the error?

richardrl · February 19, 2021, 12:48am

@sangcho

Traceback (most recent call last):
  File "bandu/torch/train_ray_simple.py", line 841, in <module>
    tune_bandu(num_workers=args.num_workers, use_gpu=args.use_gpu) # If number of GPUs does not match the EC2 GPUs, we may get an error...
  File "bandu/torch/train_ray_simple.py", line 747, in tune_bandu
    queue_trials=True
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/tune.py", line 377, in run
    metric=metric)
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 140, in __init__
    self._server = TuneServer(self, self._server_port)
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/web_server.py", line 243, in __init__
    self._server = HTTPServer(address, RunnerHandler(runner))
  File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
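
For reference, the traceback ends in HTTPServer's bind call, which suggests that each tune.run() tries to start its own TuneServer on the same port. The same failure can be reproduced with the standard library alone, independent of Ray. This is only a minimal sketch: the function name is made up for illustration, and the result is what I would expect on Linux or macOS.

```python
import errno
from http.server import BaseHTTPRequestHandler, HTTPServer


def second_bind_errno():
    """Start one HTTP server, then try to start a second one on the same port.

    Returns the errno of the OSError raised by the second bind,
    or None if the second bind unexpectedly succeeds.
    """
    # Port 0 asks the OS for any free port, so the sketch never clashes
    # with a server that is already running on this machine.
    first = HTTPServer(("127.0.0.1", 0), BaseHTTPRequestHandler)
    port = first.server_address[1]
    try:
        second = HTTPServer(("127.0.0.1", port), BaseHTTPRequestHandler)
    except OSError as exc:
        return exc.errno
    else:
        second.server_close()
        return None
    finally:
        first.server_close()


if __name__ == "__main__":
    # Compare against the symbolic constant rather than the literal 98,
    # because the numeric value differs between platforms.
    print(second_bind_errno() == errno.EADDRINUSE)
```

So the error is about the port of the second server, not about a lack of CPU or GPU resources.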

richardrl · February 19, 2021, 1:11am

The solution is to change the port in tune.run(…). You can just randomize this port.
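
A rough sketch of how a free port could be picked, assuming the tune.run() argument in question is called server_port (check the signature in your Ray version). find_free_port is a hypothetical helper, not part of Ray.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    print(port)
    # Assumed usage (argument name not verified here):
    # tune.run(my_trainable, server_port=port, ...)
```

The socket is closed before the port is handed on, so another process could in principle grab the same port in between. For a handful of jobs on one machine that race is unlikely, but it is not impossible.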

Thanks @rliaw @sangcho

richardrl · February 19, 2021, 2:59am

Actually, you can’t randomize the port. This happens:

2021-02-18 21:55:37,843 INFO web_server.py:242 -- Starting Tune Server...
500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}

500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}

500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}

sangcho · February 19, 2021, 8:25am

cc @rliaw Is this port conflict coming from Tune?

rliaw · February 19, 2021, 8:28am

This seems to be a Wandb error, not a Tune server error…
