Run multiple tune jobs on one server

If I do:

ray start --head

Then I run my tune job.

It works one time.

But while that job is running, if I run another job it says:

OSError: [Errno 98] Address already in use

How can I run multiple Tune jobs at the same time? My machine has plenty of spare resources.

What do you mean by multiple jobs in this case? Did you run ray start --head again?


No, I did not. I mean: run ray start --head once, run my training script (which corresponds to one set of hyperparameters), then run the same training script a second time with another set of hyperparameters in a new terminal window. @sangcho

Do you have a full stacktrace for the error?

@sangcho

Traceback (most recent call last):
  File "bandu/torch/train_ray_simple.py", line 841, in <module>
    tune_bandu(num_workers=args.num_workers, use_gpu=args.use_gpu) # If number of GPUs does not match the EC2 GPUs, we may get an error...
  File "bandu/torch/train_ray_simple.py", line 747, in tune_bandu
    queue_trials=True
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/tune.py", line 377, in run
    metric=metric)
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 140, in __init__
    self._server = TuneServer(self, self._server_port)
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/web_server.py", line 243, in __init__
    self._server = HTTPServer(address, RunnerHandler(runner))
  File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
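
For context, the traceback ends in socketserver's server_bind, which suggests that each run tries to start its own TuneServer on the same port while the first run is still holding it. The snippet below is only a minimal sketch of that failure mode using plain sockets; it does not touch Ray, and all variable names in it are illustrative.

```python
import errno
import socket

# First "job": bind a listening socket to a port chosen by the OS.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
port = first.getsockname()[1]

# Second "job": try to bind the same port while the first is still listening.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
caught_errno = None
try:
    second.bind(("127.0.0.1", port))
except OSError as exc:
    # On Linux, errno.EADDRINUSE is 98, matching the traceback above.
    caught_errno = exc.errno
    print(f"bind failed: {exc}")
finally:
    second.close()
    first.close()
```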

The solution is to change the port in tune.run(…). You can just randomize this port.
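
One way to pick a port that is not already taken is to ask the OS for a free one. This is a sketch, not an official recipe: find_free_port is a hypothetical helper, and the server_port keyword in the comment is an assumption inferred from self._server_port in the traceback, so check the tune.run signature for your Ray version.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Hypothetical helper: ask the OS for a TCP port that is currently unused."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 lets the OS choose any free port.
        s.bind((host, 0))
        return s.getsockname()[1]


port = find_free_port()
print(f"Using port {port} for this run")

# Assumption: tune.run accepts a `server_port` keyword; `trainable` is a placeholder.
# tune.run(trainable, server_port=port)
```

Note that the port is released before the server binds it, so another process could grab it in between; for a handful of jobs on one machine that race is unlikely but possible.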

Thanks @rliaw @sangcho


Actually, you can’t randomize the port. This happens:

2021-02-18 21:55:37,843 INFO web_server.py:242 -- Starting Tune Server...
500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}

500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}

500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}

cc @rliaw Is this a port conflict from Tune?

This seems to be a Wandb error, not a Tune server error…