If I do:
ray start --head
Then I run my tune job.
It works one time.
But while that job is running, if I run another job it says:
OSError: [Errno 98] Address already in use
How can I run multiple Tune jobs at the same time? My machine has plenty of spare resources.
What do you mean by multiple jobs in this case? Did you run ray start --head again?
No, I did not. I mean: run ray start once, run my training script (which corresponds to one set of hyperparameters), then run the training script a second time with another set of hyperparameters in a new terminal window. @sangcho
Do you have a full stacktrace for the error?
Traceback (most recent call last):
  File "bandu/torch/train_ray_simple.py", line 841, in <module>
    tune_bandu(num_workers=args.num_workers, use_gpu=args.use_gpu) # If number of GPUs does not match the EC2 GPUs, we may get an error...
  File "bandu/torch/train_ray_simple.py", line 747, in tune_bandu
    queue_trials=True
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/tune.py", line 377, in run
    metric=metric)
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 140, in __init__
    self._server = TuneServer(self, self._server_port)
  File "/home/richard/improbable/venvs/minimum_bandu_venv/lib/python3.6/site-packages/ray/tune/web_server.py", line 243, in __init__
    self._server = HTTPServer(address, RunnerHandler(runner))
  File "/usr/lib/python3.6/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/usr/lib/python3.6/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.6/socketserver.py", line 470, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
The solution is to change the Tune server port in tune.run(…) so that each job binds to a different one. You can simply pick a random port for each run.
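For what it's worth, here is a minimal sketch of one way to pick a port that is free right now, instead of guessing a random number. It asks the OS for an unused port by binding to port 0. The helper name find_free_port is made up for illustration, and the commented-out tune.run call assumes your Ray version accepts a server_port argument (the traceback above shows the trial runner passing self._server_port to TuneServer); my_trainable and my_config are placeholders, not names from your script.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port that is currently unused on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the kernel choose any free port.
        sock.bind((host, 0))
        # getsockname() returns (host, port); we only need the port.
        return sock.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    print(f"Tune server port for this run: {port}")
    # Assumption: tune.run in your Ray version takes `server_port`.
    # `my_trainable` and `my_config` are hypothetical placeholders.
    # tune.run(my_trainable, config=my_config, server_port=port)
```

Note that the port is released before Tune binds to it, so another process could in principle grab it in between. For a handful of jobs on one machine that window is small, but it is not a guarantee.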
Actually, you can’t randomize the port. This happens:
2021-02-18 21:55:37,843 INFO web_server.py:242 -- Starting Tune Server...
500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}
500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}
500 response executing GraphQL.
{"error":"Error 1040: Too many connections"}
cc @rliaw Is this a port conflict from Tune?
This seems to be a Wandb error, not a Tune server error…