Cartpole_server.py with evaluation_interval of 1 leads to Address already in use Error

This issue is similar to a previous post, though my error log is slightly different, so I’ve created this new thread.

This is a minimal reproducible example of the same problem I'm encountering in my research test bed, which has a more complex server/client integration.

Environment

14" MacBook Pro M1
OS X Monterey (12.5.1)
Ray 2.0 installed for Mac silicon

The problem

I am attempting to run the cartpole_server.py example out of the box, except with an evaluation_interval set.

To that end, I have added the entry "evaluation_interval": 1 to the config and am running the script as cartpole_server.py --num-workers 0 --num-cpus 1, intending to create a single rollout worker attached to this server instance.

In fact, I get the same error with or without --num-workers 0 --num-cpus 1, and regardless of whether I choose --framework tf (TensorFlow) or torch (PyTorch).
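
For context, here is the example's server-input creator, reconstructed from the traceback below and from the _eval_input variant suggested in the reply (a sketch, not a verbatim copy). Note that worker index 0 and worker index 1 both resolve to args.port:

    # Sketch of cartpole_server.py's `_input` (reconstructed, not verbatim).
    def _input(ioctx):
        # We are a remote worker, or the local worker with num_workers=0:
        # create a PolicyServerInput.
        if ioctx.worker_index > 0 or ioctx.worker.num_workers == 0:
            return PolicyServerInput(
                ioctx,
                SERVER_ADDRESS,
                # worker_index 0 and worker_index 1 both map to args.port + 0.
                args.port + ioctx.worker_index - (1 if ioctx.worker_index > 0 else 0),
            )
        # No InputReader (PolicyServerInput) needed.
        else:
            return None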

This results in params.json as follows:

{
  "action_space": "Discrete(2)",
  "callbacks": null,
  "env": null,
  "evaluation_interval": 1,
  "framework": "torch",
  "input": "<function _input at 0x14f67af70>",
  "log_level": "INFO",
  "model": {
    "use_lstm": false
  },
  "num_workers": 0,
  "observation_space": "Box([-inf -inf -inf -inf], [inf inf inf inf], (4,), float32)",
  "off_policy_estimation_methods": {},
  "rollout_fragment_length": 1000,
  "train_batch_size": 4000
}

This fails with OSError: [Errno 48] Address already in use. The logs make it clear that RLlib has built two rollout workers but assigned the same port to each.

Discussion

I have read elsewhere that the server is expected to fail with this same error when --num-workers > 0. I would assume that setting it to zero leads to only one rollout worker being created, and since evaluation_num_workers defaults to 0, no additional rollout worker should be created for evaluation in this setting. Yet the traceback below shows the failure happening inside Algorithm.setup() while building the evaluation WorkerSet's local worker, which invokes the same _input creator and tries to bind port 9900 a second time.
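
The failure mode itself is easy to reproduce in isolation (a minimal sketch, independent of RLlib):

    # Binding two HTTP servers to the same (address, port) raises the same
    # OSError seen in the log: Errno 48 on macOS, Errno 98 on Linux.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    first = HTTPServer(("localhost", 9900), BaseHTTPRequestHandler)
    second = HTTPServer(("localhost", 9900), BaseHTTPRequestHandler)  # raises OSError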

Full error log

$ /Users/rjf/miniforge3/bin/python /Users/rjf/dev/external/ray/rllib/examples/serving/cartpole_server.py --framework torch --num-workers 0 --num-cpus 1
Running with following CLI args: Namespace(port=9900, callbacks_verbose=False, num_workers=0, no_restore=False, run='PPO', num_cpus=1, framework='torch', use_lstm=False, stop_iters=200, stop_timesteps=500000, stop_reward=80.0, as_test=False, no_tune=False, local_mode=False)
2022-08-29 13:33:42,860 INFO worker.py:1518 -- Started a local Ray instance.
Ignoring restore even if previous checkpoint is provided...
(PPO pid=5915) 2022-08-29 13:33:45,171  INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=5915) 2022-08-29 13:33:45,178  INFO policy.py:939 -- Policy (worker=local) running on CPU.
(PPO pid=5915) 2022-08-29 13:33:45,178  INFO torch_policy_v2.py:98 -- Found 0 visible cuda devices.
(PPO pid=5915) 2022-08-29 13:33:45,181  INFO rollout_worker.py:1802 -- Built policy map: {}
(PPO pid=5915) 2022-08-29 13:33:45,181  INFO rollout_worker.py:1803 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.NoPreprocessor object at 0x137c1d5b0>}
(PPO pid=5915) 2022-08-29 13:33:45,181  INFO rollout_worker.py:654 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x137c1d4c0>}
(PPO pid=5915) 2022-08-29 13:33:46,188  INFO policy_server_input.py:154 -- Starting connector server at 1.0.0.127.in-addr.arpa:9900
(PPO pid=5915) 2022-08-29 13:33:46,192  WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(PPO pid=5915) 2022-08-29 13:33:46,192  INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=5915) 2022-08-29 13:33:46,203  INFO policy.py:939 -- Policy (worker=local) running on CPU.
(PPO pid=5915) 2022-08-29 13:33:46,203  INFO torch_policy_v2.py:98 -- Found 0 visible cuda devices.
(PPO pid=5915) 2022-08-29 13:33:46,206  INFO rollout_worker.py:1802 -- Built policy map: {}
(PPO pid=5915) 2022-08-29 13:33:46,206  INFO rollout_worker.py:1803 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.NoPreprocessor object at 0x137c9b160>}
(PPO pid=5915) 2022-08-29 13:33:46,206  INFO rollout_worker.py:654 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x137c9b070>}
(PPO pid=5915) Creating a PolicyServer on localhost:9900 failed!
== Status ==
Current time: 2022-08-29 13:33:48 (running for 00:00:04.79)
Memory usage on this node: 10.0/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/10 CPUs, 0/0 GPUs, 0.0/5.89 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/rjf/ray_results/PPO
Number of trials: 1/1 (1 RUNNING)


2022-08-29 13:33:48,235 ERROR trial_runner.py:980 -- Trial PPO_None_7e06c_00000: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/execution/ray_trial_executor.py", line 989, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=5915, ip=127.0.0.1, repr=PPO)
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
    super().__init__(config=config, logger_creator=logger_creator, **kwargs)
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 157, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 543, in setup
    self.evaluation_workers: WorkerSet = WorkerSet(
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 171, in __init__
    self._local_worker = self._make_worker(
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 661, in _make_worker
    worker = cls(
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 746, in __init__
    self.input_reader: InputReader = input_creator(self.io_context)
  File "/Users/rjf/dev/external/ray/rllib/examples/serving/cartpole_server.py", line 144, in _input
    return PolicyServerInput(
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/env/policy_server_input.py", line 146, in __init__
    HTTPServer.__init__(self, (address, port), handler)
  File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/Users/rjf/miniforge3/lib/python3.9/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 48] Address already in use

The trial PPO_None_7e06c_00000 errored with parameters={'env': None, 'observation_space': Box([-inf -inf -inf -inf], [inf inf inf inf], (4,), float32), 'action_space': Discrete(2), 'input': <function _input at 0x14f67af70>, 'num_workers': 0, 'off_policy_estimation_methods': {}, 'callbacks': None, 'framework': 'torch', 'log_level': 'INFO', 'model': {'use_lstm': False}, 'evaluation_interval': 1, 'rollout_fragment_length': 1000, 'train_batch_size': 4000}. Error file: /Users/rjf/ray_results/PPO/PPO_None_7e06c_00000_0_2022-08-29_13-33-43/error.txt
== Status ==
Current time: 2022-08-29 13:33:48 (running for 00:00:04.79)
Memory usage on this node: 10.0/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/5.89 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/rjf/ray_results/PPO
Number of trials: 1/1 (1 ERROR)
+----------------------+----------+-------+
| Trial name           | status   | loc   |
|----------------------+----------+-------|
| PPO_None_7e06c_00000 | ERROR    |       |
+----------------------+----------+-------+
Number of errored trials: 1
+----------------------+--------------+---------------------------------------------------------------------------------+
| Trial name           |   # failures | error file                                                                      |
|----------------------+--------------+---------------------------------------------------------------------------------|
| PPO_None_7e06c_00000 |            1 | /Users/rjf/ray_results/PPO/PPO_None_7e06c_00000_0_2022-08-29_13-33-43/error.txt |
+----------------------+--------------+---------------------------------------------------------------------------------+

2022-08-29 13:33:48,242 ERROR ray_trial_executor.py:103 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/execution/ray_trial_executor.py", line 94, in _post_stop_cleanup
    ray.get(future, timeout=0)
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/_private/worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=5915, ip=127.0.0.1, repr=PPO)
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
    super().__init__(config=config, logger_creator=logger_creator, **kwargs)
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 157, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 543, in setup
    self.evaluation_workers: WorkerSet = WorkerSet(
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 171, in __init__
    self._local_worker = self._make_worker(
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 661, in _make_worker
    worker = cls(
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 746, in __init__
    self.input_reader: InputReader = input_creator(self.io_context)
  File "/Users/rjf/dev/external/ray/rllib/examples/serving/cartpole_server.py", line 144, in _input
    return PolicyServerInput(
  File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/env/policy_server_input.py", line 146, in __init__
    HTTPServer.__init__(self, (address, port), handler)
  File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 452, in __init__
    self.server_bind()
  File "/Users/rjf/miniforge3/lib/python3.9/http/server.py", line 136, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 466, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 48] Address already in use

(PPO pid=5915) 2022-08-29 13:33:48,227  ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=5915, ip=127.0.0.1, repr=PPO)
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
(PPO pid=5915)     super().__init__(config=config, logger_creator=logger_creator, **kwargs)
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 157, in __init__
(PPO pid=5915)     self.setup(copy.deepcopy(self.config))
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 543, in setup
(PPO pid=5915)     self.evaluation_workers: WorkerSet = WorkerSet(
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 171, in __init__
(PPO pid=5915)     self._local_worker = self._make_worker(
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 661, in _make_worker
(PPO pid=5915)     worker = cls(
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 746, in __init__
(PPO pid=5915)     self.input_reader: InputReader = input_creator(self.io_context)
(PPO pid=5915)   File "/Users/rjf/dev/external/ray/rllib/examples/serving/cartpole_server.py", line 144, in _input
(PPO pid=5915)     return PolicyServerInput(
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/env/policy_server_input.py", line 146, in __init__
(PPO pid=5915)     HTTPServer.__init__(self, (address, port), handler)
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 452, in __init__
(PPO pid=5915)     self.server_bind()
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/http/server.py", line 136, in server_bind
(PPO pid=5915)     socketserver.TCPServer.server_bind(self)
(PPO pid=5915)   File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 466, in server_bind
(PPO pid=5915)     self.socket.bind(self.server_address)
(PPO pid=5915) OSError: [Errno 48] Address already in use
2022-08-29 13:33:48,349 ERROR tune.py:754 -- Trials did not complete: [PPO_None_7e06c_00000]
2022-08-29 13:33:48,349 INFO tune.py:758 -- Total run time: 4.91 seconds (4.79 seconds for the tuning loop).

Thank you for reading!

Hi @robfitzgerald,

I am not sure if your setup can support running an environment on the policy server, but if it can, then you should be able to get it to work with these settings. At least I was able to.

 "evaluation_config": {
       "env": "CartPole-v0",
       "input": "sampler"
}
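
A sketch of how that might slot into the example's config dict (assuming the rest of the config is as shipped with cartpole_server.py): evaluation then samples from a local CartPole-v0 env instead of waiting on client data.

    # Sketch only: layering the evaluation settings onto the example's
    # existing config dict (other keys assumed unchanged).
    config.update({
        "evaluation_interval": 1,
        "evaluation_config": {
            "env": "CartPole-v0",  # evaluate against a local env...
            "input": "sampler",    # ...sampled directly, not via PolicyServerInput
        },
    })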

If you want to get your evaluation data from the client, then you are going to need to create a new input function with a unique port for that. Something like:

    # `InputReader` generator (returns None if no input reader is needed on
    # the respective worker).
    def _eval_input(ioctx):
        # We are a remote worker, or the local worker with num_workers=0:
        # create a PolicyServerInput.
        if ioctx.worker_index > 0 or ioctx.worker.num_workers == 0:
            return PolicyServerInput(
                ioctx,
                SERVER_ADDRESS,
                # Same per-worker port scheme as the training `_input`, shifted
                # by +10 so evaluation servers never collide with training ones.
                args.port + ioctx.worker_index - (1 if ioctx.worker_index > 0 else 0) + 10,
            )
        # No InputReader (PolicyServerInput) needed.
        else:
            return None

...

 "evaluation_config": {
       "input": _eval_input
}
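
The client side then needs a second connection pointed at the shifted port (a hypothetical sketch mirroring how cartpole_client.py connects; the address and variable names here are assumptions):

    # Hypothetical client-side sketch: a separate PolicyClient for evaluation
    # episodes, pointed at the +10-shifted port (9910 if the server uses the
    # default 9900).
    from ray.rllib.env.policy_client import PolicyClient

    eval_client = PolicyClient("http://localhost:9910", inference_mode="remote")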

That makes sense. I didn't realize there was an "input" hook in the "evaluation_config", or that it mirrored the API of the PolicyServerInput. I'll give that a shot and report back here. Thanks @mannyv!

@robfitzgerald,

The evaluation workers use the same config you pass to the trainer. The evaluation_config section lets you provide alternative settings to use during evaluation: anything you set there overrides the corresponding key in the base config for the evaluation workers only.
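
As a plain-dict illustration of that merge behavior (a sketch of the semantics, not RLlib's actual internals):

    # Evaluation workers effectively see the base config with the
    # `evaluation_config` entries layered on top.
    base = {"input": _input, "num_workers": 0,
            "evaluation_config": {"input": _eval_input}}
    effective_eval = {**base, **base["evaluation_config"]}
    assert effective_eval["input"] is _eval_input  # evaluation binds the new port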