This issue is similar to a previous post, though my error log is slightly different, so I’ve created this new thread.
This is a minimal reproducible example of the same problem I'm encountering in my research test bed, which has a more complex server/client integration.
Environment
14" MacBook Pro M1
macOS Monterey (12.5.1)
Ray 2.0 installed for Apple silicon
The problem
I am attempting to run the cartpole_server.py example out-of-the-box, except with an evaluation interval set. To that end, I have added the entry `"evaluation_interval": 1` to the config and am running the script with `cartpole_server.py --num-workers 0 --num-cpus 1`, intending to create a single rollout worker attached to this server instance. In fact, I get the same error with or without `--num-workers 0 --num-cpus 1`, and regardless of whether `--framework` is TensorFlow (`tf`) or PyTorch (`torch`).
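For reference, the change amounts to a single added entry (a sketch; all other keys are the example's existing config entries, elided here):

```python
# The one-line change to the example's config dict in cartpole_server.py.
# Every other entry is left exactly as in the shipped example.
config = {
    # ... existing cartpole_server.py config entries ...
    "evaluation_interval": 1,  # evaluate after every training iteration
}
```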
This results in params.json as follows:
{
"action_space": "Discrete(2)",
"callbacks": null,
"env": null,
"evaluation_interval": 1,
"framework": "torch",
"input": "<function _input at 0x14f67af70>",
"log_level": "INFO",
"model": {
"use_lstm": false
},
"num_workers": 0,
"observation_space": "Box([-inf -inf -inf -inf], [inf inf inf inf], (4,), float32)",
"off_policy_estimation_methods": {},
"rollout_fragment_length": 1000,
"train_batch_size": 4000
}
This fails with `OSError: [Errno 48] Address already in use`. Reviewing the logs, it is clear that RLlib has built two rollout workers and assigned the same port to each.
Discussion
I have read elsewhere that the server is expected to fail with this same error when `--num-workers` > 0. I would have assumed that setting it to zero leads to only one rollout worker being created, and, since `evaluation_num_workers` defaults to `0`, that no additional rollout worker would be created for evaluation in this setting. The traceback below suggests otherwise: the failure occurs while the evaluation `WorkerSet` builds its local worker, which also calls the example's `_input` creator and tries to bind port 9900 a second time.
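The underlying failure can be reproduced without RLlib at all: the traceback shows `PolicyServerInput.__init__` calling `HTTPServer.__init__` on itself, and two `HTTPServer` instances cannot bind the same (host, port) pair. A minimal sketch (the port is chosen by the OS here rather than hard-coded to 9900; `EADDRINUSE` is errno 48 on macOS, 98 on Linux):

```python
import errno
from http.server import BaseHTTPRequestHandler, HTTPServer

# First bind succeeds -- this is what the training worker's
# PolicyServerInput does via HTTPServer.__init__.
first = HTTPServer(("127.0.0.1", 0), BaseHTTPRequestHandler)
port = first.server_address[1]  # whatever free port the OS picked

# A second bind to the same (host, port) fails, just as the evaluation
# worker's PolicyServerInput does when handed the same port.
try:
    second = HTTPServer(("127.0.0.1", port), BaseHTTPRequestHandler)
except OSError as e:
    # Errno 48 on macOS, 98 on Linux; errno.EADDRINUSE on both.
    print(e.errno == errno.EADDRINUSE)
finally:
    first.server_close()
```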
Full error log
$ /Users/rjf/miniforge3/bin/python /Users/rjf/dev/external/ray/rllib/examples/serving/cartpole_server.py --framework torch --num-workers 0 --num-cpus 1
Running with following CLI args: Namespace(port=9900, callbacks_verbose=False, num_workers=0, no_restore=False, run='PPO', num_cpus=1, framework='torch', use_lstm=False, stop_iters=200, stop_timesteps=500000, stop_reward=80.0, as_test=False, no_tune=False, local_mode=False)
2022-08-29 13:33:42,860 INFO worker.py:1518 -- Started a local Ray instance.
Ignoring restore even if previous checkpoint is provided...
(PPO pid=5915) 2022-08-29 13:33:45,171 INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=5915) 2022-08-29 13:33:45,178 INFO policy.py:939 -- Policy (worker=local) running on CPU.
(PPO pid=5915) 2022-08-29 13:33:45,178 INFO torch_policy_v2.py:98 -- Found 0 visible cuda devices.
(PPO pid=5915) 2022-08-29 13:33:45,181 INFO rollout_worker.py:1802 -- Built policy map: {}
(PPO pid=5915) 2022-08-29 13:33:45,181 INFO rollout_worker.py:1803 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.NoPreprocessor object at 0x137c1d5b0>}
(PPO pid=5915) 2022-08-29 13:33:45,181 INFO rollout_worker.py:654 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x137c1d4c0>}
(PPO pid=5915) 2022-08-29 13:33:46,188 INFO policy_server_input.py:154 -- Starting connector server at 1.0.0.127.in-addr.arpa:9900
(PPO pid=5915) 2022-08-29 13:33:46,192 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(PPO pid=5915) 2022-08-29 13:33:46,192 INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=5915) 2022-08-29 13:33:46,203 INFO policy.py:939 -- Policy (worker=local) running on CPU.
(PPO pid=5915) 2022-08-29 13:33:46,203 INFO torch_policy_v2.py:98 -- Found 0 visible cuda devices.
(PPO pid=5915) 2022-08-29 13:33:46,206 INFO rollout_worker.py:1802 -- Built policy map: {}
(PPO pid=5915) 2022-08-29 13:33:46,206 INFO rollout_worker.py:1803 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.NoPreprocessor object at 0x137c9b160>}
(PPO pid=5915) 2022-08-29 13:33:46,206 INFO rollout_worker.py:654 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x137c9b070>}
(PPO pid=5915) Creating a PolicyServer on localhost:9900 failed!
== Status ==
Current time: 2022-08-29 13:33:48 (running for 00:00:04.79)
Memory usage on this node: 10.0/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/10 CPUs, 0/0 GPUs, 0.0/5.89 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/rjf/ray_results/PPO
Number of trials: 1/1 (1 RUNNING)
2022-08-29 13:33:48,235 ERROR trial_runner.py:980 -- Trial PPO_None_7e06c_00000: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/execution/ray_trial_executor.py", line 989, in get_next_executor_event
future_result = ray.get(ready_future)
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/_private/worker.py", line 2277, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=5915, ip=127.0.0.1, repr=PPO)
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
super().__init__(config=config, logger_creator=logger_creator, **kwargs)
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 157, in __init__
self.setup(copy.deepcopy(self.config))
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 543, in setup
self.evaluation_workers: WorkerSet = WorkerSet(
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 171, in __init__
self._local_worker = self._make_worker(
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 661, in _make_worker
worker = cls(
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 746, in __init__
self.input_reader: InputReader = input_creator(self.io_context)
File "/Users/rjf/dev/external/ray/rllib/examples/serving/cartpole_server.py", line 144, in _input
return PolicyServerInput(
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/env/policy_server_input.py", line 146, in __init__
HTTPServer.__init__(self, (address, port), handler)
File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 452, in __init__
self.server_bind()
File "/Users/rjf/miniforge3/lib/python3.9/http/server.py", line 136, in server_bind
socketserver.TCPServer.server_bind(self)
File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 466, in server_bind
self.socket.bind(self.server_address)
OSError: [Errno 48] Address already in use
The trial PPO_None_7e06c_00000 errored with parameters={'env': None, 'observation_space': Box([-inf -inf -inf -inf], [inf inf inf inf], (4,), float32), 'action_space': Discrete(2), 'input': <function _input at 0x14f67af70>, 'num_workers': 0, 'off_policy_estimation_methods': {}, 'callbacks': None, 'framework': 'torch', 'log_level': 'INFO', 'model': {'use_lstm': False}, 'evaluation_interval': 1, 'rollout_fragment_length': 1000, 'train_batch_size': 4000}. Error file: /Users/rjf/ray_results/PPO/PPO_None_7e06c_00000_0_2022-08-29_13-33-43/error.txt
== Status ==
Current time: 2022-08-29 13:33:48 (running for 00:00:04.79)
Memory usage on this node: 10.0/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/5.89 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/rjf/ray_results/PPO
Number of trials: 1/1 (1 ERROR)
+----------------------+----------+-------+
| Trial name | status | loc |
|----------------------+----------+-------|
| PPO_None_7e06c_00000 | ERROR | |
+----------------------+----------+-------+
Number of errored trials: 1
+----------------------+--------------+---------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|----------------------+--------------+---------------------------------------------------------------------------------|
| PPO_None_7e06c_00000 | 1 | /Users/rjf/ray_results/PPO/PPO_None_7e06c_00000_0_2022-08-29_13-33-43/error.txt |
+----------------------+--------------+---------------------------------------------------------------------------------+
2022-08-29 13:33:48,242 ERROR ray_trial_executor.py:103 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/execution/ray_trial_executor.py", line 94, in _post_stop_cleanup
ray.get(future, timeout=0)
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/_private/worker.py", line 2277, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=5915, ip=127.0.0.1, repr=PPO)
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
super().__init__(config=config, logger_creator=logger_creator, **kwargs)
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 157, in __init__
self.setup(copy.deepcopy(self.config))
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 543, in setup
self.evaluation_workers: WorkerSet = WorkerSet(
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 171, in __init__
self._local_worker = self._make_worker(
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 661, in _make_worker
worker = cls(
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 746, in __init__
self.input_reader: InputReader = input_creator(self.io_context)
File "/Users/rjf/dev/external/ray/rllib/examples/serving/cartpole_server.py", line 144, in _input
return PolicyServerInput(
File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/env/policy_server_input.py", line 146, in __init__
HTTPServer.__init__(self, (address, port), handler)
File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 452, in __init__
self.server_bind()
File "/Users/rjf/miniforge3/lib/python3.9/http/server.py", line 136, in server_bind
socketserver.TCPServer.server_bind(self)
File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 466, in server_bind
self.socket.bind(self.server_address)
OSError: [Errno 48] Address already in use
(PPO pid=5915) 2022-08-29 13:33:48,227 ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=5915, ip=127.0.0.1, repr=PPO)
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
(PPO pid=5915) super().__init__(config=config, logger_creator=logger_creator, **kwargs)
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 157, in __init__
(PPO pid=5915) self.setup(copy.deepcopy(self.config))
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 543, in setup
(PPO pid=5915) self.evaluation_workers: WorkerSet = WorkerSet(
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 171, in __init__
(PPO pid=5915) self._local_worker = self._make_worker(
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 661, in _make_worker
(PPO pid=5915) worker = cls(
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 746, in __init__
(PPO pid=5915) self.input_reader: InputReader = input_creator(self.io_context)
(PPO pid=5915) File "/Users/rjf/dev/external/ray/rllib/examples/serving/cartpole_server.py", line 144, in _input
(PPO pid=5915) return PolicyServerInput(
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/site-packages/ray/rllib/env/policy_server_input.py", line 146, in __init__
(PPO pid=5915) HTTPServer.__init__(self, (address, port), handler)
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 452, in __init__
(PPO pid=5915) self.server_bind()
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/http/server.py", line 136, in server_bind
(PPO pid=5915) socketserver.TCPServer.server_bind(self)
(PPO pid=5915) File "/Users/rjf/miniforge3/lib/python3.9/socketserver.py", line 466, in server_bind
(PPO pid=5915) self.socket.bind(self.server_address)
(PPO pid=5915) OSError: [Errno 48] Address already in use
2022-08-29 13:33:48,349 ERROR tune.py:754 -- Trials did not complete: [PPO_None_7e06c_00000]
2022-08-29 13:33:48,349 INFO tune.py:758 -- Total run time: 4.91 seconds (4.79 seconds for the tuning loop).
Thank you for reading!