Evaluation in Serving

Hello, I am trying to figure out how to do evaluation runs with RLlib Serving. I am following this tutorial, specifically the CartPole example, but I am not sure how to set up evaluation runs.

One problem is that whenever I set “evaluation_interval” in the config for tune.run, I get a “port already in use” error.

Also, do I need dedicated serving clients with no_train=True for evaluation?
Thanks in advance!

Specifically, I get this wacky error message when I try to run it:

(pid=10029) 2022-01-27 11:53:06,531     INFO rollout_worker.py:1387 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.DictFlatteningPreprocessor object at 0x7f2b8e465280>}
(pid=10029) 2022-01-27 11:53:06,531     INFO rollout_worker.py:614 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x7f2b8e449910>}
(pid=10029) 2022-01-27 11:53:06,536     ERROR worker.py:428 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=10029, ip=172.17.0.6)
== Status ==
Memory usage on this node: 38.7/251.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/48 CPUs, 0/1 GPUs, 0.0/165.93 GiB heap, 0.0/75.1 GiB objects
Result logdir: /home/user/aurora-rl/src/outputs/2022-01-27/11-52-33/results/PPO
Number of trials: 1/1 (1 ERROR)
+----------------------+----------+-------+
| Trial name           | status   | loc   |
|----------------------+----------+-------|
| PPO_None_a8cae_00000 | ERROR    |       |
+----------------------+----------+-------+
Number of errored trials: 1
+----------------------+--------------+-----------------------------------------------------------------------------------------------------------------------+
| Trial name           |   # failures | error file                                                                                                            |
|----------------------+--------------+-----------------------------------------------------------------------------------------------------------------------|
| PPO_None_a8cae_00000 |            1 | /home/user/aurora-rl/src/outputs/2022-01-27/11-52-33/results/PPO/PPO_None_a8cae_00000_0_2022-01-27_11-52-54/error.txt |
+----------------------+--------------+-----------------------------------------------------------------------------------------------------------------------+

(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/agents/trainer_template.py", line 137, in __init__
(pid=10029)     Trainer.__init__(self, config, env, logger_creator)
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 611, in __init__
(pid=10029)     super().__init__(config, logger_creator)
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/tune/trainable.py", line 106, in __init__
(pid=10029)     self.setup(copy.deepcopy(self.config))
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/agents/trainer_template.py", line 147, in setup
(pid=10029)     super().setup(config)
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 793, in setup
(pid=10029)     self.evaluation_workers = self._make_workers(
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 846, in _make_workers
(pid=10029)     return WorkerSet(
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 103, in __init__
(pid=10029)     self._local_worker = self._make_worker(
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/evaluation/worker_set.py", line 399, in _make_worker
(pid=10029)     worker = cls(
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py", line 720, in __init__
(pid=10029)     self.input_reader: InputReader = input_creator(self.io_context)
(pid=10029)   File "/home/user/aurora-rl/src/rllib_drones/server.py", line 37, in _input
(pid=10029)     return PolicyServerInput(
(pid=10029)   File "/home/user/miniconda/lib/python3.9/site-packages/ray/rllib/env/policy_server_input.py", line 92, in __init__
(pid=10029)     HTTPServer.__init__(self, (address, port), handler)
(pid=10029)   File "/home/user/miniconda/lib/python3.9/socketserver.py", line 452, in __init__
(pid=10029)     self.server_bind()
(pid=10029)   File "/home/user/miniconda/lib/python3.9/http/server.py", line 138, in server_bind
(pid=10029)     socketserver.TCPServer.server_bind(self)
(pid=10029)   File "/home/user/miniconda/lib/python3.9/socketserver.py", line 466, in server_bind
(pid=10029)     self.socket.bind(self.server_address)
(pid=10029) OSError: [Errno 98] Address already in use
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/user/miniconda/lib/python3.9/site-packages/clearml/binding/hydra_bind.py", line 146, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "/home/user/aurora-rl/src/rllib_drones/server.py", line 94, in main
    analysis = tune.run(
  File "/home/user/miniconda/lib/python3.9/site-packages/ray/tune/tune.py", line 611, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO_None_a8cae_00000])

Hi @awarebayes,

Welcome to the forum! This error looks unrelated to RLlib. It seems to occur because port 9900, which the server is supposed to use, is already in use on your machine, probably by processes that errored out earlier. You can run sudo fuser 9900/tcp to check whether anything is occupying the port. If something is and your server is not running, it is probably a leftover from an older crashed process. You can then kill it with the -k option: sudo fuser -k 9900/tcp.
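
If you would rather check from Python before launching the server, here is a minimal sketch using only the standard library. It attempts the same kind of socket bind that fails at the bottom of your traceback. The function name port_in_use and the default host 127.0.0.1 are my own assumptions for illustration and are not part of RLlib; adjust the host to whatever address your server binds to.

```python
import errno
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if (host, port) cannot be bound because it is already taken.

    Hypothetical helper for illustration; not part of RLlib.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Same operation that raises "[Errno 98] Address already in use".
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            # Any other failure (e.g. permission denied) is a different problem.
            raise
        return False


if __name__ == "__main__":
    # 9900 is the port mentioned above; change it if your server uses another.
    print("port 9900 in use:", port_in_use(9900))
```

If this reports the port as taken while no server is supposed to be running, the fuser commands above will show and kill the process holding it.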

Hope this helps.
