How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I have a client-server training setup for a simple game: a single client, with the client and server running on the same machine. For about two hours everything chugs along smoothly, with the client and server exchanging data and the trainer training the agent. The output from the server looks like this:
2022-06-27 14:05:25,047 INFO policy_server_input.py:200 -- Sending worker weights to client.
127.0.0.1 - - [27/Jun/2022 14:05:25] "POST / HTTP/1.1" 200 -
2022-06-27 14:05:35,091 INFO policy_server_input.py:200 -- Sending worker weights to client.
127.0.0.1 - - [27/Jun/2022 14:05:35] "POST / HTTP/1.1" 200 -
2022-06-27 14:05:41,121 INFO policy_server_input.py:206 -- Got sample batch of size 20 from client.
127.0.0.1 - - [27/Jun/2022 14:05:41] "POST / HTTP/1.1" 200 -
2022-06-27 14:05:45,140 INFO policy_server_input.py:200 -- Sending worker weights to client.
127.0.0.1 - - [27/Jun/2022 14:05:45] "POST / HTTP/1.1" 200 -
2022-06-27 14:05:55,182 INFO policy_server_input.py:200 -- Sending worker weights to client.
127.0.0.1 - - [27/Jun/2022 14:05:55] "POST / HTTP/1.1" 200 -
2022-06-27 14:06:01,212 INFO policy_server_input.py:206 -- Got sample batch of size 20 from client.
127.0.0.1 - - [27/Jun/2022 14:06:01] "POST / HTTP/1.1" 200 -
2022-06-27 14:06:05,231 INFO policy_server_input.py:200 -- Sending worker weights to client.
127.0.0.1 - - [27/Jun/2022 14:06:05] "POST / HTTP/1.1" 200 -
2022-06-27 14:06:15,274 INFO policy_server_input.py:200 -- Sending worker weights to client.
127.0.0.1 - - [27/Jun/2022 14:06:15] "POST / HTTP/1.1" 200 -
2022-06-27 14:06:21,306 INFO policy_server_input.py:206 -- Got sample batch of size 20 from client.
127.0.0.1 - - [27/Jun/2022 14:06:21] "POST / HTTP/1.1" 200 -
agent_timesteps_total: 7800
custom_metrics: {}
date: 2022-06-27_14-06-21
done: false
episode_len_mean: .nan
episode_media: {}
episode_reward_max: .nan
episode_reward_mean: .nan
episode_reward_min: .nan
episodes_this_iter: 0
episodes_total: 0
experiment_id: 80696cf9082541d5b48af06c816b5995
hostname: pascal6
info:
  learner:
    maze:
      custom_metrics: {}
      learner_stats:
        cur_lr: 9.999999747378752e-05
        entropy_coeff: 0.009999999776482582
        grad_gnorm: 11.02481460571289
        model: {}
        policy_entropy: 307.48297119140625
        policy_loss: -2.469358205795288
        var_gnorm: 22.7189884185791
        vf_explained_var: -0.4804614782333374
        vf_loss: 0.0022206767462193966
      num_agent_steps_trained: 50.0
  num_agent_steps_sampled: 7800
  num_agent_steps_trained: 7800
  num_steps_sampled: 7800
  num_steps_trained: 7800
  num_steps_trained_this_iter: 60
iterations_since_restore: 130
node_ip: 192.168.128.6
num_healthy_workers: 0
off_policy_estimator: {}
perf:
  cpu_util_percent: 0.016279069767441864
  ram_util_percent: 2.9
pid: 29528
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf: {}
time_since_restore: 7787.611072063446
time_this_iter_s: 59.59892225265503
time_total_s: 7787.611072063446
timers:
  learn_throughput: 17040.434
  learn_time_ms: 3.521
  load_throughput: 379231.826
  load_time_ms: 0.158
  sample_throughput: 0.997
  sample_time_ms: 60182.14
timestamp: 1656363981
timesteps_since_restore: 7800
timesteps_this_iter: 60
timesteps_total: 7800
training_iteration: 130
trial_id: default
warmup_time: 2.693452835083008
and from the client:
...
2022-06-27 14:06:05,233 INFO policy_client.py:279 -- Updating rollout worker weights and global vars {'timestep': 7740}.
2022-06-27 14:06:15,270 INFO policy_client.py:269 -- Querying server for new policy weights.
2022-06-27 14:06:15,278 INFO policy_client.py:279 -- Updating rollout worker weights and global vars {'timestep': 7740}.
2022-06-27 14:06:21,302 INFO policy_client.py:304 -- Sending batch of 20 env steps (20 agent steps) to server.
2022-06-27 14:06:21,308 INFO policy_client.py:298 -- Generating new batch of experiences.
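For reference, the client side is essentially the standard PolicyClient loop in local inference mode (which is where the weight-sync messages above come from). A rough sketch of what it does follows; MazeGameEnv is a placeholder for my actual game, and I have omitted the per-agent dict handling my multi-agent env really uses:

# Rough sketch of the client loop (MazeGameEnv is a placeholder, not the real class).
from ray.rllib.env.policy_client import PolicyClient

env = MazeGameEnv()  # placeholder for the actual game environment
client = PolicyClient("http://localhost:9900", inference_mode="local")

obs = env.reset()
episode_id = client.start_episode(training_enabled=True)
while True:
    action = client.get_action(episode_id, obs)
    obs, reward, done, info = env.step(action)
    client.log_returns(episode_id, reward)
    if done:
        client.end_episode(episode_id, obs)
        obs = env.reset()
        episode_id = client.start_episode(training_enabled=True)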
Then, out of the blue, the server dies: no warning, no errors, the process just ends. The client, of course, complains that it can no longer connect and crashes with an error. From the way the server exits, it looks as though it thinks it has finished its job, but I never specified an end time for the server; I told it to run forever. Here is my server script:
import time

import numpy as np
from gym.spaces import Box

from ray.rllib.agents import a3c
from ray.rllib.env.policy_server_input import PolicyServerInput
from ray.rllib.policy.sample_batch import MultiAgentBatch
from ray.tune.logger import pretty_print

# Single policy "maze": 2-D box observation space, 2-D continuous action space.
policies = {
    'maze': (
        None,
        Box(low=np.array([0, 0]), high=np.array([2100, 2100])),
        Box(-100, 100, (2,)),
        {}
    ),
}

def policy_mapping_fn(agent_id):
    return "maze"

server_address = "localhost"
server_port = 9900

class PolicyServerInputMA(PolicyServerInput):
    '''
    For a multi-agent server, use this class or alternatively set the 'idle_timeout'
    argument to a large value.
    '''
    def _put_empty_sample_batch_every_n_sec(self):
        # The original method puts an empty SampleBatch onto the sample queue every
        # `idle_timeout` seconds to avoid hanging parallel RolloutWorkers. The problem is
        # that in a multi-agent environment those empty SampleBatch objects conflict with
        # the MultiAgentBatch objects sent by the client. Either use this class or set
        # 'idle_timeout' to a large number in multi-agent envs.
        while True:
            time.sleep(self.idle_timeout)
            self.samples_queue.put(MultiAgentBatch({}, 0))

def _input(ioctx):
    # We are a remote worker, or we are the local worker with num_workers=0:
    # create a PolicyServerInput.
    if ioctx.worker_index > 0 or ioctx.worker.num_workers == 0:
        return PolicyServerInputMA(
            ioctx,
            server_address,
            server_port,
            idle_timeout=3.0
        )
    # No InputReader (PolicyServerInput) needed.
    else:
        return None

config = {
    # Use the connector server to generate experiences.
    "input": _input,
    "env": None,
    "disable_env_checking": True,
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        # "count_steps_by": "agent_steps"
    },
    "num_gpus": 0,
    # "framework": "tf",
    "num_workers": 0,
    "log_level": "INFO",
    # Disable OPE, since the rollouts are coming from online clients.
    "input_evaluation": [],
    "train_batch_size": 50,
    "rollout_fragment_length": 20,
    # "batch_mode": "truncate_episodes",
}

trainer = a3c.a2c.A2CTrainer(config=config)

# Serving and training loop.
i = 0
# Since this is `while True`, I expect the server to run forever until I send a kill signal.
while True:
    results = trainer.train()
    trainer.export_policy_checkpoint(
        f"checkpoint{i}",
        policy_id="maze"
    )
    print(pretty_print(results))
    i += 1
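For what it's worth, my next step is to wrap the loop so that anything Python-level shows up before the process disappears. These are debugging-only additions I plan to try, not part of the script that crashed:

# Hypothetical debugging additions to the loop above (not in the original script).
import faulthandler
import traceback

faulthandler.enable()  # dump a traceback if the process dies on SIGSEGV/SIGABRT-style signals

i = 0
try:
    while True:
        results = trainer.train()
        trainer.export_policy_checkpoint(f"checkpoint{i}", policy_id="maze")
        print(pretty_print(results))
        i += 1
except BaseException:
    # A SystemExit or any other exception would show up here instead of vanishing silently.
    traceback.print_exc()
    raise
finally:
    print("Training loop exited.", flush=True)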
Is there a hidden parameter somewhere that is causing the trainer to end early?
Thanks!