Deployment - Stuck on compute action

I have been having some trouble when deploying my agent to production. My aim is to build an API that I can call from another source. My code is:

from datetime import datetime

import numpy as np
import ray
from flask import Flask, request
from flask_restful import Api, Resource
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env

from batt_env import BatteryEnv


print(f"{datetime.now()}: about to init ray", flush=True)
ray.init(ignore_reinit_error=True, log_to_driver=False)

app = Flask(__name__)
API = Api(app)

print(f"{datetime.now()}: about to load stacked", flush=True)

data = np.load('/home/carterb/mysite/stacked.npy')
data = data/8000

batt = BatteryEnv(data, power=20, capacity=20, initial_charge=0, bleed=0.1,
                  starting_temperature=23, temp_change=1, cooldown_rate=1, efficiency=1.0, cycle_cost=0)

print(f"{datetime.now()}: about to do ", flush=True)

def env_creator(env_config):
    return BatteryEnv(data, power=20, capacity=20, initial_charge=0, bleed=0.1, starting_temperature=23, temp_change=1, cooldown_rate=1, efficiency=1.0, cycle_cost=0)

print(f"{datetime.now()}: about to do create env ", flush=True)

register_env("battery", env_creator)

config = {
    "env": "battery",
    "num_workers": 1,
    'explore': False,
    'log_level': 'DEBUG'
}
print(f"{datetime.now()}: about to do restore agent ", flush=True)

trained_trainer = PPOTrainer(config, 'battery')

trained_trainer.restore("/home/carterb/mysite/checkpoint-12000")

print(f"{datetime.now()}: about to do gen predict class ", flush=True)

class Predict(Resource):

    @staticmethod
    def post():
        print(f"{datetime.now()}: about to accept input ", flush=True)

        # convert the JSON payload into an observation array
        input_data = request.json
        arr = np.array(input_data['obs'])

        print(f"{datetime.now()}: about to do compute action class ", flush=True)

        action = trained_trainer.compute_action(arr)
        print(f"{datetime.now()}: about to return action ", flush=True)

        return {'Action': int(action)}, 200

print(f"{datetime.now()}: about to do add resource  ", flush=True)

API.add_resource(Predict, '/predict', methods=['POST'])
print(f"{datetime.now()}: about to do resource added ", flush=True)



print(f"{datetime.now()}: trying to run app ", flush=True)

I can see through my print statements that the app gets set up fine; however, when I pass data, it gets stuck directly on the trainer.compute_action call. I am trying to host on PythonAnywhere and have been having trouble despite their great help. Is there any reason anyone can see for my agent failing to compute_action? Or is there a more suitable way to set up an API using Ray?

So it looks like the problem is that my compute_action call takes more than 10 minutes to complete, whereas on my local machine (before deployment) the whole process takes seconds. Does anyone know the right way to go about fixing this?
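One thing worth trying (just a sketch, assuming the hang comes from RLlib trying to launch a remote rollout worker on the constrained host): restore the trainer with a serving-only config that keeps everything in the driver process. The keys below are standard RLlib options; paths and the env registration are as in the script above.

serving_config = {
    "env": "battery",
    "num_workers": 0,   # don't spawn remote rollout workers just to serve actions
    "num_gpus": 0,      # assuming the web host has no GPU
    "explore": False,
}

trained_trainer = PPOTrainer(config=serving_config, env="battery")
trained_trainer.restore("/home/carterb/mysite/checkpoint-12000")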

Hey @Carterbouley, so Trainer.compute_action does succeed, but is just really slow? That seems more like a networking/Flask API issue.
Other questions are: How large is your model (number of parameters)?
What hardware are you running on when these delays happen?

Hi @sven1977, no, the compute action never succeeds. Something stops it (the print statement before it prints, but the one after it doesn't).

my agent config is:

 {'num_workers': 13,
 'num_envs_per_worker': 1,
 'rollout_fragment_length': 200,
 'sample_batch_size': -1,
 'batch_mode': 'truncate_episodes',
 'num_gpus': 1,
 'train_batch_size': 1024,
 'model': {'conv_filters': None,
  'conv_activation': 'relu',
  'fcnet_activation': 'tanh',
  'fcnet_hiddens': [256, 256],
  'free_log_std': False,
  'no_final_linear': False,
  'vf_share_layers': True,
  'use_lstm': False,
  'max_seq_len': 20,
  'lstm_cell_size': 256,
  'lstm_use_prev_action_reward': False,
  'state_shape': None,
  'framestack': True,
  'dim': 84,
  'grayscale': False,
  'zero_mean': True,
  'custom_model': None,
  'custom_action_dist': None,
  'custom_options': {},
  'custom_preprocessor': None},
 'optimizer': {},
 'gamma': 0.99,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'env_config': {},
 'env': 'battery',
 'normalize_actions': False,
 'clip_rewards': None,
 'clip_actions': True,
 'preprocessor_pref': 'deepmind',
 'lr': 5e-05,
 'monitor': False,
 'log_level': 'WARN',
 'callbacks': {'on_episode_start': None,
  'on_episode_step': None,
  'on_episode_end': None,
  'on_sample_end': None,
  'on_train_result': None,
  'on_postprocess_traj': None},
 'ignore_worker_failures': False,
 'log_sys_usage': True,
 'use_pytorch': False,
 'eager': False,
 'eager_tracing': False,
 'no_eager_on_workers': False,
 'explore': True,
 'exploration_config': {'type': 'StochasticSampling'},
 'evaluation_interval': None,
 'evaluation_num_episodes': 10,
 'in_evaluation': False,
 'evaluation_config': {},
 'evaluation_num_workers': 0,
 'custom_eval_function': None,
 'use_exec_api': False,
 'sample_async': False,
 'observation_filter': 'NoFilter',
 'synchronize_filters': True,
 'tf_session_args': {'intra_op_parallelism_threads': 2,
  'inter_op_parallelism_threads': 2,
  'gpu_options': {'allow_growth': True},
  'log_device_placement': False,
  'device_count': {'CPU': 1},
  'allow_soft_placement': True},
 'local_tf_session_args': {'intra_op_parallelism_threads': 8,
  'inter_op_parallelism_threads': 8},
 'compress_observations': False,
 'collect_metrics_timeout': 180,
 'metrics_smoothing_episodes': 100,
 'remote_worker_envs': False,
 'remote_env_batch_wait_ms': 0,
 'min_iter_time_s': 0,
 'timesteps_per_iteration': 0,
 'seed': None,
 'num_cpus_per_worker': 1,
 'num_gpus_per_worker': 0,
 'custom_resources_per_worker': {},
 'num_cpus_for_driver': 1,
 'memory': 0,
 'object_store_memory': 0,
 'memory_per_worker': 0,
 'object_store_memory_per_worker': 0,
 'input': 'sampler',
 'input_evaluation': ['is', 'wis'],
 'postprocess_inputs': False,
 'shuffle_buffer_size': 0,
 'output': None,
 'output_compress_columns': ['obs', 'new_obs'],
 'output_max_file_size': 67108864,
 'multiagent': {'policies': {},
  'policy_mapping_fn': None,
  'policies_to_train': None},
 'use_critic': True,
 'use_gae': True,
 'lambda': 1.0,
 'kl_coeff': 0.2,
 'sgd_minibatch_size': 128,
 'shuffle_sequences': True,
 'num_sgd_iter': 20,
 'lr_schedule': None,
 'vf_share_layers': False,
 'vf_loss_coeff': 1.0,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'clip_param': 0.3,
 'vf_clip_param': 10.0,
 'grad_clip': None,
 'kl_target': 0.01,
 'simple_optimizer': False}

So not particularly large.

I have the Web Developer account on PythonAnywhere (see their Plans and pricing page), which doesn't show the hardware but does list some limits around it (4,000 CPU-seconds per day, 5 GB disk space).

compute_action takes seconds on my local machine, but exceeds the 10-minute limit on this server, leading to a timeout.
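A quick, hedged debugging step for a constrained host like this: print what resources Ray actually detected right after ray.init(), using standard Ray calls.

print(ray.cluster_resources())     # total CPUs/GPUs/memory Ray detected on the host
print(ray.available_resources())   # what is still free once the trainer is restored

If the host only exposes a CPU or two, that at least rules resource starvation in or out as the cause of the hang.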

Hmm, very strange. Sorry, it’s hard to remote-diagnose the problem w/o the chance to reproduce this on my end. Would you be able to debug this on the remote machine? Like try to find out where exactly in the RLlib code it hangs?
Trainer.compute_action calls Policy.compute_action, which calls TFPolicy.compute_action_from_input_dict, … which calls the ModelV2 to get the distribution outputs, samples from the distribution, and returns the sampled action.
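To narrow down which of those layers hangs, one option (a sketch, assuming a single default policy, no LSTM state, and that the observation already matches the env's observation space) is to bypass Trainer.compute_action in the Flask handler and call the policy directly:

policy = trained_trainer.get_policy()                       # default policy
action, _, _ = policy.compute_single_action(arr, state=[])  # returns (action, state_out, info)
print(f"{datetime.now()}: policy returned {action}", flush=True)

Note that this skips the preprocessing and observation filtering that Trainer.compute_action applies, so it only helps localize the hang; it is not a drop-in replacement.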

Hi @sven1977. Do you know which of the Ray logs the hanging code would most likely show up in? Then I could show you the output of that.

Hi @sven1977. Do you perhaps know of a full end-to-end project where a Ray agent is served remotely and hosted on a web server via an API? The Ray Serve examples only seem to show local deployment, and setting this up on a web server is new to me. Having an example project that shows the steps required to do this with a pre-trained agent would be extremely helpful.
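For reference, a rough sketch of what such a deployment could look like with the decorator-based Ray Serve API (this is the newer, roughly Ray 1.x API, untested on a shared web host; the deployment name BatteryPolicy is made up here, and the paths, env arguments, and checkpoint are taken from the script above):

import json
import numpy as np
import ray
from ray import serve
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env
from batt_env import BatteryEnv

ray.init()
serve.start()

@serve.deployment(route_prefix="/predict")
class BatteryPolicy:
    def __init__(self):
        # load data and register the env inside the replica process
        data = np.load('/home/carterb/mysite/stacked.npy') / 8000
        register_env("battery", lambda env_config: BatteryEnv(
            data, power=20, capacity=20, initial_charge=0, bleed=0.1,
            starting_temperature=23, temp_change=1, cooldown_rate=1,
            efficiency=1.0, cycle_cost=0))
        # restore the trained agent once per replica, driver-only (no rollout workers)
        self.trainer = PPOTrainer(
            config={"env": "battery", "num_workers": 0, "explore": False},
            env="battery")
        self.trainer.restore("/home/carterb/mysite/checkpoint-12000")

    async def __call__(self, request):
        obs = np.array((await request.json())["obs"])
        action = self.trainer.compute_action(obs)
        return json.dumps({"Action": int(action)})

BatteryPolicy.deploy()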

Hey @Carterbouley, no, sorry, we haven't done much research in this direction ourselves. Testing and finding out the limits of our external Env API is on our short list of important projects to finish in the near future, though, probably in Q3.
On debugging: could you simply try to print something inside the different methods it's calling for computing actions? Like Trainer.compute_action, Policy.compute_single_action, Policy.compute_actions, ModelV2.__call__. Maybe print out the absolute times? That way we should be able to see where the delay happens.
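A minimal timing wrapper on the Flask side along those lines (just a sketch; prints inside the RLlib methods themselves would still need to be added by hand in the installed package):

import time

t0 = time.time()
print(f"{datetime.now()}: calling compute_action", flush=True)
action = trained_trainer.compute_action(arr)
print(f"{datetime.now()}: compute_action took {time.time() - t0:.3f}s", flush=True)

If even the first print never shows up in the server log, the request is not reaching the handler at all and the problem is in the web layer rather than in RLlib.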

Hi all, any news on this topic?
I'm creating the trainer in a separate custom process (on the same node where Ray had been initialized) and it also gets stuck on compute_action().
I'm running 1.11.0 and just creating a local worker. Anything I could try?

Thanks

Hello @sven1977, I am facing the same issue, but on my local device. I tried adding some print statements inside the different methods it calls; my code gets stuck at compute_action, and the statement I put right after it never prints, so either compute_action is not running at all or it is hanging inside.

PS: Never mind, the problem was somewhere else; my debugger was the one hanging. :slight_smile:
