Please help me understand how to triage failures in Ray AIR

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’ve got a reasonable amount of experience with RL, RLlib, Gym, etc., but I’m running into an issue transitioning to the AIR framework.

I have a testing suite set up in a containerized environment; I’m using the following package versions:

    ray[default]==2.3.0  # Now at 2.2.0
    ray[tune]==2.3.0
    ray[rllib]==2.3.0
    ray[air]==2.3.0
    torch>=1.13.1
    # This lets us investigate ppo training during development
    tensorboard>=2.12.0
    # This is required even though we're using Torch.
    tensorflow_probability>=0.19.0
    xgboost>=1.7.4
    xgboost_ray>=0.1.15
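
A quick way to sanity-check that the container actually has these versions at runtime (nothing here beyond the standard `__version__` attributes):

    import ray
    import torch
    import xgboost

    # Confirm the runtime matches the pins above.
    print(ray.__version__)      # expect 2.3.0
    print(torch.__version__)    # expect >= 1.13.1
    print(xgboost.__version__)  # expect >= 1.7.4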

My workflow creates data (validated, tests pass), fits an XGBoost model to the data (validated, tests pass), then spins up the RL experiment (nonfunctional).

The offending code:

import os

import pytest
from ray import air, tune
from ray.rllib.utils.test_utils import check_learning_achieved
from ray.tune.registry import get_trainable_cls, register_env

# CompetitionModel and the pytest fixtures come from project-local test helpers (not shown).

####################################################################
#                       Test RL Environments                       #
# ---------------------------------------------------------------- #
# Test the reinforcement learning environments.                    #
####################################################################
@pytest.mark.usefixtures("environment", "logged_model", "raycluster", "tempfolder")
@pytest.mark.parametrize(
    "simulator",
    [("Sim1", 'model'), ("Sim2", None)],
    ids=["Sim1", "Sim2"]
)
def test_rllib_on_env(environment, simulator, logged_model, metadata, raycluster, tempfolder):
    #############################
    # Prepare data and simulator#
    #############################
    train_dataset = metadata.get_dataset("TRAIN_DATASET")
    model_folder = os.path.join(tempfolder, 'rllibtest', 'model')
    simulator, sim_model = simulator
    env_kwargs = {}
    if sim_model is not None:
        # Create a folder to dump things into.
        # Let's redefine the model.
        # This doesn't actually fit anything, it just reuses the
        #   structure!
        env_kwargs['model'] = CompetitionModel(
            training_dataset=metadata.get_dataset("TRAIN_DATASET"),
            testing_dataset=metadata.get_dataset("TEST_DATASET"),
            target_name="TARGET",
            artifact_dir=os.path.join(tempfolder, "rayresults"),
            random_state=10,
        )
        # Go get the artifacts *for this guy*
        metadata.get_artifacts(logged_model, model_folder)
        # And spin him back up
        env_kwargs['model'].fit(restore_dir=os.path.join(model_folder, 'artifacts'))
    ############################
    # Register the Environment #
    ############################
    env_name, env_class = environment
    register_env(env_name, lambda config: env_class(**config))
    #############################
    # Register the Custom Model #
    #############################
    # ModelCatalog.register_custom_model("my_model", TorchCustomModel)
    #########################################
    # Declare an RL Algorithm Configuration #
    #########################################
    _env_config = {
        "dataset": train_dataset,
        "target_column": "TARGET",
        "current_premium_column": "FEATURE_1",
        "indicated_premium_column": "FEATURE_3",
        "simulator": simulator,
        **env_kwargs
    }
    
    config = (
        get_trainable_cls("PPO")
        .get_default_config()
        .environment(
            env_name,
            env_config=_env_config,
        )
        .framework("torch")
        .rollouts(num_rollout_workers=1)
        # .training(
        #     num_sgd_iter = 2,
        #     model={
        #         "custom_model": "my_model",
        #         "vf_share_layers": True,
        #     },
        # )
    )

    ########################################
    # Declare Experiment Stopping Criteria #
    ########################################
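    # Tune treats these as OR'd criteria: the trial stops as soon as any one is met.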
    stop = {
        "training_iteration": 1,
        "timesteps_total": 10,
        "episode_reward_mean": 100,
    }
    ppo_folder = os.path.join(tempfolder, 'rllibtest', simulator, 'ppo')
    os.makedirs(ppo_folder, exist_ok=True)
    tuner = tune.Tuner(
        "PPO",
        param_space=config.to_dict(),
        run_config=air.RunConfig(
            name='ppotest',
            local_dir=ppo_folder,
            stop=stop,
            verbose=2,
            log_to_file=True,
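            # max_failures=0 disables trial retries; fail_fast=True aborts the run on the first error.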
            failure_config=air.FailureConfig(
                max_failures=0,
                fail_fast=True
            )
        )
    )
    results = tuner.fit()
    print("Checking if learning goals were achieved")
    print(check_learning_achieved(results, 100))
    raise Exception("WE BE DONE!", config.to_dict())

This runs (apparently endlessly) and logs no events to TensorBoard.

Is there a way to force the framework to dump more verbose logs than the ones I’m already collecting? I’m trying to troubleshoot what’s going on inside the PPO training algorithm that causes it to die silently.
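
For reference, the logging knobs I know about look roughly like this; I’m assuming the Ray 2.3 spellings of `ray.init(logging_level=...)`, `AlgorithmConfig.debugging(...)`, and the `RAY_BACKEND_LOG_LEVEL` environment variable, so please point out anything beyond these:

    import logging

    import ray

    # Raise the Python-side log level and forward worker stdout/stderr to the driver.
    ray.init(logging_level=logging.DEBUG, log_to_driver=True)

    # Raise RLlib's own logger via the AlgorithmConfig builder.
    config = config.debugging(log_level="DEBUG")

    # The C++ core worker / raylet logs (like the excerpt below) are controlled by an
    # environment variable that has to be set before the cluster starts:
    #   RAY_BACKEND_LOG_LEVEL=debug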

As a follow-up, I’ve been able to identify that the worker is endlessly dying and being recreated; here is an example log.

--- Log has been truncated to last 1000 lines. Use `--tail` flag to toggle. Set to -1 for getting the entire file. ---

[2023-03-28 17:39:42,739 I 19607 19607] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 19607
[2023-03-28 17:39:42,743 I 19607 19607] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-03-28 17:39:42,745 I 19607 19607] grpc_server.cc:140: worker server started, listening on port 35063.
[2023-03-28 17:39:42,749 I 19607 19607] core_worker.cc:215: Initializing worker at address: 172.17.0.3:35063, worker ID 417207edc1e34081084fa83d208fbf78e68b2d1215a59fba06dc7ac4, raylet 039ee9c951659975bc38c1ef984ef7afca492a7a402e0d5765de16db
[2023-03-28 17:39:42,750 I 19607 19607] task_event_buffer.cc:149: Reporting task events to GCS every 1000ms.
[2023-03-28 17:39:42,751 I 19607 19607] core_worker.cc:602: Adjusted worker niceness to 15
[2023-03-28 17:39:42,751 I 19607 19641] core_worker.cc:550: Event stats:


Global stats: 12 total (8 active)
Queueing time: mean = 12.575 us, max = 101.700 us, min = 1.700 us, total = 150.900 us
Execution time:  mean = 13.258 us, total = 159.100 us
Event stats:
	PeriodicalRunner.RunFnPeriodically - 6 total (4 active, 1 running), CPU time: mean = 4.833 us, total = 29.000 us
	UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 15.300 us, total = 15.300 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 114.800 us, total = 114.800 us

-----------------
Task Event stats:

IO Service Stats:

Global stats: 2 total (1 active)
Queueing time: mean = 6.050 us, max = 12.100 us, min = 12.100 us, total = 12.100 us
Execution time:  mean = 3.550 us, total = 7.100 us
Event stats:
	PeriodicalRunner.RunFnPeriodically - 1 total (0 active), CPU time: mean = 7.100 us, total = 7.100 us
	CoreWorker.deadline_timer.flush_task_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
Other Stats:
	grpc_in_progress:0
	current number of task events in buffer: 0
	total task events sent: 0 MiB
	total number of task events sent: 0
	num status task events dropped: 0
	num profile task events dropped: 0


[2023-03-28 17:39:42,751 I 19607 19641] accessor.cc:590: Received notification for node id = 039ee9c951659975bc38c1ef984ef7afca492a7a402e0d5765de16db, IsAlive = 1
[2023-03-28 17:39:42,751 I 19607 19641] core_worker.cc:3847: Number of alive nodes:1
[2023-03-28 17:39:42,754 I 19607 19607] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 477351fa1366095427497ca701000000
[2023-03-28 17:39:42,754 I 19607 19607] direct_actor_task_submitter.cc:237: Connecting to actor 477351fa1366095427497ca701000000 at worker 417207edc1e34081084fa83d208fbf78e68b2d1215a59fba06dc7ac4
[2023-03-28 17:39:42,754 I 19607 19607] core_worker.cc:2548: Creating actor: 477351fa1366095427497ca701000000
[2023-03-28 17:39:43,032 I 19607 19607] direct_actor_transport.cc:180: Actor creation task finished, task_id: ffffffffffffffff477351fa1366095427497ca701000000, actor_id: 477351fa1366095427497ca701000000
[2023-03-28 17:39:43,845 I 19607 19641] core_worker.cc:3406: Force kill actor request has received. exiting immediately... The actor is dead because it was killed by `ray.kill`.
[2023-03-28 17:39:43,845 W 19607 19641] core_worker.cc:778: Force exit the process.  Details: Worker exits because the actor is killed. The actor is dead because it was killed by `ray.kill`.
[2023-03-28 17:39:43,845 I 19607 19641] core_worker.cc:698: Disconnecting to the raylet.
[2023-03-28 17:39:43,845 I 19607 19641] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_SYSTEM_EXIT, exit_detail=Worker exits because the actor is killed. The actor is dead because it was killed by `ray.kill`., has creation_task_exception_pb_bytes=0