Get_objects of worker.py timeout

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

hi,

I am using a rollout worker to evaluate a checkpointed APEX policy.

The rollout worker is created by a custom trainable class.

The trainable class is run by the tune.run_experiment helper.
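
Roughly, the setup looks like this (a simplified sketch rather than my exact code; the class name, the apex_config key, and the values below are placeholders):

from ray import tune
from ray.tune.registry import get_trainable_cls

class EvalTrainable(tune.Trainable):
    def setup(self, config):
        # Build the APEX trainer that owns the rollout workers.
        trainer_cls = get_trainable_cls("APEX")
        self.agent = trainer_cls(config=config["apex_config"], env="custom_env")

    def step(self):
        # One evaluation step per Tune iteration (placeholder metric).
        return {"episode_reward_mean": 0.0}

tune.run(EvalTrainable, config={"apex_config": {"num_workers": 2}})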

Before updating to the 3.0.0dev wheel and merging the latest commit on Ray's master, everything worked fine. Now that I have, my program hangs.

The code responsible for this is:

from ray.tune.registry import get_trainable_cls

cls = get_trainable_cls("APEX")
agent = cls(config=config, env="custom_env")

When I dig into the Ray codebase, I see that it actually hangs in worker.py:

def get_objects(self, object_refs, timeout=None):
    # some code
    # ...
    # it hangs here:
    data_metadata_pairs = self.core_worker.get_objects(
        object_refs, self.current_task_id, timeout_ms
    )

If I set timeout_ms myself, I get a timeout exception.
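
For reference, the same behavior can be reproduced from the public API rather than the internal call: a timeout on ray.get turns an indefinite hang into an exception. (Illustrative only; the task below is just an example, not the actual APEX worker refs.)

import time
import ray
from ray.exceptions import GetTimeoutError

ray.init()

@ray.remote
def never_finishes():
    # Simulate an object that will never become available.
    time.sleep(10_000)
    return 1

ref = never_finishes.remote()
try:
    ray.get(ref, timeout=5)  # seconds
except GetTimeoutError:
    print("ray.get timed out instead of hanging forever")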

The object_refs seem to be references to the APEX workers I used to train the checkpointed policy.

When running with local_mode=True, I don't have the issue, so that is my workaround for now. Any ideas?
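
For anyone hitting the same thing, the workaround is just initializing Ray in local mode:

import ray

# Workaround: local mode runs everything in a single process, so remote
# calls execute inline and there is nothing for get_objects to block on.
ray.init(local_mode=True)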

Hey, would you be able to provide a repro for this issue?

I think the problem was that I did not have enough resources (CPUs) to run all the workers I wanted. I had mixed up the workers of the checkpointed policy with the workers of my custom trainable class.
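
In hindsight, something like this would have shown the mismatch between what the cluster offers and what the trainable plus the APEX rollout workers were asking for (the num_workers value below is just an example):

import ray

ray.init()

# A trainable that builds an APEX trainer needs CPUs for itself *and* for
# every rollout worker it spawns, so the two resource demands add up.
print(ray.cluster_resources())    # total resources, e.g. {'CPU': 8.0, ...}
print(ray.available_resources())  # what is still free after placing actors

# Keeping the worker count within the free CPUs avoided the hang for me.
config = {"num_workers": 2}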

It will be hard to reproduce; I've changed a lot of things since then and no longer have the issue.

That seems reasonable, glad you were able to get unblocked here!

If you ever do run into this again, don’t hesitate to make another post - I’d be very interested in knowing if this happens without a good warning message indicating that there’s a resource hang.

As another minor note, this particular issue might be better suited for the RLlib - Ray category!
