How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hi,
I am using a rollout worker to evaluate a checkpointed APEX policy. The rollout worker is created by a custom trainable class, and the trainable class is run by the `tune.run_experiments` helper.
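Roughly, the setup looks like the sketch below (simplified; the class name, config keys, and checkpoint path are placeholders, and `evaluate()` stands in for my rollout-worker evaluation):

```python
import ray
from ray import tune
from ray.tune import Trainable
from ray.tune.registry import get_trainable_cls


class ApexEvalTrainable(Trainable):
    """Custom trainable that restores a checkpointed APEX policy."""

    def setup(self, config):
        cls = get_trainable_cls("APEX")
        # This is the constructor call that now hangs (see below).
        self.agent = cls(config=config["apex_config"], env="custom_env")
        self.agent.restore(config["checkpoint_path"])

    def step(self):
        # Evaluate the restored policy for one iteration.
        return self.agent.evaluate()


ray.init()
tune.run_experiments({
    "apex_eval": {
        "run": ApexEvalTrainable,
        "stop": {"training_iteration": 1},
        "config": {
            "apex_config": {"num_workers": 2},         # placeholder APEX config
            "checkpoint_path": "/path/to/checkpoint",  # placeholder path
        },
    },
})
```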
Everything worked fine before I updated to the 3.0.0dev wheel and merged the latest commit on Ray's master. Now that I have, my program hangs.
The code responsible for this is:
```python
from ray.tune.registry import get_trainable_cls

cls = get_trainable_cls("APEX")
agent = cls(config=config, env="custom_env")
```
When I dig a bit into the Ray codebase, I see it actually hangs in `worker.py`:
```python
def get_objects(self, object_refs, timeout=None):
    # some code
    # ....
    # it hangs here:
    data_metadata_pairs = self.core_worker.get_objects(
        object_refs, self.current_task_id, timeout_ms
    )
```
If I set `timeout_ms` myself, I get a timeout exception.
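For what it's worth, the same behaviour can be reproduced at the public API level: fetching a reference that never resolves with a `timeout` raises `GetTimeoutError` instead of hanging (a minimal sketch, not my actual code):

```python
import time

import ray
from ray.exceptions import GetTimeoutError

ray.init()


@ray.remote
def never_finishes():
    # Simulates an object that is never produced.
    time.sleep(3600)


ref = never_finishes.remote()
try:
    # Same effect as forcing timeout_ms inside worker.py: the get
    # gives up instead of blocking forever.
    ray.get(ref, timeout=5)
except GetTimeoutError:
    print("timed out waiting for the object")
```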
The `object_refs` seem to be references to the APEX workers I used to train the checkpointed policy.
When running with `local_mode=True` I don't have the issue, so that is my workaround for now (shown below). Any ideas?
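For completeness, the workaround is just to start Ray in local mode before creating the trainable:

```python
import ray

# Workaround: run everything in a single process so no remote object
# fetches are needed; the constructor no longer hangs.
ray.init(local_mode=True)
```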