Get_objects of worker.py timeout

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

hi,

I am using a rollout worker to evaluate a checkpointed APEX policy.

The rollout worker is created by a custom trainable class.

The trainable class is run by the tune.run_experiment helper.
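
Roughly, the setup looks like this (a simplified sketch rather than my exact code; the class name, the apex_config key, and the values below are placeholders):

from ray import tune
from ray.tune.registry import get_trainable_cls

class EvalTrainable(tune.Trainable):
    def setup(self, config):
        # Build the APEX trainer that owns the rollout workers.
        trainer_cls = get_trainable_cls("APEX")
        self.agent = trainer_cls(config=config["apex_config"], env="custom_env")

    def step(self):
        # One evaluation step per Tune iteration (placeholder metric).
        return {"episode_reward_mean": 0.0}

tune.run(EvalTrainable, config={"apex_config": {"num_workers": 2}})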

Before updating to the 3.0.0dev wheel and merging the latest commit on Ray's master, everything worked fine. Now that I have, my program hangs.

The code responsible for this is:

from ray.tune.registry import get_trainable_cls

cls = get_trainable_cls("APEX")
agent = cls(config=config, env="custom_env")

When I dig into the Ray codebase, I see that it actually hangs in worker.py:

def get_objects(self, object_refs, timeout=None):
    # some code
    # ...
    # it hangs here:
    data_metadata_pairs = self.core_worker.get_objects(
        object_refs, self.current_task_id, timeout_ms
    )

If I set timeout_ms myself, I get a timeout exception.
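
For reference, the same behavior can be reproduced from the public API rather than the internal call: a timeout on ray.get turns an indefinite hang into an exception. (Illustrative only; the task below is just an example, not the actual APEX worker refs.)

import time
import ray
from ray.exceptions import GetTimeoutError

ray.init()

@ray.remote
def never_finishes():
    # Simulate an object that will never become available.
    time.sleep(10_000)
    return 1

ref = never_finishes.remote()
try:
    ray.get(ref, timeout=5)  # seconds
except GetTimeoutError:
    print("ray.get timed out instead of hanging forever")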

The object_refs seem to be references to the APEX workers I used to train the checkpointed policy.

When running with local_mode=True, I don't have the issue, so that is my workaround for now. Any ideas?
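
For anyone hitting the same thing, the workaround is just initializing Ray in local mode:

import ray

# Workaround: local mode runs everything in a single process, so remote
# calls execute inline and there is nothing for get_objects to block on.
ray.init(local_mode=True)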

Hey, would you be able to provide a repro for this issue?

I think the problem was that I did not have enough resources (CPUs) to run all the workers I wanted. I had mixed up the workers of the checkpointed policy with the workers of my custom trainable class.
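
In hindsight, something like this would have shown the mismatch between what the cluster offers and what the trainable plus the APEX rollout workers were asking for (the num_workers value below is just an example):

import ray

ray.init()

# A trainable that builds an APEX trainer needs CPUs for itself *and* for
# every rollout worker it spawns, so the two resource demands add up.
print(ray.cluster_resources())    # total resources, e.g. {'CPU': 8.0, ...}
print(ray.available_resources())  # what is still free after placing actors

# Keeping the worker count within the free CPUs avoided the hang for me.
config = {"num_workers": 2}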

It will be hard to reproduce; I've changed a lot of things since then and no longer have the issue.

That seems reasonable, glad you were able to get unblocked here!

If you ever do run into this again, don’t hesitate to make another post - I’d be very interested in knowing if this happens without a good warning message indicating that there’s a resource hang.

As another minor note, this particular issue might be better suited for the RLlib - Ray category!
