Will Ray over-subscribe the bundles in a placement group?

The code in https://gist.github.com/ceteri/77b17bc21bb24d8074f2688899ed063a is a solution to the fault-tolerance issues for memory-aware scheduling of Ray actors reported in https://discuss.ray.io/t/how-to-detect-when-creating-an-actor-fails/4765, and it is also related to https://discuss.ray.io/t/using-placement-groups-while-connecting-to-ray-cluster/1400
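
For reference, the gist pre-allocates a placement group with a fixed number of bundles, each reserving some CPU and memory. A minimal sketch of that setup, reusing the n_bundles, n_cores, and mem_unit names from the gist (the values shown here are illustrative, not the gist's actual values):

    import ray
    from ray.util import placement_group

    ray.init()

    # illustrative values; the real ones are defined in the gist
    n_bundles = 3
    n_cores = 2
    mem_unit = 512 * 1024 * 1024  # 512 MiB per bundle

    # reserve n_bundles bundles, each with n_cores CPUs and mem_unit bytes of memory
    pg = placement_group([{"CPU": n_cores, "memory": mem_unit}] * n_bundles)

    # block until the cluster has actually reserved the bundles
    ray.get(pg.ready())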

However, I’ve noticed that if you:

  1. remove the placement_group_bundle_index = i constraint
  2. set n_actors to a value greater than the pre-allocated n_bundles

then Ray will happily allocate new actors in the placement group – way beyond the limits of the total resources defined for the placement group.

Should I report this on GitHub as a bug, or is it known behavior (i.e., not implemented yet)?

Many thanks to @dmatrix and @christy for their help.

Actors have zero resource requests by default (this may be changed in the future since it’s confusing). If you create actors with num_cpus=1, does this still happen?
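
To make that concrete, here is a minimal, illustrative sketch (not from the gist): an actor declared without any resource arguments has the zero resource request described above and so does not consume the placement group's reserved resources, while one created with an explicit num_cpus does count against them.

    import ray

    ray.init()

    @ray.remote
    class Probe:
        def ready(self):
            return True

    # no resource arguments: zero resource request by default, so this actor
    # does not consume any reserved placement group resources
    a = Probe.remote()

    # explicit num_cpus: this actor is scheduled against available CPU resources
    b = Probe.options(num_cpus=1).remote()

    ray.get([a.ready.remote(), b.ready.remote()])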

Yes. If you check the code in the gist, the actors are created with num_cpus = 2.

Ah, it is because of this block:

    try:
        for i in range(n_actors):
            print(i)

            # NOTE: h is rebound on every iteration and never stored anywhere else,
            # so the previous actor's only handle is dropped each time around
            h = FailingActor.options(
                placement_group = pg,
                num_cpus = n_cores,
                memory = mem_unit,
                placement_group_bundle_index = i,
            ).remote(
                name = str(i),
                size = 10000,
                fail = False, # True
            )

            print(h)

            try:
                ray.get(h.ready.remote())
            except ray.exceptions.RayActorError as e:
                print(e)
                break

        print("finished allocating actors")

    except ValueError as e:
        print("cannot allocate", i)
    except Exception as e:
        print(e)

Basically, the reference held in h is replaced on each iteration, which drops the only handle to the previous actor. Actor handles are reference counted and GC'ed once no handle remains, and that kills the actor, so you only ever have one actor alive at a time.
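
The same behavior can be reproduced outside the gist with a tiny, illustrative example: once the only handle to an actor is rebound, the handle is GC'ed and Ray terminates that actor.

    import ray

    ray.init()

    @ray.remote
    class Pinger:
        def ready(self):
            return True

    h = Pinger.remote()
    ray.get(h.ready.remote())

    # rebinding h drops the only reference to the first actor; its handle is
    # garbage collected and Ray terminates that actor, so only one actor is
    # ever alive at a time in a loop written this way
    h = Pinger.remote()
    ray.get(h.ready.remote())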

If you change your block this way:

    n_actors = 3
    print("Total actors: ", n_actors)
    hs = []  # keep every actor handle alive so the actors are not GC'ed

    try:
        for i in range(n_actors):
            print(i)

            h = FailingActor.options(
                placement_group = pg,
                num_cpus = n_cores,
                memory = mem_unit,
                placement_group_bundle_index = i,
            ).remote(
                name = str(i),
                size = 10000,
                fail = False, # True
            )
            hs.append(h)  # retaining the handle prevents the actor from being garbage collected

            print(h)

            try:
                ray.get(h.ready.remote())
            except ray.exceptions.RayActorError as e:
                print(e)
                break

        print("finished allocating actors")

    except ValueError as e:
        print("cannot allocate", i)
    except Exception as e:
        print(e)

This should work as expected.


Thank you very much @sangcho
Yes, that works now.

The placement_group_bundle_index = i param is still needed to avoid the problem of an infeasible task due to memory requests, and as you mentioned we'll wait for the InfeasibleTaskException implementation in a later release.
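
Until then, one possible workaround (a sketch, not from the gist) is to wait on pg.ready() with a timeout when creating the placement group: if the requested bundles cannot be reserved on the cluster, the ray.get call raises a GetTimeoutError instead of hanging indefinitely.

    import ray
    from ray.util import placement_group

    ray.init()

    # deliberately over-sized memory request, to illustrate an infeasible placement group
    pg = placement_group([{"CPU": 1, "memory": 10 ** 12}] * 3)

    try:
        # if the bundles cannot be reserved, this times out rather than blocking forever
        ray.get(pg.ready(), timeout=10)
    except ray.exceptions.GetTimeoutError:
        print("placement group could not be scheduled; the request is likely infeasible")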