PENDING_CREATION problem

The cluster runs on Kubernetes (k8s), and the Ray nodes run in pods. The actors created by the code below stay in PENDING_CREATION. Code:


import time

import ray


# Each Counter actor reserves 4 CPUs for its whole lifetime.
@ray.remote(num_cpus=4)
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        time.sleep(1)
        self.value += 1
        return self.value

    def get_counter(self):
        return self.value


counters = [Counter.remote() for _ in range(20)]

# Increment each Counter once and get the results. These tasks all happen in
# parallel.
results = ray.get([c.increment.remote() for c in counters])
print(results)  # prints a list of 20 ones, one per Counter: [1, 1, ..., 1]

# Increment the first Counter five times. These tasks are executed serially
# and share state.
results = ray.get([counters[0].increment.remote() for _ in range(5)])
print(results)  # prints [2, 3, 4, 5, 6]

Logs:

We’ll probably need more reproduction info to figure out what’s up and debug.
How did you set up the Ray pod? Could you share configs and software versions?

@sangcho @rickyyx to take over on this thread

I think you just don’t have enough CPUs to create 20 counters at 4 CPUs each (80 CPUs in total)? What’s the output of ray status?
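To make the arithmetic concrete, here is a rough sketch of the check I have in mind. The helper names (actors_that_fit, pending_actors) are made up for illustration and are not part of Ray. It also assumes the CPUs form one pool, which is optimistic: each actor has to fit on a single node, so the real number that can be placed may be lower.

```python
def actors_that_fit(total_cpus: float, cpus_per_actor: float) -> int:
    """Upper bound on how many actors the cluster's CPUs can hold at once."""
    return int(total_cpus // cpus_per_actor)


def pending_actors(total_cpus: float, cpus_per_actor: float, requested: int) -> int:
    """Number of requested actors that cannot be placed and would stay pending."""
    return max(0, requested - actors_that_fit(total_cpus, cpus_per_actor))


# Hypothetical cluster with 16 CPUs, running the Counter example (4 CPUs x 20 actors).
# On a live cluster the total could be read with ray.cluster_resources().get("CPU", 0).
print(actors_that_fit(16, 4))
print(pending_actors(16, 4, 20))
```

If the second number is above zero, those actors will sit in PENDING_CREATION until more CPUs become available or num_cpus is lowered.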

I found that the problem was that k8s did not have enough resources for the pod. Thanks!

In the end, the cause was that the head pod had too little memory. I increased its memory from 4G to 32G and the problem disappeared. However, I could not find any useful information pointing to this while debugging.