Join tasks getting stuck in PENDING_NODE_ASSIGNMENT

Thank you Dennis! Great job debugging and running tests :smiley: I think you're definitely on the right track here.

The autoscaler log points to a resource-placement issue rather than a bug in
Dataset.join:

	scheduled right now: {'CPU': 0.125, 'memory': 939524096.0}. This is likely
	due to all cluster resources being claimed by actors. Consider creating
	fewer actors or adding more nodes to this Ray cluster.

So this looks like a pure scheduling issue. The likely culprit is HashShuffleAggregator, which is an actor rather than a task. Your join needs these aggregators, and the number of them is determined by the num_partitions parameter.

Each HashShuffleAggregator actor created by the join reserves ~0.9 GiB of Ray memory. With 2 GB worker pods, only two aggregators fit per node, so when the join tries to launch a third (or more), those actors sit in PENDING_NODE_ASSIGNMENT and the job stalls.
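A quick back-of-envelope check makes the limit obvious. The per-actor memory request (939524096 bytes) is taken straight from the autoscaler log above; the variable names are just illustrative:

```python
# Why only two aggregators fit per 2 GB worker pod.
AGGREGATOR_MEM = 939_524_096   # bytes requested per HashShuffleAggregator (from the log, ~0.9 GiB)
NODE_MEM = 2 * 1024**3         # 2 GiB worker pod

per_node = NODE_MEM // AGGREGATOR_MEM  # aggregators that fit on one pod
print(per_node)                        # -> 2
```

Any num_partitions beyond 2 × (number of nodes) therefore leaves aggregators unschedulable.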

I think lowering num_partitions will get it unstuck (and likely run faster too); try a value less than 4. You could even start with 1, since the Iris dataset is small.

Do you actually need that many partitions? Given the size of your dataset, scheduling will likely work better with fewer. Alternatively, bump each worker to 4 to 8 GB of RAM.
