Join tasks getting stuck in PENDING_NODE_ASSIGNMENT

Thank you Dennis! Great job debugging and running tests :smiley: I think you're definitely on the right track here.

The autoscaler log points to a resource-placement issue rather than a bug in
Dataset.join:

	scheduled right now: {'CPU': 0.125, 'memory': 939524096.0}. This is likely
	due to all cluster resources being claimed by actors. Consider creating
	fewer actors or adding more nodes to this Ray cluster.

So this looks like a pure scheduling issue. The likely culprit is HashShuffleAggregator, which is an actor rather than a task. Your join needs these aggregators, and the number of them is determined by the num_partitions parameter.

Each HashShuffleAggregator actor created by the join reserves ~0.9 GiB of Ray memory. With 2 GB worker pods, only two aggregators fit per node, so when the join tries to launch a third (or more), those actors sit in PENDING_NODE_ASSIGNMENT and the job stalls.
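A quick back-of-envelope check makes the limit obvious. The per-actor memory request (939524096 bytes) is taken straight from the autoscaler log above; the variable names are just illustrative:

```python
# Why only two aggregators fit per 2 GB worker pod.
AGGREGATOR_MEM = 939_524_096   # bytes requested per HashShuffleAggregator (from the log, ~0.9 GiB)
NODE_MEM = 2 * 1024**3         # 2 GiB worker pod

per_node = NODE_MEM // AGGREGATOR_MEM  # aggregators that fit on one pod
print(per_node)                        # -> 2
```

Any num_partitions beyond 2 × (number of nodes) therefore leaves aggregators unschedulable.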

I think lowering num_partitions will get it unstuck (and likely run faster too); try a value less than 4. You could even start with 1, since the Iris dataset is small.

Do you actually need that many partitions? Given the size of your dataset, scheduling will likely work better with fewer. Alternatively, bump each worker to 4 to 8 GB of RAM.
