1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.
2. Environment:
- Ray version: 2.40
- Python version: 3.10
3. What happened vs. what you expected:
- Actual:
My Ray worker node is configured with 6 CPUs, and each actor's num_cpus is set to 0.01. That value is unusually small, but the actor needs almost no CPU, so I set it low. In theory, 6 / 0.01 = 600 actors can be scheduled on the node. However, an error occurred when 200 actors were scheduled: “failed to lease a worker”. The message suggests that the node's resources are insufficient. At that point the worker node's CPU is fully utilized, with frequent CPU throttling, and memory usage is at approximately 94%. What could be the problem here?
The accompanying warning (beginning cut off): 29ac6bad29a5b035a7c56f402f61d82af56d2894b73ea9d32d4a8c10 with address: 10.66.147.247. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see “Ray starts too many workers (and may crash) when using nested remote functions”, Issue #3644 in ray-project/ray on GitHub, for some discussion of workarounds).
What could be the reason for this? The error message does not make the actual cause clear.
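For reference, here is a rough sketch of the arithmetic behind the 600-actor estimate, and of why CPU may not be the binding limit. Each Ray actor runs in its own worker process, so memory can cap the actor count well before the logical CPU resource does. The function names and the memory figures (a 16 GiB node, 80 MiB per worker process) are hypothetical placeholders for illustration, not measurements from my cluster.

```python
import math


def max_actors_by_cpu(node_cpus: float, cpus_per_actor: float) -> int:
    """Upper bound from the logical CPU resource alone."""
    # Round before flooring to guard against float artifacts in the division.
    return math.floor(round(node_cpus / cpus_per_actor, 6))


def max_actors_by_memory(node_memory_mib: float, per_worker_mib: float) -> int:
    """Upper bound from memory, assuming one worker process per actor."""
    return math.floor(node_memory_mib / per_worker_mib)


# Values from the post: 6 CPUs on the node, num_cpus=0.01 per actor.
cpu_limit = max_actors_by_cpu(6, 0.01)

# Hypothetical values: 16 GiB node memory, 80 MiB per idle worker process.
memory_limit = max_actors_by_memory(16 * 1024, 80)

# The node can only host as many actors as the tighter of the two bounds.
effective_limit = min(cpu_limit, memory_limit)

print(f"CPU bound: {cpu_limit}")
print(f"Memory bound: {memory_limit}")
print(f"Effective bound: {effective_limit}")
```

With placeholder numbers like these, the memory bound is far below the CPU bound, which would be consistent with memory already being at 94% when scheduling fails. The real per-worker footprint would need to be measured on the node.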