How many ports to open for ray.data?

Right now I’m opening “num cores + 2” worker ports per node. However, when I run a ray.data job, I get the following error:

Failed to register worker 257ee8624e2c457234a4eee91ff87fde742a8fc85d0e4292c2073398 to Raylet. Invalid: Invalid: No available ports. Please specify a wider port range using --min-worker-port and --max-worker-port.
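For context, this is roughly how I start each node; the head address and port numbers below are illustrative for a 5-CPU node (num cores + 2 = 7 ports):

ray start --address=<head_ip>:6379 \
    --min-worker-port=10002 \
    --max-worker-port=10008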

And output like the following in raylet.out:

[2021-08-26 17:25:31,309 I 297 314] store.cc:315: Object store current usage 0.000327152 / 31.4038 GB.
[2021-08-26 17:25:31,538 I 297 297] worker_pool.cc:369: Started worker process of 1 worker(s) with pid 2427
[2021-08-26 17:25:32,009 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:25:32,009 I 297 297] node_manager.cc:1208: Ignoring client disconnect because the client has already been disconnected.
[2021-08-26 17:25:32,087 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:25:42,159 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:25:52,234 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:26:01,538 I 297 297] worker_pool.cc:396: Some workers of the worker process(2427) have not registered to raylet within timeout.
[2021-08-26 17:26:03,935 I 297 297] worker_pool.cc:369: Started worker process of 1 worker(s) with pid 2455
[2021-08-26 17:26:04,418 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:26:04,418 I 297 297] node_manager.cc:1208: Ignoring client disconnect because the client has already been disconnected.
[2021-08-26 17:26:05,250 I 297 297] worker_pool.cc:369: Started worker process of 1 worker(s) with pid 2478
[2021-08-26 17:26:05,749 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:26:05,749 I 297 297] node_manager.cc:1208: Ignoring client disconnect because the client has already been disconnected.
[2021-08-26 17:26:21,752 W 297 297] node_manager.cc:764: The actor or task with ID 3adbad3aeb3934e58bd4d1b7fcbf5f28361ae3c09ceff84c cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU: 1.000000}
Available resources on this node: {3.000000/5.000000 CPU, 58.929904 GiB/58.929904 GiB memory, 29.247055 GiB/29.247055 GiB object_store_memory, 1.000000/1.000000 node:10.86.136.216}
In total there are 2 pending tasks and 0 pending actors on this node.
[2021-08-26 17:26:21,794 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:26:31,830 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:26:33,935 I 297 297] worker_pool.cc:396: Some workers of the worker process(2455) have not registered to raylet within timeout.
[2021-08-26 17:26:35,250 I 297 297] worker_pool.cc:396: Some workers of the worker process(2478) have not registered to raylet within timeout.
[2021-08-26 17:26:35,318 I 297 297] worker_pool.cc:369: Started worker process of 1 worker(s) with pid 2502
[2021-08-26 17:26:35,823 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:26:35,823 I 297 297] node_manager.cc:1208: Ignoring client disconnect because the client has already been disconnected.
[2021-08-26 17:26:41,926 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:26:52,032 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:27:02,110 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:27:05,318 I 297 297] worker_pool.cc:396: Some workers of the worker process(2502) have not registered to raylet within timeout.
[2021-08-26 17:27:05,457 I 297 297] worker_pool.cc:369: Started worker process of 1 worker(s) with pid 2523
[2021-08-26 17:27:06,045 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:27:06,045 I 297 297] node_manager.cc:1208: Ignoring client disconnect because the client has already been disconnected.
[2021-08-26 17:27:12,197 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:27:22,211 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:27:32,219 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:27:35,457 I 297 297] worker_pool.cc:396: Some workers of the worker process(2523) have not registered to raylet within timeout.
[2021-08-26 17:27:36,107 I 297 297] worker_pool.cc:369: Started worker process of 1 worker(s) with pid 2543
[2021-08-26 17:27:36,634 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:27:36,634 I 297 297] node_manager.cc:1208: Ignoring client disconnect because the client has already been disconnected.
[2021-08-26 17:27:42,314 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:27:52,410 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:28:02,510 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:28:06,107 I 297 297] worker_pool.cc:396: Some workers of the worker process(2543) have not registered to raylet within timeout.
[2021-08-26 17:28:06,740 I 297 297] worker_pool.cc:369: Started worker process of 1 worker(s) with pid 2563
[2021-08-26 17:28:07,226 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:28:07,226 I 297 297] node_manager.cc:1208: Ignoring client disconnect because the client has already been disconnected.
[2021-08-26 17:28:12,522 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:28:22,618 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:28:32,715 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:28:36,740 I 297 297] worker_pool.cc:396: Some workers of the worker process(2563) have not registered to raylet within timeout.
[2021-08-26 17:28:50,924 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:29:00,978 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:29:08,510 I 297 297] worker_pool.cc:369: Started worker process of 1 worker(s) with pid 2578
[2021-08-26 17:29:08,975 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:29:08,976 I 297 297] node_manager.cc:1208: Ignoring client disconnect because the client has already been disconnected.
[2021-08-26 17:29:21,759 W 297 297] node_manager.cc:764: The actor or task with ID 7a19f4296ab1571cc97ee61fe2cb4837d068dbaf265395a3 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU: 1.000000}
Available resources on this node: {4.000000/5.000000 CPU, 58.929904 GiB/58.929904 GiB memory, 29.247055 GiB/29.247055 GiB object_store_memory, 1.000000/1.000000 node:10.86.136.216}
In total there are 1 pending tasks and 0 pending actors on this node.
[2021-08-26 17:29:21,790 I 297 297] node_manager.cc:624: Sending Python GC request to 7 local workers to clean up Python cyclic references.
[2021-08-26 17:29:27,092 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:29:27,092 I 297 297] node_manager.cc:1288: Driver (pid=1775) is disconnected. job_id: 03000000
[2021-08-26 17:29:27,092 I 297 297] node_manager.cc:872: Owner process 03000000ffffffffffffffffffffffffffffffffffffffffffffffff died, killing leased worker 51a050ae644ead33c4c1eb609e05394eda847a547d2122de15eded32
[2021-08-26 17:29:27,092 I 297 297] node_manager.cc:872: Owner process 03000000ffffffffffffffffffffffffffffffffffffffffffffffff died, killing leased worker f54979f304a0035ad3819d01bb994b11611c35192659a21f2fead802
[2021-08-26 17:29:27,182 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:29:27,182 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:29:27,183 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:29:27,205 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:29:27,205 I 297 297] node_manager.cc:1266: A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: bdda66698477aa47b6d6cbdcfe02db5368fe0a219f785adb Worker ID: 51a050ae644ead33c4c1eb609e05394eda847a547d2122de15eded32 Node ID: 6024c240df28fe1057e8f695a4a6a5d92cbe4f100c95898c28a65695 Worker IP address: XXXX Worker port: 31019 Worker PID: 1962
[2021-08-26 17:29:27,230 I 297 297] node_manager.cc:1194: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-08-26 17:29:27,230 I 297 297] node_manager.cc:1266: A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 55bb684e85df8dc50f9e6644ad831563122cf5ca7afee325 Worker ID: f54979f304a0035ad3819d01bb994b11611c35192659a21f2fead802 Node ID: 6024c240df28fe1057e8f695a4a6a5d92cbe4f100c95898c28a65695 Worker IP address: XXXX Worker port: 31020 Worker PID: 1874
[2021-08-26 17:29:31,793 I 297 297] node_manager.cc:624: Sending Python GC request to 1 local workers to clean up Python cyclic references.

Any ideas?

Can you try something like 2X? Ray has a mechanism to launch a new worker even when there are already num_cpus workers, if a worker blocks inside a blocking API (e.g., ray.get). The Datasets implementation may use this pattern. Also cc @Clark_Zinzow @Alex to confirm.
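A minimal sketch of that mechanism (names and counts illustrative, not from the Datasets internals):

import ray

ray.init(num_cpus=2)

@ray.remote
def child():
    return 1

@ray.remote
def parent():
    # parent blocks in ray.get, releasing its CPU; Ray can then
    # start extra workers beyond num_cpus so the children can run
    return ray.get(child.remote())

# each parent blocked in ray.get may trigger an extra worker,
# so the total worker count (and ports needed) can exceed num_cpus
print(ray.get([parent.remote() for _ in range(2)]))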

I increased the number of ports from (#cpus + 2) to 2 x (#cpus + 2). The port contention message no longer shows up, but the data reading is still very slow.
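For reference, the node start command now looks roughly like this (ports illustrative; 2 x (5 + 2) = 14 ports for a 5-CPU node):

ray start --address=<head_ip>:6379 \
    --min-worker-port=10002 \
    --max-worker-port=10015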