How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
We are trying to run Ray on a Kind cluster on an M1 Mac with Podman. We use KubeRay to create the cluster and the Ray Jobs HTTP API to submit jobs. Here are the errors that we see consistently:
1. The cluster is created successfully.
2. When we submit a job, we see the following:
04:27:04 INFO - Launching noop transform
04:27:04 INFO - connecting to existing cluster
04:27:04 INFO - noop parameters are : {'sleep_sec': 10, 'pwd': 'nothing'}
04:27:04 INFO - data factory data_ is using S3 data access: input path - test/noop/input/, output path - test/noop/output/
04:27:04 INFO - data factory data_ max_files -1, n_sample -1
04:27:04 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
04:27:04 INFO - number of workers 4 worker options {'num_cpus': 0.8}
04:27:04 INFO - pipeline id pipeline_id; number workers 4
04:27:04 INFO - job details {'job category': 'preprocessing', 'job name': 'noop', 'job type': 'ray', 'job id': '754140df-e6d5-4cce-808c-cda128f4e571'}
04:27:04 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
04:27:04 INFO - actor creation delay 0
04:27:04 INFO - Connecting to the existing Ray cluster
2024-05-16 04:27:04,655 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
04:28:19 INFO - Exception running ray remote orchestration
Initialization failure from server:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 711, in Datapath
raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.
04:28:19 INFO - Completed execution in 1.2496037801106772 min, execution result 1
In ray_client_server_23000.err we see the following:
2024-05-16 04:27:08,551 INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:23000, args Namespace(host='0.0.0.0', port=23000, mode='specific-server', address='10.244.2.11:6379', redis_password=None, runtime_env_agent_address=None)
2024-05-16 04:27:13,728 INFO server.py:930 -- 25 idle checks before shutdown.
2024-05-16 04:27:18,747 INFO server.py:930 -- 20 idle checks before shutdown.
2024-05-16 04:27:23,766 INFO server.py:930 -- 15 idle checks before shutdown.
2024-05-16 04:27:28,785 INFO server.py:930 -- 10 idle checks before shutdown.
2024-05-16 04:27:33,806 INFO server.py:930 -- 5 idle checks before shutdown.
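To make the countdown easier to read, here is a small sketch that parses the idle-check lines above and estimates how fast the counter drops. The regular expression and the helper names (`parse_idle_checks`, `seconds_per_check`, `estimated_shutdown`) are our own and not part of Ray, and the shutdown time is only a linear extrapolation from the five samples, not something the log states.

```python
import re
from datetime import datetime, timedelta

# Log lines copied from ray_client_server_23000.err above.
LOG = """\
2024-05-16 04:27:13,728 INFO server.py:930 -- 25 idle checks before shutdown.
2024-05-16 04:27:18,747 INFO server.py:930 -- 20 idle checks before shutdown.
2024-05-16 04:27:23,766 INFO server.py:930 -- 15 idle checks before shutdown.
2024-05-16 04:27:28,785 INFO server.py:930 -- 10 idle checks before shutdown.
2024-05-16 04:27:33,806 INFO server.py:930 -- 5 idle checks before shutdown.
"""

# Hypothetical pattern written for this log format only.
IDLE_RE = re.compile(
    r"^(\S+ \S+) INFO server\.py:\d+ -- (\d+) idle checks before shutdown\.$"
)


def parse_idle_checks(text):
    """Return (timestamp, remaining_checks) pairs from idle-check log lines."""
    rows = []
    for line in text.splitlines():
        match = IDLE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            rows.append((stamp, int(match.group(2))))
    return rows


def seconds_per_check(rows):
    """Average seconds elapsed per idle check between first and last sample."""
    (first_ts, first_n), (last_ts, last_n) = rows[0], rows[-1]
    return (last_ts - first_ts).total_seconds() / (first_n - last_n)


def estimated_shutdown(rows):
    """Linearly extrapolate when the counter would reach zero (an assumption)."""
    last_ts, remaining = rows[-1]
    return last_ts + timedelta(seconds=remaining * seconds_per_check(rows))


rows = parse_idle_checks(LOG)
print("seconds per check:", round(seconds_per_check(rows), 3))
print("estimated shutdown:", estimated_shutdown(rows).time())
```

On these five lines the counter drops by about one check per second, which would put the idle shutdown a little before 04:27:39. That suggests the specific server sat idle for its whole countdown without the proxy ever connecting to it, although we have not confirmed this.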
Finally, in the Ray client-server.err log we see:
2024-05-16 04:25:56,670 INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='10.244.2.11:6379', redis_password=None, runtime_env_agent_address='http://10.244.2.11:36281')
2024-05-16 04:27:05,083 INFO proxier.py:696 -- New data connection from client f8962dfcb62a4b9dbf4113842e7ec013:
2024-05-16 04:27:39,278 ERROR proxier.py:333 -- SpecificServer startup failed for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:27:39,279 INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 411 for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:27:39,305 ERROR proxier.py:707 -- Server startup failed for client: f8962dfcb62a4b9dbf4113842e7ec013, using JobConfig: <ray.job_config.JobConfig object at 0x2aaabd6caef0>!
2024-05-16 04:27:57,032 INFO proxier.py:391 -- Specific server f8962dfcb62a4b9dbf4113842e7ec013 is no longer running, freeing its port 23000
2024-05-16 04:28:09,307 ERROR proxier.py:380 -- Timeout waiting for channel for f8962dfcb62a4b9dbf4113842e7ec013
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 375, in get_channel
grpc.channel_ready_future(server.channel).result(
File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 162, in result
self._block(timeout)
File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 106, in _block
raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2024-05-16 04:28:09,310 WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
2024-05-16 04:28:09,324 INFO proxier.py:768 -- f8962dfcb62a4b9dbf4113842e7ec013 last started stream at 1715858825.0607615. Current stream started at 1715858825.0607615.
2024-05-16 04:28:11,313 ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:11,313 WARNING proxier.py:804 -- Retrying Logstream connection. 2 attempts failed.
2024-05-16 04:28:13,316 ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:13,316 WARNING proxier.py:804 -- Retrying Logstream connection. 3 attempts failed.
2024-05-16 04:28:15,318 ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:15,319 WARNING proxier.py:804 -- Retrying Logstream connection. 4 attempts failed.
2024-05-16 04:28:17,321 ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:17,321 WARNING proxier.py:804 -- Retrying Logstream connection. 5 attempts failed.
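To line up the proxier events, here is a small sketch that converts a few of the timestamps above into seconds elapsed since the client's data connection arrived. The event labels are shortened by us, and `parse_ts` and `elapsed_since_first` are hypothetical helper names, not Ray functions.

```python
from datetime import datetime

# (timestamp, shortened label) pairs copied from the proxier log above.
EVENTS = [
    ("2024-05-16 04:27:05,083", "New data connection from client"),
    ("2024-05-16 04:27:39,278", "SpecificServer startup failed"),
    ("2024-05-16 04:27:57,032", "Specific server no longer running, port 23000 freed"),
    ("2024-05-16 04:28:09,307", "Timeout waiting for channel"),
]


def parse_ts(stamp):
    """Parse a timestamp in the 'YYYY-MM-DD HH:MM:SS,mmm' format used by the log."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def elapsed_since_first(events):
    """Return (seconds_since_first_event, label) for every event."""
    start = parse_ts(events[0][0])
    return [
        (round((parse_ts(stamp) - start).total_seconds(), 3), label)
        for stamp, label in events
    ]


for seconds, label in elapsed_since_first(EVENTS):
    print(f"+{seconds:7.3f}s  {label}")
```

Read this way, the specific server is reported as failed about 34 seconds after the client connects, and the channel timeout follows about 30 seconds after that. Both gaps look like fixed timeouts rather than an immediate crash, but that is our reading of the timestamps, not something the logs say outright.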