Strange errors running Ray on M1 Mac using podman

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

We are trying to run Ray on a Kind cluster on an M1 Mac with Podman. We use KubeRay to create the cluster and the Ray jobs HTTP APIs to submit jobs. Here is what we see consistently:
1. The cluster is created successfully.
2. When we submit a job, we see the following:

04:27:04 INFO - Launching noop transform
04:27:04 INFO - connecting to existing cluster
04:27:04 INFO - noop parameters are : {'sleep_sec': 10, 'pwd': 'nothing'}
04:27:04 INFO - data factory data_ is using S3 data access: input path - test/noop/input/, output path - test/noop/output/
04:27:04 INFO - data factory data_ max_files -1, n_sample -1
04:27:04 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
04:27:04 INFO - number of workers 4 worker options {'num_cpus': 0.8}
04:27:04 INFO - pipeline id pipeline_id; number workers 4
04:27:04 INFO - job details {'job category': 'preprocessing', 'job name': 'noop', 'job type': 'ray', 'job id': '754140df-e6d5-4cce-808c-cda128f4e571'}
04:27:04 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
04:27:04 INFO - actor creation delay 0
04:27:04 INFO - Connecting to the existing Ray cluster
2024-05-16 04:27:04,655	INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
04:28:19 INFO - Exception running ray remote orchestration
Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 711, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.

04:28:19 INFO - Completed execution in 1.2496037801106772 min, execution result 1

In the ray_client_server_23000.err we see the following:

2024-05-16 04:27:08,551	INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:23000, args Namespace(host='0.0.0.0', port=23000, mode='specific-server', address='10.244.2.11:6379', redis_password=None, runtime_env_agent_address=None)
2024-05-16 04:27:13,728	INFO server.py:930 -- 25 idle checks before shutdown.
2024-05-16 04:27:18,747	INFO server.py:930 -- 20 idle checks before shutdown.
2024-05-16 04:27:23,766	INFO server.py:930 -- 15 idle checks before shutdown.
2024-05-16 04:27:28,785	INFO server.py:930 -- 10 idle checks before shutdown.
2024-05-16 04:27:33,806	INFO server.py:930 -- 5 idle checks before shutdown.

And finally, in the Ray client-server.err file we see:

2024-05-16 04:25:56,670	INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='10.244.2.11:6379', redis_password=None, runtime_env_agent_address='http://10.244.2.11:36281')
2024-05-16 04:27:05,083	INFO proxier.py:696 -- New data connection from client f8962dfcb62a4b9dbf4113842e7ec013: 
2024-05-16 04:27:39,278	ERROR proxier.py:333 -- SpecificServer startup failed for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:27:39,279	INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 411 for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:27:39,305	ERROR proxier.py:707 -- Server startup failed for client: f8962dfcb62a4b9dbf4113842e7ec013, using JobConfig: <ray.job_config.JobConfig object at 0x2aaabd6caef0>!
2024-05-16 04:27:57,032	INFO proxier.py:391 -- Specific server f8962dfcb62a4b9dbf4113842e7ec013 is no longer running, freeing its port 23000
2024-05-16 04:28:09,307	ERROR proxier.py:380 -- Timeout waiting for channel for f8962dfcb62a4b9dbf4113842e7ec013
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 375, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 162, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 106, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2024-05-16 04:28:09,310	WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
2024-05-16 04:28:09,324	INFO proxier.py:768 -- f8962dfcb62a4b9dbf4113842e7ec013 last started stream at 1715858825.0607615. Current stream started at 1715858825.0607615.
2024-05-16 04:28:11,313	ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:11,313	WARNING proxier.py:804 -- Retrying Logstream connection. 2 attempts failed.
2024-05-16 04:28:13,316	ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:13,316	WARNING proxier.py:804 -- Retrying Logstream connection. 3 attempts failed.
2024-05-16 04:28:15,318	ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:15,319	WARNING proxier.py:804 -- Retrying Logstream connection. 4 attempts failed.
2024-05-16 04:28:17,321	ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:17,321	WARNING proxier.py:804 -- Retrying Logstream connection. 5 attempts failed.
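
To make the ordering of events easier to see, the proxy log timestamps can be turned into elapsed seconds since the client connected. This is only a rough sketch: `parse_line` and `timeline` are helper names made up for this post (they are not part of Ray), and the sample lines are copied from the log above, with trailing whitespace trimmed.

```python
from datetime import datetime

# Four lines copied from the proxy log above (a tab follows each timestamp).
LOG = "\n".join([
    "2024-05-16 04:27:05,083\tINFO proxier.py:696 -- New data connection from client f8962dfcb62a4b9dbf4113842e7ec013:",
    "2024-05-16 04:27:39,278\tERROR proxier.py:333 -- SpecificServer startup failed for client: f8962dfcb62a4b9dbf4113842e7ec013",
    "2024-05-16 04:27:57,032\tINFO proxier.py:391 -- Specific server f8962dfcb62a4b9dbf4113842e7ec013 is no longer running, freeing its port 23000",
    "2024-05-16 04:28:09,307\tERROR proxier.py:380 -- Timeout waiting for channel for f8962dfcb62a4b9dbf4113842e7ec013",
])


def parse_line(line):
    """Return (timestamp, level, message), or None for lines without a timestamp."""
    parts = line.split(None, 3)  # date, time, level, remainder
    if len(parts) < 4:
        return None
    try:
        ts = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        # Traceback lines and other continuation lines end up here.
        return None
    return ts, parts[2], parts[3].strip()


def timeline(log_text):
    """Return (seconds since first event, level, message) for each parsed line."""
    events = [e for e in map(parse_line, log_text.splitlines()) if e is not None]
    if not events:
        return []
    start = events[0][0]
    return [((ts - start).total_seconds(), level, msg) for ts, level, msg in events]


for elapsed, level, msg in timeline(LOG):
    print(f"{elapsed:7.3f}s  {level:<7} {msg}")
```

With the lines above, the "SpecificServer startup failed" error comes about 34 seconds after the new data connection, and the channel timeout about 64 seconds after it.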

Also, it works fine on an Intel Mac with Docker, on Windows with Docker, and on RHEL with Docker.

Can you share your dockerfile and/or repro script?

The Dockerfile is:

FROM docker.io/rayproject/ray:2.9.3-py310

# install pytest
RUN pip install --no-cache-dir pytest

# Copy in the data processing framework source/project and install it
# This is expected to be placed in the docker context before this is run (see the make image).
COPY --chown=ray:users data-processing-lib/ data-processing-lib/
# install data processing
RUN cd data-processing-lib && pip install --no-cache-dir -e .

COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r  requirements.txt

# copy source data
COPY ./src/noop_transform.py .
COPY ./src/noop_local_ray.py local/

# copy test
COPY test/ test/
COPY test-data/ test-data/

# Set environment
ENV PYTHONPATH=/home/ray

# Put these at the end since they seem to upset the docker cache.
ARG BUILD_DATE
ARG GIT_COMMIT
LABEL build-date=$BUILD_DATE
LABEL git-commit=$GIT_COMMIT

As for reproduction, it’s a bit more involved. We are using a Kind cluster to test KFP invocation of Ray-based applications. The project is on GitHub: IBM/data-prep-kit (open source project for data preparation of LLM application builders).