Strange errors running Ray on M1 Mac using podman

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

We are trying to run Ray on a Kind cluster on an M1 Mac with Podman. We are using KubeRay to create the cluster and the Ray Jobs HTTP APIs to submit jobs. Here is what we are seeing consistently:
1. The cluster is created successfully.
2. When we submit a job, we see the following:

04:27:04 INFO - Launching noop transform
04:27:04 INFO - connecting to existing cluster
04:27:04 INFO - noop parameters are : {'sleep_sec': 10, 'pwd': 'nothing'}
04:27:04 INFO - data factory data_ is using S3 data access: input path - test/noop/input/, output path - test/noop/output/
04:27:04 INFO - data factory data_ max_files -1, n_sample -1
04:27:04 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
04:27:04 INFO - number of workers 4 worker options {'num_cpus': 0.8}
04:27:04 INFO - pipeline id pipeline_id; number workers 4
04:27:04 INFO - job details {'job category': 'preprocessing', 'job name': 'noop', 'job type': 'ray', 'job id': '754140df-e6d5-4cce-808c-cda128f4e571'}
04:27:04 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
04:27:04 INFO - actor creation delay 0
04:27:04 INFO - Connecting to the existing Ray cluster
2024-05-16 04:27:04,655	INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error
04:28:19 INFO - Exception running ray remote orchestration
Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 711, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.

04:28:19 INFO - Completed execution in 1.2496037801106772 min, execution result 1

In the ray_client_server_23000.err we see the following:

2024-05-16 04:27:08,551	INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:23000, args Namespace(host='0.0.0.0', port=23000, mode='specific-server', address='10.244.2.11:6379', redis_password=None, runtime_env_agent_address=None)
2024-05-16 04:27:13,728	INFO server.py:930 -- 25 idle checks before shutdown.
2024-05-16 04:27:18,747	INFO server.py:930 -- 20 idle checks before shutdown.
2024-05-16 04:27:23,766	INFO server.py:930 -- 15 idle checks before shutdown.
2024-05-16 04:27:28,785	INFO server.py:930 -- 10 idle checks before shutdown.
2024-05-16 04:27:33,806	INFO server.py:930 -- 5 idle checks before shutdown.

And finally in the Ray client-server.err we see:

2024-05-16 04:25:56,670	INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='10.244.2.11:6379', redis_password=None, runtime_env_agent_address='http://10.244.2.11:36281')
2024-05-16 04:27:05,083	INFO proxier.py:696 -- New data connection from client f8962dfcb62a4b9dbf4113842e7ec013: 
2024-05-16 04:27:39,278	ERROR proxier.py:333 -- SpecificServer startup failed for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:27:39,279	INFO proxier.py:341 -- SpecificServer started on port: 23000 with PID: 411 for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:27:39,305	ERROR proxier.py:707 -- Server startup failed for client: f8962dfcb62a4b9dbf4113842e7ec013, using JobConfig: <ray.job_config.JobConfig object at 0x2aaabd6caef0>!
2024-05-16 04:27:57,032	INFO proxier.py:391 -- Specific server f8962dfcb62a4b9dbf4113842e7ec013 is no longer running, freeing its port 23000
2024-05-16 04:28:09,307	ERROR proxier.py:380 -- Timeout waiting for channel for f8962dfcb62a4b9dbf4113842e7ec013
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 375, in get_channel
    grpc.channel_ready_future(server.channel).result(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 162, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/grpc/_utilities.py", line 106, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2024-05-16 04:28:09,310	WARNING proxier.py:804 -- Retrying Logstream connection. 1 attempts failed.
2024-05-16 04:28:09,324	INFO proxier.py:768 -- f8962dfcb62a4b9dbf4113842e7ec013 last started stream at 1715858825.0607615. Current stream started at 1715858825.0607615.
2024-05-16 04:28:11,313	ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:11,313	WARNING proxier.py:804 -- Retrying Logstream connection. 2 attempts failed.
2024-05-16 04:28:13,316	ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:13,316	WARNING proxier.py:804 -- Retrying Logstream connection. 3 attempts failed.
2024-05-16 04:28:15,318	ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:15,319	WARNING proxier.py:804 -- Retrying Logstream connection. 4 attempts failed.
2024-05-16 04:28:17,321	ERROR proxier.py:351 -- Unable to find channel for client: f8962dfcb62a4b9dbf4113842e7ec013
2024-05-16 04:28:17,321	WARNING proxier.py:804 -- Retrying Logstream connection. 5 attempts failed.

Also, it works fine on an Intel Mac with Docker, on Windows with Docker, and on RHEL with Docker.
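To read the three logs together, it can help to line up their timestamps. The following is a minimal sketch in Python that computes the seconds elapsed since the client connected, using four timestamps copied from the logs above. The event labels and the `EVENTS`, `parse_ts`, and `elapsed_seconds` names are illustrative only and are not part of Ray.

```python
from datetime import datetime

# Timestamps copied from the logs above; the labels are shortened descriptions.
EVENTS = [
    ("proxy: new data connection from client", "2024-05-16 04:27:05,083"),
    ("per-client server: starting on port 23000", "2024-05-16 04:27:08,551"),
    ("proxy: SpecificServer startup failed", "2024-05-16 04:27:39,278"),
    ("proxy: timeout waiting for channel", "2024-05-16 04:28:09,307"),
]


def parse_ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS,mmm' timestamp as used in the Ray logs."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def elapsed_seconds(events):
    """Return (label, seconds since the first event) for each event."""
    start = parse_ts(events[0][1])
    return [(label, (parse_ts(ts) - start).total_seconds()) for label, ts in events]


timeline = elapsed_seconds(EVENTS)
for label, seconds in timeline:
    print(f"{seconds:7.3f}s  {label}")
```

On these logs, the startup failure is reported about 34 seconds after the client connects, and the channel timeout follows about 30 seconds after that. This looks more like the per-client server never becoming reachable within a timeout than like an immediate crash, but that reading is an inference from the timestamps, not something the logs state.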

Can you share your dockerfile and/or repro script?

The Dockerfile is:

FROM docker.io/rayproject/ray:2.9.3-py310

# install pytest
RUN pip install --no-cache-dir pytest

# Copy in the data processing framework source/project and install it
# This is expected to be placed in the docker context before this is run (see the make image).
COPY --chown=ray:users data-processing-lib/ data-processing-lib/
# install data processing
RUN cd data-processing-lib && pip install --no-cache-dir -e .

COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# copy source data
COPY ./src/noop_transform.py .
COPY ./src/noop_local_ray.py local/

# copy test
COPY test/ test/
COPY test-data/ test-data/

# Set environment
ENV PYTHONPATH /home/ray

# Put these at the end since they seem to upset the docker cache.
ARG BUILD_DATE
ARG GIT_COMMIT
LABEL build-date=$BUILD_DATE
LABEL git-commit=$GIT_COMMIT

As for reproduction, it’s a bit more involved. We are using a Kind cluster to test KFP invocation of Ray-based applications. The project is on GitHub: IBM/data-prep-kit (an open source project for data preparation for LLM application builders).

Hi,

Did you manage to find a solution for this issue?

This is related to this GitHub issue: https://github.com/ray-project/ray/issues/29852

Not really. I don't see how it is related to the issue above. It works fine on an Intel Mac with Docker, but not on an M1 Mac. It also works fine on RHEL with Docker, but not with Podman. So the difference is Docker vs. Podman and M1 vs. Intel.
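Since one of the two differences named above is M1 vs. Intel, one thing worth comparing between the working and failing setups is the CPU architecture that the Python interpreter inside the container reports. The sketch below is only a diagnostic aid and makes no claim about the root cause; `describe_arch` is a hypothetical helper, not a Ray or Podman API.

```python
import platform


def describe_arch(machine):
    """Map a platform.machine() string to a coarse CPU family label."""
    name = machine.lower()
    if name in ("arm64", "aarch64"):
        return "arm64"
    if name in ("x86_64", "amd64"):
        return "x86_64"
    return "other (" + machine + ")"


# Run this inside the Ray head or worker container and compare it with the host.
print("interpreter architecture:", describe_arch(platform.machine()))
```

If the container reports x86_64 while the host is arm64, the image is running under emulation, which would be one more variable to rule out alongside Docker vs. Podman.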