Raylet worker processes are failing

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m running a GKE-flavoured KubeRay operator with a RayCluster deployed in it. My current problem is that any job submitted via the Job Submission SDK (through an in-house wrapper on top of it) fails to start with a cryptic internal Ray error, and I would appreciate any help with it.
This problem is reproducible each time, so I’ll be happy to help with debugging and fixing it, as it blocks me from adopting Ray.

Error message from the raylet.err log on the worker node:

[2025-03-04 08:18:42,827 E 74 74] (raylet) worker_pool.cc:581: Some workers of the worker process(8902) have not registered within the timeout. The process is dead, probably it crashed during start.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/workers/default_worker.py", line 8, in <module>
    import ray._private.node
ModuleNotFoundError: No module named 'ray._private'

while the file itself is there:

(base) ray@raycluster-default-worker-mrwpg:~$ ~/tmp/session_latest/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv/bin/python -c 'import ray._private.node; print(ray._private.node.__file__)'
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/node.py

The same message repeats continuously with new worker process PIDs, and the Ray job itself stays in a pending state indefinitely.
Unfortunately, there are no logs from the crashing worker processes; it looks like they are terminated before they can write anything out.

Setup

ray_cluster.yaml

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster
spec:
  rayVersion: "2.43.0"
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
      disable-usage-stats: "true"
      temp-dir: "/home/ray/tmp"
      plasma-directory: "/home/ray/tmp"
    template:
      spec:
        containers:
          - name: ray-head
            image: eu.gcr.io/redacted/custom-ray-image:latest
            env:
              - name: RAY_event_stats
                value: "0"
            resources:
              limits:
                cpu: 3
                memory: 15G
                ephemeral-storage: "8Gi"
              requests:
                cpu: 3
                memory: 15G
                ephemeral-storage: "8Gi"
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh","-c","ray stop"]
            volumeMounts:
              - name: workspace
                mountPath: "/home/ray/tmp"
        securityContext:
          fsGroup: 1000
        volumes:
          - name: workspace
            ephemeral:
              volumeClaimTemplate:
                spec:
                  accessModes: [ "ReadWriteOnce" ]
                  storageClassName: "standard"
                  resources:
                    requests:
                      storage: 100Gi
  workerGroupSpecs:
    - groupName: default
      replicas: 1
      minReplicas: 1
      maxReplicas: 5
      rayStartParams:
        disable-usage-stats: "true"
      template:
        spec:
          serviceAccount: "raycluster"
          nodeSelector:
            cloud.google.com/gke-spot: "true"
          restartPolicy: Never
          containers:
            - name: ray-worker
              image: eu.gcr.io/redacted/custom-ray-image:latest
              resources:
                limits:
                  cpu: 15
                  memory: 63Gi
                  ephemeral-storage: "9Gi"
                requests:
                  cpu: 15
                  memory: 63Gi
                  ephemeral-storage: "9Gi"
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
              env:
                - name: RAY_TMPDIR
                  value: "/home/ray/tmp"
              volumeMounts:
                - name: workspace
                  mountPath: "/home/ray/tmp"
          securityContext:
            fsGroup: 1000
          volumes:
            - name: workspace
              ephemeral:
                volumeClaimTemplate:
                  spec:
                    accessModes: [ "ReadWriteOnce" ]
                    storageClassName: "standard"
                    resources:
                      requests:
                        storage: 100Gi

Runtime environment

py_modules:
  - gs://redacted-bucket-name/runtime-env/source_code.zip
uv:
  packages: [
    # Dependencies list is redacted
  ]
  uv_pip_install_options: []
  uv_check: false
env_vars:
  UV_INDEX: http://redacted-pypi-uri
  UV_LINK_MODE: copy

Container image, used by both the head and worker pods

FROM rayproject/ray:2.43.0-py311-cpu

RUN sudo apt-get update \
    && sudo apt-get install -y --no-install-recommends git curl lsb-release libgomp1 gnupg make pkg-config libhdf5-dev gcc \
    && sudo rm -rf /var/lib/apt/lists/* \
    && $HOME/anaconda3/bin/pip install smart_open[gcs]

raylet.out, worker node

[2025-03-04 08:27:43,408 I 74 74] (raylet) runtime_env_agent_client.cc:383: Create runtime env for job 01000000
[2025-03-04 08:27:43,412 I 74 74] (raylet) worker_pool.cc:522: Started worker process with pid 10927, the token is 49
[2025-03-04 08:28:22,983 I 74 105] (raylet) store.cc:564: Plasma store debug dump: 
Current usage: 0 / 33.7829 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0

- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0

[2025-03-04 08:28:23,040 I 74 74] (raylet) node_manager.cc:532: [state-dump] NodeManager:
[state-dump] Node ID: 397034112839deeeaff5cb49d4465f4f6b09702f43b8e95692d5ad11
[state-dump] Node name: 10.21.4.38
[state-dump] InitialConfigResources: {object_store_memory: 337828626430000, CPU: 290000, memory: 1127428915200000, node:10.21.4.38: 10000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 397034112839deeeaff5cb49d4465f4f6b09702f43b8e95692d5ad11 =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 1
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 1
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state: 
[state-dump] Local id: -8935084048543528547 Local resources: {"total":{memory: [1127428915200000], object_store_memory: [337828626430000], CPU: [290000], node:10.21.4.38: [10000]}}, "available": {memory: [1116691496960000], object_store_memory: [337828626430000], CPU: [280000], node:10.21.4.38: [10000]}}, "labels":{"ray.io/node_id":"397034112839deeeaff5cb49d4465f4f6b09702f43b8e95692d5ad11",} is_draining: 0 is_idle: 0 Cluster resources: node id: -8935084048543528547{"total":{CPU: 290000, object_store_memory: 337828626430000, node:10.21.4.38: 10000, memory: 1127428915200000}}, "available": {CPU: 280000, node:10.21.4.38: 10000, memory: 1116691496960000, object_store_memory: 337828626430000}}, "labels":{"ray.io/node_id":"397034112839deeeaff5cb49d4465f4f6b09702f43b8e95692d5ad11",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}node id: 7784598607884737558{"total":{node:10.21.6.18: 10000, CPU: 150000, object_store_memory: 202539196410000, memory: 676457349120000}}, "available": {memory: 676457349120000, CPU: 150000, node:10.21.6.18: 10000, object_store_memory: 202539196410000}}, "labels":{"ray.io/node_id":"1eaa72da8564f86e61a42699c9a9321d343de00bd2d296696d07b7e0",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1}node id: 341252914104009421{"total":{node:10.21.0.125: 10000, node:__internal_head__: 10000, memory: 150000000000000, object_store_memory: 44076417020000}}, "available": {node:__internal_head__: 10000, memory: 150000000000000, object_store_memory: 44076417020000, node:10.21.0.125: 10000}}, "labels":{"ray.io/node_id":"74b41c88769732857d4cb3fb13e6ea328eb578c18d0d197d8e2d17d0",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 1
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Backlog Size per scheduling descriptor :{workerId: num backlogs}:
[state-dump] 
[state-dump] Running tasks by scheduling class:
[state-dump]     - {depth=1 function_descriptor={type=PythonFunctionDescriptor, module_name=ray.dashboard.modules.job.job_supervisor, class_name=JobSupervisor, function_name=__init__, function_hash=9036cebbd0b144a594ef7fed9759a40b} scheduling_strategy=default_scheduling_strategy {
[state-dump] }
[state-dump]  resource_set={CPU : 1, memory : 1.07374e+09, }}: 1/29
[state-dump] ==================================================
[state-dump] 
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump] 
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time:  mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 0.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 33782862643
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump] 
[state-dump] WorkerPool:
[state-dump] - registered jobs: 2
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 49
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num PYTHON pending start requests: 0
[state-dump] - num PYTHON pending registration requests: 1
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers: 
[state-dump] 1eaa72da8564f86e61a42699c9a9321d343de00bd2d296696d07b7e0
[state-dump] 74b41c88769732857d4cb3fb13e6ea328eb578c18d0d197d8e2d17d0
[state-dump] Event stats:
[state-dump] Global stats: 144489 total (15 active)
[state-dump] Queueing time: mean = 85.531 us, max = 163.400 ms, min = -0.000 s, total = 12.358 s
[state-dump] Execution time:  mean = 20.808 ms, total = 3006.545 s
[state-dump] Event stats:
[state-dump]    RaySyncer.OnDemandBroadcasting - 31768 total (1 active), Execution time: mean = 8.941 us, total = 284.024 ms, Queueing time: mean = 93.895 us, max = 154.543 ms, min = 12.911 us, total = 2.983 s
[state-dump]    NodeManager.CheckGC - 31768 total (1 active), Execution time: mean = 1.778 us, total = 56.491 ms, Queueing time: mean = 100.677 us, max = 154.544 ms, min = 11.620 us, total = 3.198 s
[state-dump]    ObjectManager.UpdateAvailableMemory - 31767 total (0 active), Execution time: mean = 2.889 us, total = 91.790 ms, Queueing time: mean = 39.688 us, max = 1.796 ms, min = 2.360 us, total = 1.261 s
[state-dump]    RayletWorkerPool.deadline_timer.kill_idle_workers - 15893 total (1 active), Execution time: mean = 9.943 us, total = 158.027 ms, Queueing time: mean = 86.920 us, max = 159.494 ms, min = 11.430 us, total = 1.381 s
[state-dump]    MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 12697 total (1 active), Execution time: mean = 382.918 us, total = 4.862 s, Queueing time: mean = 85.785 us, max = 145.111 ms, min = -0.000 s, total = 1.089 s
[state-dump]    NodeManager.deadline_timer.flush_free_objects - 3180 total (1 active), Execution time: mean = 7.632 us, total = 24.270 ms, Queueing time: mean = 160.610 us, max = 163.107 ms, min = -0.000 s, total = 510.738 ms
[state-dump]    NodeManager.ScheduleAndDispatchTasks - 3180 total (1 active), Execution time: mean = 16.941 us, total = 53.872 ms, Queueing time: mean = 74.114 us, max = 9.336 ms, min = 17.210 us, total = 235.681 ms
[state-dump]    NodeManager.deadline_timer.spill_objects_when_over_threshold - 3180 total (1 active), Execution time: mean = 1.495 us, total = 4.754 ms, Queueing time: mean = 164.827 us, max = 163.102 ms, min = 9.831 us, total = 524.149 ms
[state-dump]    NodeManagerService.grpc_server.GetResourceLoad.HandleRequestImpl - 3178 total (0 active), Execution time: mean = 76.031 us, total = 241.628 ms, Queueing time: mean = 62.734 us, max = 476.290 us, min = 8.280 us, total = 199.368 ms
[state-dump]    NodeManagerService.grpc_server.GetResourceLoad - 3178 total (0 active), Execution time: mean = 424.090 us, total = 1.348 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ClusterResourceManager.ResetRemoteNodeView - 1060 total (1 active), Execution time: mean = 7.353 us, total = 7.794 ms, Queueing time: mean = 69.101 us, max = 609.512 us, min = -0.000 s, total = 73.247 ms
[state-dump]    NodeManager.GcsCheckAlive - 636 total (1 active), Execution time: mean = 213.634 us, total = 135.871 ms, Queueing time: mean = 616.980 us, max = 163.364 ms, min = 13.945 us, total = 392.399 ms
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive.OnReplyReceived - 636 total (0 active), Execution time: mean = 30.041 us, total = 19.106 ms, Queueing time: mean = 43.623 us, max = 195.300 us, min = 8.420 us, total = 27.744 ms
[state-dump]    NodeManager.deadline_timer.record_metrics - 636 total (1 active), Execution time: mean = 211.671 us, total = 134.623 ms, Queueing time: mean = 619.509 us, max = 163.400 ms, min = 21.752 us, total = 394.008 ms
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.CheckAlive - 636 total (0 active), Execution time: mean = 1.521 ms, total = 967.068 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    NodeManager.deadline_timer.debug_state_dump - 318 total (1 active), Execution time: mean = 1.586 ms, total = 504.450 ms, Queueing time: mean = 75.392 us, max = 280.141 us, min = 13.225 us, total = 23.975 ms
[state-dump]    NodeManagerService.grpc_server.GetNodeStats.HandleRequestImpl - 211 total (0 active), Execution time: mean = 44.277 us, total = 9.342 ms, Queueing time: mean = 63.088 us, max = 223.910 us, min = 24.030 us, total = 13.311 ms
[state-dump]    NodeManagerService.grpc_server.GetNodeStats - 211 total (0 active), Execution time: mean = 396.350 us, total = 83.630 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]     - 99 total (0 active), Execution time: mean = 2.940 us, total = 291.082 us, Queueing time: mean = 47.472 us, max = 115.760 us, min = 17.710 us, total = 4.700 ms
[state-dump]    RaySyncer.BroadcastMessage - 99 total (0 active), Execution time: mean = 219.674 us, total = 21.748 ms, Queueing time: mean = 288.525 ns, max = 1.060 us, min = 140.000 ns, total = 28.564 us
[state-dump]    NodeManager.deadline_timer.print_event_loop_stats - 53 total (1 active, 1 running), Execution time: mean = 811.199 us, total = 42.994 ms, Queueing time: mean = 69.617 us, max = 145.555 us, min = 20.445 us, total = 3.690 ms
[state-dump]    WorkerPool.PopWorkerCallback - 49 total (0 active), Execution time: mean = 23.252 us, total = 1.139 ms, Queueing time: mean = 7.898 us, max = 142.970 us, min = 3.340 us, total = 386.992 us
[state-dump]    PeriodicalRunner.RunFnPeriodically - 13 total (0 active), Execution time: mean = 197.107 us, total = 2.562 ms, Queueing time: mean = 3.011 ms, max = 8.719 ms, min = 27.960 us, total = 39.140 ms
[state-dump]    ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 10 total (1 active), Execution time: mean = 299.746 s, total = 2997.459 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll.OnReplyReceived - 9 total (0 active), Execution time: mean = 250.797 us, total = 2.257 ms, Queueing time: mean = 69.940 us, max = 104.340 us, min = 43.050 us, total = 629.460 us
[state-dump]    NodeManager.GCTaskFailureReason - 4 total (1 active), Execution time: mean = 5.440 us, total = 21.759 us, Queueing time: mean = 54.017 us, max = 88.242 us, min = 43.511 us, total = 216.069 us
[state-dump]    Subscriber.HandlePublishedMessage_GCS_JOB_CHANNEL - 2 total (0 active), Execution time: mean = 16.375 us, total = 32.749 us, Queueing time: mean = 238.275 us, max = 272.111 us, min = 204.440 us, total = 476.551 us
[state-dump]    Subscriber.HandlePublishedMessage_GCS_NODE_INFO_CHANNEL - 2 total (0 active), Execution time: mean = 92.085 us, total = 184.171 us, Queueing time: mean = 264.510 us, max = 270.299 us, min = 258.720 us, total = 529.019 us
[state-dump]    ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), Execution time: mean = 1.360 ms, total = 2.721 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 2 total (0 active), Execution time: mean = 206.850 us, total = 413.699 us, Queueing time: mean = 522.745 us, max = 900.930 us, min = 144.559 us, total = 1.045 ms
[state-dump]    RaySyncerRegister - 2 total (0 active), Execution time: mean = 1.395 us, total = 2.789 us, Queueing time: mean = 140.000 ns, max = 200.000 ns, min = 80.000 ns, total = 280.000 ns
[state-dump]    ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 911.929 us, total = 911.929 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 28.060 us, total = 28.060 us, Queueing time: mean = 28.519 us, max = 28.519 us, min = 28.519 us, total = 28.519 us
[state-dump]    NodeManagerService.grpc_server.RequestWorkerLease.HandleRequestImpl - 1 total (0 active), Execution time: mean = 367.420 us, total = 367.420 us, Queueing time: mean = 78.500 us, max = 78.500 us, min = 78.500 us, total = 78.500 us
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 2.836 ms, total = 2.836 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 1.264 ms, total = 1.264 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 130.420 us, total = 130.420 us, Queueing time: mean = 36.750 us, max = 36.750 us, min = 36.750 us, total = 36.750 us
[state-dump]    ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 1.588 ms, total = 1.588 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump]    ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 18.365 ms, total = 18.365 ms, Queueing time: mean = 28.051 us, max = 28.051 us, min = 28.051 us, total = 28.051 us
[state-dump]    ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 289.119 us, total = 289.119 us, Queueing time: mean = 27.951 us, max = 27.951 us, min = 27.951 us, total = 27.951 us
[state-dump]    NodeManagerService.grpc_server.RequestWorkerLease - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] DebugString() time ms: 0
[state-dump] 
[state-dump] 
[2025-03-04 08:28:43,412 E 74 74] (raylet) worker_pool.cc:581: Some workers of the worker process(10927) have not registered within the timeout. The process is dead, probably it crashed during start.
[2025-03-04 08:28:43,464 I 74 74] (raylet) runtime_env_agent_client.cc:383: Create runtime env for job 01000000
[2025-03-04 08:28:43,468 I 74 74] (raylet) worker_pool.cc:522: Started worker process with pid 11132, the token is 50

runtime_env_setup-01000000.log, worker node


(base) ray@raycluster-default-worker-mrwpg:~/tmp/session_latest/logs$ head -n 500 runtime_env_setup-01000000.log 
2025-03-04 07:38:17,290 INFO virtualenv_utils.py:104 -- Creating virtualenv at /home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv, current python dir /home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv
2025-03-04 07:38:17,291 INFO utils.py:76 -- Run cmd[1] ['/home/ray/anaconda3/bin/python3.11', '-m', 'virtualenv', '--app-data', '/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv_app_data', '--reset-app-data', '--no-periodic-update', '--system-site-packages', '--no-download', '/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv']
2025-03-04 07:38:17,904 INFO utils.py:101 -- Output of cmd[1]: created virtual environment CPython3.11.11.final.0-64 in 449ms
  creator CPython3Posix(dest=/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv, clear=False, no_vcs_ignore=False, global=True)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv_app_data)
    added seed packages: pip==24.3.1, setuptools==75.8.0, wheel==0.45.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

2025-03-04 07:38:17,906 INFO utils.py:76 -- Run cmd[2] ['/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv/bin/python', '-c', '\nimport ray\nwith open(r"/tmp/check_ray_version_tempfilekjcuiymh/ray_version.txt", "wt") as f:\n    f.write(ray.__version__)\n    f.write(" ")\n    f.write(ray.__path__[0])\n                ']
2025-03-04 07:38:18,547 INFO utils.py:103 -- No output for cmd[2]
2025-03-04 07:38:18,548 INFO dependency_utils.py:57 -- try to write ray version information in: /tmp/check_ray_version_tempfilekjcuiymh/ray_version.txt
2025-03-04 07:38:18,548 INFO utils.py:76 -- Run cmd[3] ['/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv/bin/python/virtualenv/bin/python', '-m', 'uv', 'version']
2025-03-04 07:38:18,553 INFO uv.py:101 -- Installing package uv to /home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv
2025-03-04 07:38:18,553 INFO utils.py:76 -- Run cmd[4] ['/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', 'uv']
2025-03-04 07:38:20,330 INFO utils.py:101 -- Output of cmd[4]: Looking in indexes: https://pypi.org/simple, http://redacted-pypi-uri
WARNING: The repository located at redacted-pypi-uri is not a trusted or secure host and is being ignored. If this repository is available via HTTPS we recommend you use HTTPS instead, otherwise you may silence this warning and allow it anyway with '--trusted-host redacted-pypi-uri'.
Collecting uv
  Downloading uv-0.6.4-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.6.4-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.2/16.2 MB 437.5 MB/s eta 0:00:00
Installing collected packages: uv
Successfully installed uv-0.6.4

2025-03-04 07:38:20,331 INFO uv.py:184 -- Installing python requirements to /home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv
2025-03-04 07:38:20,331 INFO utils.py:76 -- Run cmd[5] ['/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv/bin/python', '-m', 'uv', 'pip', 'install', '-r', '/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/ray_runtime_env_internal_pip_requirements.txt']
2025-03-04 07:38:37,864 INFO utils.py:101 -- Output of cmd[5]: Using Python 3.11.11 environment at: /home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv

# Removed the dependencies installation

2025-03-04 07:38:37,865 INFO utils.py:76 -- Run cmd[6] ['/home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/uv/a278079ad50f22102c3c3dc3e4606a7b0e100f0d/virtualenv/bin/python', '-c', '\nimport ray\nwith open(r"/tmp/check_ray_version_tempfilejfgol0g0/ray_version.txt", "wt") as f:\n    f.write(ray.__version__)\n    f.write(" ")\n    f.write(ray.__path__[0])\n                ']
2025-03-04 07:38:38,758 INFO utils.py:103 -- No output for cmd[6]
2025-03-04 07:38:38,758 INFO dependency_utils.py:57 -- try to write ray version information in: /tmp/check_ray_version_tempfilejfgol0g0/ray_version.txt
2025-03-04 07:38:39,652 INFO uri_cache.py:88 -- Added URI uv://a278079ad50f22102c3c3dc3e4606a7b0e100f0d with size 2336455327
2025-03-04 07:38:39,653 INFO packaging.py:713 -- Downloading package from gs://redacted-bucket-name/runtime-env/source_code.zip to /home/ray/tmp/session_2025-03-04_07-35-19_687314_1/runtime_resources/py_modules_files/gs_redacted-bucket-name_runtime-env_source_code.zip with protocol Protocol.GS
2025-03-04 07:38:40,574 INFO uri_cache.py:88 -- Added URI gs://redacted-bucket-name/runtime-env/source_code.zip with size 4627951
2025-03-04 07:39:40,645 INFO plugin.py:257 -- Runtime env uv uv://a278079ad50f22102c3c3dc3e4606a7b0e100f0d is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2025-03-04 07:39:40,645 INFO uri_cache.py:75 -- Marked URI uv://a278079ad50f22102c3c3dc3e4606a7b0e100f0d used.
2025-03-04 07:39:40,646 INFO plugin.py:257 -- Runtime env py_modules gs://redacted-bucket-name/runtime-env/source_code.zip is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2025-03-04 07:39:40,646 INFO uri_cache.py:75 -- Marked URI gs://redacted-bucket-name/runtime-env/source_code.zip used.
2025-03-04 07:40:40,711 INFO plugin.py:257 -- Runtime env uv uv://a278079ad50f22102c3c3dc3e4606a7b0e100f0d is already installed and will be reused. Search all runtime_env_setup-*.log to find the corresponding setup log.
2025-03-04 07:40:40,711 INFO uri_cache.py:75 -- Marked URI uv://a278079ad50f22102c3c3dc3e4606a7b0e100f0d used.

Note that the setup above uses the latest Ray version, 2.43.0, but the same problem also happened with the previous 2.41.0, so it does not appear to be caused by the upgrade.

Any ideas on what I should try to solve it? :slight_smile:

The error makes it seem like Ray is not installed properly or the worker is not finding Ray when it starts.

A few questions

  • Can this be reproduced with a slimmed down example (e.g., only a head node, no workers), minimal Ray application code?
  • Seems like the worker processes are not finding Ray… could the worker somehow have a different Python path or environment from the driver?

After spending the whole day digging into it, I found the problem and it was a bit tricky!

The project I’m working on requires extra build steps before it becomes an importable Python module, so I can’t use working_dir directly. The build process produces a zip archive with a directory tree that looks like my_project/__init__.py, my_project/module1.py, and so on. This archive is then uploaded to GCS and added to the runtime environment via its remote URI.

What I found is that when Ray downloads and unpacks this archive on the head or worker node, the folder structure looks like this:

(base) ray@raycluster-head-qd87q:~$ ls ~/tmp/session_latest/runtime_resources/py_modules_files/gs_redacted-bucket-name_runtime-env_source-code
__init__.py  module1.py

Note that the my_project folder no longer exists here, and PYTHONPATH points directly to this folder:

PYTHONPATH=/home/ray/tmp/session_2025-03-05_08-16-59_056740_1/runtime_resources/py_modules_files/gs_redacted-bucket-name_runtime-env_source-code

So what actually happened is this: I have a my_project/ray.py module containing a wrapper on top of the Job Submission SDK.
With PYTHONPATH set like this, Python imported ray not from the globally installed package, but from my /home/ray/tmp/session_2025-03-05_08-16-59_056740_1/runtime_resources/py_modules_files/gs_redacted-bucket-name_runtime-env_source-code/ray.py instead.

An obvious solution to this problem is to create an archive with two nested directories, my_project/my_project/__init__.py, so that the top one gets “eaten” by Ray and I can still run python -m my_project afterward.
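A minimal sketch of the fix using Python’s zipfile, with the unpack-and-strip step simulated locally (file names and the VALUE constant are hypothetical):

```python
import io
import os
import sys
import tempfile
import zipfile

# Build the archive with an extra wrapper directory: Ray strips the single
# top-level directory on unpack, so the inner my_project/ package survives.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("my_project/my_project/__init__.py", "")
    zf.writestr("my_project/my_project/module1.py", "VALUE = 42\n")

# Simulate the unpack: extract, then put the contents of the top-level
# directory on sys.path (as PYTHONPATH would on the worker node).
extract_root = tempfile.mkdtemp()
with zipfile.ZipFile(buf) as zf:
    zf.extractall(extract_root)
sys.path.insert(0, os.path.join(extract_root, "my_project"))

import my_project.module1

print(my_project.module1.VALUE)  # 42
```

With this layout the directory on PYTHONPATH contains a my_project/ package rather than its loose contents, so nothing shadows the real ray package.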

This is actually documented in the remote URIs documentation, but I guess I didn’t realize how it would affect my setup:

The zip file must contain only a single top-level directory. The contents of this directory will be directly accessed as the working_dir or a py_module.

Thanks for your help, @rkn!

Ok, that makes sense, thanks for sharing the detailed writeup!

In retrospect, the fact that it seemed to import ray successfully but not ray._private hints at the existence of another ray module.