Ray k8s cluster: cannot run a new task after the previous task failed

By the way, has your team tested the Ray k8s cluster? I found there are no clear docs on it, and I have tried many times to make the Ray k8s cluster work.

We don’t use the community Ray k8s cluster. Maybe someone else can help? Or you could create a new discussion in the Ray Clusters category?

Thanks. I would like to seek help from the Ray Clusters panel.

@GoingMyWay any luck with this?

I’m also not super familiar with the k8s side of things, so @Dmitri can probably confirm if I’m missing anything here. From what you’ve posted, the components folder is already copied to the local disk of each node before Ray starts, so you don’t need runtime_env, which makes sense. Given that, it would be helpful to verify that components is importable by Python on the worker nodes (not just the head node). If the import fails, it would also be good to see the output of python -c "import sys; print(sys.path)", since that shows the folders Python imports from.

Depending on the outcome, there are some other paths we can go down, like starting a bunch of Ray tasks that run print(sys.path) and import components in their body, and possibly also printing some details about the local disk (like whether the local components folder is present).
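Such a debug task could look like the sketch below. The function body is standard-library only; the /home/me/app/myproject/src/components path is an assumption based on the directory tree shared later in this thread, so adjust it to your layout.

```python
import os
import sys

def debug_env():
    """Collect the facts we want to compare across nodes."""
    info = {
        "cwd": os.getcwd(),
        "sys_path": list(sys.path),
        # Assumed location of the private package; adjust if yours differs.
        "components_present": os.path.isdir("/home/me/app/myproject/src/components"),
    }
    try:
        import components  # noqa: F401
        info["import_ok"] = True
    except ImportError as e:
        info["import_ok"] = False
        info["import_error"] = str(e)
    return info

# On the cluster, run it as a Ray task on many workers so some land
# on worker nodes, e.g.:
#   remote_debug = ray.remote(debug_env)
#   print(ray.get([remote_debug.remote() for _ in range(8)]))
print(debug_env())
```

Comparing the head-node output with the worker outputs should show whether sys.path (or the folder itself) differs between them.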

Some worker pod debugging as suggested above could help.
(Though strictly speaking, kubectl exec would be the way to go, not SSH.)

Yes. I actually use the following command to connect to the pod:

kubectl exec -it ray-cluster-i-ray-head-7jxs8 -n ray-cluster -- bash 

@architkulkarni , note that components is a private module of my project. In the workspace I can import it directly; I do not need to install it. The following are the main directories of the code:

# show workspace
$ pwd
/home/me/app
$ cd /home/me/app && tree
|-- docker
|-- myproject
|   |-- src
|   |   |-- components
|   |   |-- config
|   |   |   |-- algs
|   |   |   `-- envs
|   |   |-- controllers
|   |   |-- envs
|   |   |-- learners
|   |   |-- pretrained
|   |   |-- runners
|   |   |-- training_methods
|   |   |-- train_model_ray
|   |   `-- utils
|-- k8s
|   |-- depoly
|   |   `-- charts
|   |       `-- ray
|   |           |-- crds
|   |           `-- templates
|   `-- ray_cluster
|-- scripts
|   |-- cluster
|   |-- run_jobs
|   |-- submit_jobs
|   |-- tools
|   `-- utils

As you can see, components is in /home/me/app/myproject/src/components. Then, in the workspace, I run the code

python myproject/src/main.py 

That is how I run the code.

Yup, agree you don’t need to install anything!

Could you help me understand the concept of “workspace”: does that mean a copy of this local filesystem is present on each node in the cluster? And when you run python myproject/src/main.py, is this happening on the same node as the Ray head node?

The hypothesis we’re trying to figure out is if components is importable on the head node but not importable on the worker nodes for some reason.

@architkulkarni, the workspace is the path where my code is. By using k8s, the code can be mounted to the head node and the worker nodes.

I run the code in the head node.

@GoingMyWay Thanks for the details. I think I have a better idea now. Can you try ray.init(address=<your address>, runtime_env={"env_vars": {"PYTHONPATH": "/home/me/app/myproject/src/"}})?

My guess is that Ray Python processes run in different working directories depending on whether they were started on the head node or on a worker node. I suspect that if you print sys.path in a Ray task, it will show different directories in the two cases. Setting the environment variable above guarantees that "/home/me/app/myproject/src/" is appended to sys.path in every Ray task and actor, so Python will search the src directory for imports.

(In general, the recommended path is to use the runtime_env "working_dir" option, since that handles both syncing files to the cluster node and setting the cwd and PYTHONPATH for you. But since you already have the files synced to every node, you can just use the "env_vars" approach above.)
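To illustrate the mechanism, here is a minimal standard-library simulation of what setting PYTHONPATH via "env_vars" does for a worker process: a child Python interpreter started with PYTHONPATH pointing at a directory can import a top-level package from it. The components package here is a throwaway stand-in created on the fly, not your real one.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as src_dir:
    # Create a stand-in private package, playing the role of `components`.
    pkg = os.path.join(src_dir, "components")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("VALUE = 42\n")

    # Start a fresh interpreter the way a Ray worker would be started
    # when runtime_env={"env_vars": {"PYTHONPATH": src_dir}} is set.
    env = dict(os.environ, PYTHONPATH=src_dir)
    out = subprocess.run(
        [sys.executable, "-c", "import components; print(components.VALUE)"],
        env=env, capture_output=True, text=True,
    )
    print(out.stdout.strip())  # prints: 42
```

Without the PYTHONPATH entry the child interpreter would raise ModuleNotFoundError, which is exactly the head-vs-worker asymmetry suspected above.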

@architkulkarni, thanks. I tried setting the path. However, the job shows

(pid=gcs_server) [2022-06-30 09:41:35,328 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 9d6eb47ea04a1bd347b7038486b1f74939be7b8bece79c5fdfa79d06 for actor c4cd3073a7746c5219194a2401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-30 09:41:35,340 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node f04b5ee2e409c3f8319b178ea573e4e78203ec9324a9cfae8d1ef1ed for actor c3d3ee366f2148fd1a54b1d201000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED

And then the job failed.

@architkulkarni, hi, could you ask someone who developed the Ray cluster to help me?

Appreciate your patience! For now I’m the best equipped to help on this issue, unless we determine the current hypothesis is invalid.

The details of the runtime_env setup failure will be in dashboard_agent.log and runtime_env_setup-*.log. By default these are in /tmp/ray/session_latest/logs on the head node. Are there any details in those log files?
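A quick way to pull the relevant tails out of those files (run inside the head pod; /tmp/ray/session_latest/logs is Ray's default log directory, so adjust if you changed it):

```python
import glob
import os

# Default Ray log location on the head node; session_latest is a symlink
# to the current session's directory.
log_dir = "/tmp/ray/session_latest/logs"

# runtime_env setup failures land in these files.
patterns = ["runtime_env_setup-*.log", "dashboard_agent.log"]
paths = sorted(p for pat in patterns for p in glob.glob(os.path.join(log_dir, pat)))

for path in paths:
    print("==>", path)
    with open(path) as f:
        # Print only the last ~2 KB of each file to keep output readable.
        print(f.read()[-2000:])
```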

Hi @architkulkarni, the following is the content of dashboard_agent.log:

2022-07-04 09:53:30,353	INFO head.py:141 -- Dashboard head grpc address: 0.0.0.0:36165
2022-07-04 09:53:30,353	INFO dashboard.py:95 -- Setup static dir for dashboard: /home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/client/build
2022-07-04 09:53:30,357	INFO utils.py:79 -- Get all modules by type: DashboardHeadModule
2022-07-04 09:53:30,616	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>
2022-07-04 09:53:30,617	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>
2022-07-04 09:53:30,617	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.job.job_head.JobHead'>
2022-07-04 09:53:30,617	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.log.log_head.LogHead'>
2022-07-04 09:53:30,619	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.node.node_head.NodeHead'>
2022-07-04 09:53:30,619	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>
2022-07-04 09:53:30,619	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>
2022-07-04 09:53:30,619	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.tune.tune_head.TuneController'>
2022-07-04 09:53:30,619	INFO head.py:186 -- Loaded 8 modules.
2022-07-04 09:53:30,621	INFO head.py:273 -- Dashboard head http address: 172.24.20.109:8265
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /> -> <function Dashboard.get_index at 0x7fcf2a2d1af0>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /favicon.ico> -> <function Dashboard.get_favicon at 0x7fcf2a2d1c10>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <StaticResource  /static -> PosixPath('/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/client/build/static')> -> <bound method StaticResource._handle of <StaticResource  /static -> PosixPath('/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/client/build/static')>>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /logical/actor_groups> -> <function ActorHead.get_actor_groups at 0x7fcf28a30040>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /logical/actors> -> <function ActorHead.get_all_actors[cache ttl=2, max_size=128] at 0x7fcf28a30160>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /logical/kill_actor> -> <function ActorHead.kill_actor at 0x7fcf28a30310>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /events> -> <function EventHead.get_event[cache ttl=2, max_size=128] at 0x7fcf28a389d0>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/version> -> <function JobHead.get_version at 0x7fcf2898b790>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /api/packages/{protocol}/{package_name}> -> <function JobHead.get_package at 0x7fcf2898b940>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [PUT] <DynamicResource  /api/packages/{protocol}/{package_name}> -> <function JobHead.upload_package at 0x7fcf2898baf0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [POST] <PlainResource  /api/jobs/> -> <function JobHead.submit_job at 0x7fcf2898bca0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [POST] <DynamicResource  /api/jobs/{job_id}/stop> -> <function JobHead.stop_job at 0x7fcf2898be50>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}> -> <function JobHead.get_job_status at 0x7fcf2898d040>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}/logs> -> <function JobHead.get_job_logs at 0x7fcf2898d1f0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}/logs/tail> -> <function JobHead.tail_job_logs at 0x7fcf2898d3a0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /log_index> -> <function LogHead.get_log_index at 0x7fcf2898dd30>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /log_proxy> -> <function LogHead.get_log_from_proxy at 0x7fcf2898de50>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /nodes> -> <function NodeHead.get_all_nodes[cache ttl=2, max_size=128] at 0x7fcf28994940>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /nodes/{node_id}> -> <function NodeHead.get_node[cache ttl=2, max_size=128] at 0x7fcf28994af0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /memory/memory_table> -> <function NodeHead.get_memory_table at 0x7fcf28994ca0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /memory/set_fetch> -> <function NodeHead.set_fetch_memory_info at 0x7fcf28994dc0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /node_logs> -> <function NodeHead.get_logs at 0x7fcf28994ee0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /node_errors> -> <function NodeHead.get_errors at 0x7fcf2899c040>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/launch_profiling> -> <function ReportHead.launch_profiling at 0x7fcf287e4160>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/ray_config> -> <function ReportHead.get_ray_config at 0x7fcf287e4280>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/cluster_status> -> <function ReportHead.get_cluster_status at 0x7fcf287e43a0>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/actors/kill> -> <function APIHead.kill_actor_gcs at 0x7fcf287edaf0>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/snapshot> -> <function APIHead.snapshot at 0x7fcf287edc10>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /tune/info> -> <function TuneController.tune_info at 0x7fcf035c31f0>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /tune/availability> -> <function TuneController.get_availability at 0x7fcf035c3310>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /tune/set_experiment> -> <function TuneController.set_tune_experiment at 0x7fcf035c3430>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /tune/enable_tensorboard> -> <function TuneController.enable_tensorboard at 0x7fcf035c3550>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-07-04_09-53-28_681369_59/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-07-04_09-53-28_681369_59/logs')>>
2022-07-04 09:53:30,625	INFO head.py:292 -- Registered 33 routes.
2022-07-04 09:53:30,625	INFO datacenter.py:70 -- Purge data.
2022-07-04 09:53:30,626	INFO event_utils.py:123 -- Monitor events logs modified after 1656897810.3644762 on /tmp/ray/session_2022-07-04_09-53-28_681369_59/logs/events, the source types are ['GCS'].
2022-07-04 09:53:30,627	INFO actor_head.py:75 -- Getting all actor info from GCS.
2022-07-04 09:53:30,629	INFO actor_head.py:101 -- Received 0 actor info from GCS.
2022-07-04 09:58:18,303	ERROR node_head.py:242 -- Error updating node stats of 0fe754916d67eaedaf08481fedfa14289352229d0939c16ac91bb3da.
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/node/node_head.py", line 235, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/home/me/miniconda3/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1656899898.303532309","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1656899898.303530295","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
2022-07-04 09:58:19,316	ERROR node_head.py:242 -- Error updating node stats of 0fe754916d67eaedaf08481fedfa14289352229d0939c16ac91bb3da.
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/node/node_head.py", line 235, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/home/me/miniconda3/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1656899899.316458121","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1656899899.316456588","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
....

The following is the content of the dashboard.log:

2022-07-04 09:53:30,353	INFO head.py:141 -- Dashboard head grpc address: 0.0.0.0:36165
2022-07-04 09:53:30,353	INFO dashboard.py:95 -- Setup static dir for dashboard: /home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/client/build
2022-07-04 09:53:30,357	INFO utils.py:79 -- Get all modules by type: DashboardHeadModule
2022-07-04 09:53:30,616	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.actor.actor_head.ActorHead'>
2022-07-04 09:53:30,617	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.event.event_head.EventHead'>
2022-07-04 09:53:30,617	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.job.job_head.JobHead'>
2022-07-04 09:53:30,617	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.log.log_head.LogHead'>
2022-07-04 09:53:30,619	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.node.node_head.NodeHead'>
2022-07-04 09:53:30,619	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.reporter.reporter_head.ReportHead'>
2022-07-04 09:53:30,619	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.snapshot.snapshot_head.APIHead'>
2022-07-04 09:53:30,619	INFO head.py:181 -- Loading DashboardHeadModule: <class 'ray.dashboard.modules.tune.tune_head.TuneController'>
2022-07-04 09:53:30,619	INFO head.py:186 -- Loaded 8 modules.
2022-07-04 09:53:30,621	INFO head.py:273 -- Dashboard head http address: 172.24.20.109:8265
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /> -> <function Dashboard.get_index at 0x7fcf2a2d1af0>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /favicon.ico> -> <function Dashboard.get_favicon at 0x7fcf2a2d1c10>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <StaticResource  /static -> PosixPath('/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/client/build/static')> -> <bound method StaticResource._handle of <StaticResource  /static -> PosixPath('/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/client/build/static')>>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /logical/actor_groups> -> <function ActorHead.get_actor_groups at 0x7fcf28a30040>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /logical/actors> -> <function ActorHead.get_all_actors[cache ttl=2, max_size=128] at 0x7fcf28a30160>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /logical/kill_actor> -> <function ActorHead.kill_actor at 0x7fcf28a30310>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /events> -> <function EventHead.get_event[cache ttl=2, max_size=128] at 0x7fcf28a389d0>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/version> -> <function JobHead.get_version at 0x7fcf2898b790>
2022-07-04 09:53:30,623	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /api/packages/{protocol}/{package_name}> -> <function JobHead.get_package at 0x7fcf2898b940>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [PUT] <DynamicResource  /api/packages/{protocol}/{package_name}> -> <function JobHead.upload_package at 0x7fcf2898baf0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [POST] <PlainResource  /api/jobs/> -> <function JobHead.submit_job at 0x7fcf2898bca0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [POST] <DynamicResource  /api/jobs/{job_id}/stop> -> <function JobHead.stop_job at 0x7fcf2898be50>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}> -> <function JobHead.get_job_status at 0x7fcf2898d040>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}/logs> -> <function JobHead.get_job_logs at 0x7fcf2898d1f0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /api/jobs/{job_id}/logs/tail> -> <function JobHead.tail_job_logs at 0x7fcf2898d3a0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /log_index> -> <function LogHead.get_log_index at 0x7fcf2898dd30>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /log_proxy> -> <function LogHead.get_log_from_proxy at 0x7fcf2898de50>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /nodes> -> <function NodeHead.get_all_nodes[cache ttl=2, max_size=128] at 0x7fcf28994940>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <DynamicResource  /nodes/{node_id}> -> <function NodeHead.get_node[cache ttl=2, max_size=128] at 0x7fcf28994af0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /memory/memory_table> -> <function NodeHead.get_memory_table at 0x7fcf28994ca0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /memory/set_fetch> -> <function NodeHead.set_fetch_memory_info at 0x7fcf28994dc0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /node_logs> -> <function NodeHead.get_logs at 0x7fcf28994ee0>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /node_errors> -> <function NodeHead.get_errors at 0x7fcf2899c040>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/launch_profiling> -> <function ReportHead.launch_profiling at 0x7fcf287e4160>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/ray_config> -> <function ReportHead.get_ray_config at 0x7fcf287e4280>
2022-07-04 09:53:30,624	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/cluster_status> -> <function ReportHead.get_cluster_status at 0x7fcf287e43a0>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/actors/kill> -> <function APIHead.kill_actor_gcs at 0x7fcf287edaf0>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /api/snapshot> -> <function APIHead.snapshot at 0x7fcf287edc10>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /tune/info> -> <function TuneController.tune_info at 0x7fcf035c31f0>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /tune/availability> -> <function TuneController.get_availability at 0x7fcf035c3310>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /tune/set_experiment> -> <function TuneController.set_tune_experiment at 0x7fcf035c3430>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <PlainResource  /tune/enable_tensorboard> -> <function TuneController.enable_tensorboard at 0x7fcf035c3550>
2022-07-04 09:53:30,625	INFO head.py:291 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-07-04_09-53-28_681369_59/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2022-07-04_09-53-28_681369_59/logs')>>
2022-07-04 09:53:30,625	INFO head.py:292 -- Registered 33 routes.
2022-07-04 09:53:30,625	INFO datacenter.py:70 -- Purge data.
2022-07-04 09:53:30,626	INFO event_utils.py:123 -- Monitor events logs modified after 1656897810.3644762 on /tmp/ray/session_2022-07-04_09-53-28_681369_59/logs/events, the source types are ['GCS'].
2022-07-04 09:53:30,627	INFO actor_head.py:75 -- Getting all actor info from GCS.
2022-07-04 09:53:30,629	INFO actor_head.py:101 -- Received 0 actor info from GCS.
2022-07-04 09:58:18,303	ERROR node_head.py:242 -- Error updating node stats of 0fe754916d67eaedaf08481fedfa14289352229d0939c16ac91bb3da.
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/node/node_head.py", line 235, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/home/me/miniconda3/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1656899898.303532309","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1656899898.303530295","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
[... the same "failed to connect to all addresses" traceback from node_head.py repeats roughly once per second, with the last occurrence at 2022-07-04 09:58:41 ...]

2022-07-04 10:03:30,626	INFO datacenter.py:70 -- Purge data.
2022-07-04 10:13:30,629	INFO datacenter.py:70 -- Purge data.
2022-07-04 10:23:30,631	INFO datacenter.py:70 -- Purge data.

There is no runtime_env_setup-*.log in /tmp/ray/session_latest/logs.

Thanks for posting the details. I think those error messages might just be from the cluster shutdown, since you started it at 9:53 and shut it down at 9:58.

I’m really surprised there are no runtime_env-related logs, given that there was a RuntimeEnvSetupFailure. Just to be sure, was ray.init() called with a runtime_env argument in that test?

If all else fails, you can set the PYTHONPATH environment variable in your ray_cluster.yaml file, similar to the snippet you posted earlier:

              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"

Alternatively, use any other preferred method of setting an environment variable on all nodes. You’ll want to set it to the directory where you want Python to look for packages and modules.
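For instance, a PYTHONPATH entry could be added alongside the existing one (the path below is a guess based on the workspace tree posted above, where components lives under myproject/src; point it at wherever the code is actually copied on each node):

```yaml
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
                # Hypothetical path: adjust to where your code is copied
                # on every node (head and workers alike).
                - name: PYTHONPATH
                  value: "/home/me/app/myproject/src"
```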

To check that the env var is set correctly, you can print or return os.environ and sys.path inside a Ray task.
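A minimal sketch of such a diagnostic, written as plain Python so it runs anywhere; on the cluster you would decorate it with @ray.remote and call it once per node via ray.get. The probe of "json" below is just a local stand-in; on your cluster you would probe your own private package, e.g. env_report("components"):

```python
import importlib.util
import os
import sys


def env_report(module_name: str) -> dict:
    """Gather the import-related state of the current Python process."""
    spec = importlib.util.find_spec(module_name)
    return {
        # What PYTHONPATH the process actually sees (empty string if unset).
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        # The directories Python imports from.
        "sys_path": list(sys.path),
        # Whether the module can be found at all, and from where.
        "importable": spec is not None,
        "origin": getattr(spec, "origin", None),
    }


# Probe a stdlib module locally; on the cluster, probe your own package.
report = env_report("json")
print(report["importable"], report["origin"])
```

Returning this dict from a remote task (rather than only printing it) lets you compare sys.path across workers from the driver, which is where import mismatches between head and worker pods usually show up.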

Hi, @architkulkarni, yes, the runtime_env was set correctly.

I will give it a try and report back.

Hi, @architkulkarni, it failed to launch again (on the first run) after setting runtime_env. I got the same error (see this link: Ray k8s cluster, cannot run new task when previous task failed - #27 by GoingMyWay). Here is the log of the Python driver:

me@ray-cluster-02-ray-head-rwb47:/tmp/ray/session_latest/logs$ cat python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_231.log  
[2022-07-06 09:25:08,135 I 231 231] core_worker_process.cc:120: Constructing CoreWorkerProcess. pid: 231
[2022-07-06 09:25:08,141 I 231 231] grpc_server.cc:103: driver server started, listening on port 10002.
[2022-07-06 09:25:08,145 I 231 231] core_worker.cc:157: Initializing worker at address: 172.24.26.213:10002, worker ID 01000000ffffffffffffffffffffffffffffffffffffffffffffffff, raylet 243c74b90855daf0956857e0dc44cd31daafced8e2aeca7606da667b
[2022-07-06 09:25:08,145 I 231 521] gcs_server_address_updater.cc:31: GCS Server updater thread id: 140656445990656
[2022-07-06 09:25:08,248 I 231 231] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-07-06 09:25:08,249 I 231 523] core_worker.cc:452: Event stats:


Global stats: 14 total (8 active)
Queueing time: mean = 484.286 us, max = 1.650 ms, min = 725.164 us, total = 6.780 ms
Execution time:  mean = 77.303 us, total = 1.082 ms
Event stats:
	PeriodicalRunner.RunFnPeriodically - 5 total (1 active, 1 running), CPU time: mean = 93.829 us, total = 469.146 us
	UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 570.765 us, total = 570.765 us
	CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 42.329 us, total = 42.329 us
	NodeManagerService.grpc_client.ReportWorkerBacklog - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	GcsClient.deadline_timer.check_gcs_service_address - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s


[2022-07-06 09:25:08,249 I 231 523] accessor.cc:621: Received notification for node id = 28773d2ee3897743162ed1a9c74d974b01e9683f8b11c7e8b0820750, IsAlive = 1
[2022-07-06 09:25:08,249 I 231 523] accessor.cc:621: Received notification for node id = 243c74b90855daf0956857e0dc44cd31daafced8e2aeca7606da667b, IsAlive = 1
[2022-07-06 09:25:08,936 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor ad648bfe7f181d0994773a9301000000
[2022-07-06 09:25:08,939 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: ad648bfe7f181d0994773a9301000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,939 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: ad648bfe7f181d0994773a9301000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,940 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: ad648bfe7f181d0994773a9301000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,952 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 8b56ffb91d98d28a3884a36501000000
[2022-07-06 09:25:08,952 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 8b56ffb91d98d28a3884a36501000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,953 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 8b56ffb91d98d28a3884a36501000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,953 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: 8b56ffb91d98d28a3884a36501000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,962 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 65530ea76dae821b0bb3fa1e01000000
[2022-07-06 09:25:08,962 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 65530ea76dae821b0bb3fa1e01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,963 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 65530ea76dae821b0bb3fa1e01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,963 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: 65530ea76dae821b0bb3fa1e01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,971 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 838d1a24a5ef64261773ae0c01000000
[2022-07-06 09:25:08,971 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 838d1a24a5ef64261773ae0c01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,972 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 838d1a24a5ef64261773ae0c01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,972 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: 838d1a24a5ef64261773ae0c01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,980 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor cd26919de974b64ae27de2ea01000000
[2022-07-06 09:25:08,981 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: cd26919de974b64ae27de2ea01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,981 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: cd26919de974b64ae27de2ea01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,982 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: cd26919de974b64ae27de2ea01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,989 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor f2b853940c6fc1ef922c6e2001000000
[2022-07-06 09:25:08,990 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: f2b853940c6fc1ef922c6e2001000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,990 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: f2b853940c6fc1ef922c6e2001000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,990 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: f2b853940c6fc1ef922c6e2001000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,998 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor b149d99d0b6a5ca4bd52d48a01000000
[2022-07-06 09:25:08,998 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: b149d99d0b6a5ca4bd52d48a01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,999 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: b149d99d0b6a5ca4bd52d48a01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:08,999 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: b149d99d0b6a5ca4bd52d48a01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,007 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor d306971aaa20e5b38d2ae39a01000000
[2022-07-06 09:25:09,007 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: d306971aaa20e5b38d2ae39a01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,008 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: d306971aaa20e5b38d2ae39a01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,008 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: d306971aaa20e5b38d2ae39a01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,015 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor fe43d2f7722b6c132148d1c901000000
[2022-07-06 09:25:09,016 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: fe43d2f7722b6c132148d1c901000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,016 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: fe43d2f7722b6c132148d1c901000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,017 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: fe43d2f7722b6c132148d1c901000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,024 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 21693f19d06cccbcc186163401000000
[2022-07-06 09:25:09,025 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 21693f19d06cccbcc186163401000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,025 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 21693f19d06cccbcc186163401000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,025 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: 21693f19d06cccbcc186163401000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,033 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 474602d4fa50b1a0b16c798001000000
[2022-07-06 09:25:09,034 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 474602d4fa50b1a0b16c798001000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,034 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 474602d4fa50b1a0b16c798001000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,034 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: 474602d4fa50b1a0b16c798001000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,042 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 1d6f4f0652ff661153c2ba2101000000
[2022-07-06 09:25:09,042 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 1d6f4f0652ff661153c2ba2101000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,043 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 1d6f4f0652ff661153c2ba2101000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,043 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: 1d6f4f0652ff661153c2ba2101000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,051 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 631c1dcdd0592d85ad81cc1201000000
[2022-07-06 09:25:09,052 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 631c1dcdd0592d85ad81cc1201000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,052 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 631c1dcdd0592d85ad81cc1201000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,052 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: 631c1dcdd0592d85ad81cc1201000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,060 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor d10082dafa70514217b62f2b01000000
[2022-07-06 09:25:09,061 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: d10082dafa70514217b62f2b01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,061 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: d10082dafa70514217b62f2b01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,061 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: d10082dafa70514217b62f2b01000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,069 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor f4de6475fb41a76efc09eb6801000000
[2022-07-06 09:25:09,069 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: f4de6475fb41a76efc09eb6801000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,070 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: f4de6475fb41a76efc09eb6801000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,070 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: f4de6475fb41a76efc09eb6801000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,078 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor f76ab9e1dcf3208f10b68f8501000000
[2022-07-06 09:25:09,078 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: f76ab9e1dcf3208f10b68f8501000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,079 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: f76ab9e1dcf3208f10b68f8501000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,079 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: f76ab9e1dcf3208f10b68f8501000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,087 W 231 231] actor_manager.cc:93: Failed to look up actor with name 'QueueActor'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
[2022-07-06 09:25:09,095 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 64179f610a69fef37416f80801000000
[2022-07-06 09:25:09,095 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 64179f610a69fef37416f80801000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,096 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 64179f610a69fef37416f80801000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,097 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: 64179f610a69fef37416f80801000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,098 W 231 231] actor_manager.cc:93: Failed to look up actor with name 'Buffer'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
[2022-07-06 09:25:09,115 I 231 231] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor d98cd858706ccd55efbf343201000000
[2022-07-06 09:25:09,115 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: d98cd858706ccd55efbf343201000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,116 I 231 523] actor_manager.cc:219: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: d98cd858706ccd55efbf343201000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,117 I 231 523] actor_manager.cc:219: received notification on actor, state: PENDING_CREATION, actor_id: d98cd858706ccd55efbf343201000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-07-06 09:25:09,118 I 231 523] actor_manager.cc:219: received notification on actor, state: DEAD, actor_id: d98cd858706ccd55efbf343201000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: 243c74b90855daf0956857e0dc44cd31daafced8e2aeca7606da667b, num_restarts: 0, death context type=RuntimeEnvFailedContext
[2022-07-06 09:25:09,118 I 231 523] direct_actor_task_submitter.cc:260: Failing pending tasks for actor d98cd858706ccd55efbf343201000000 because the actor is already dead.
[2022-07-06 09:25:09,119 I 231 523] task_manager.cc:414: Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=components.episode_buffer, class_name=ReplayBufferwithQueue, function_name=ready, function_hash=}, task_id=239c2f70c73fbf73d98cd858706ccd55efbf343201000000, task_name=ReplayBufferwithQueue.ready(), job_id=01000000, num_args=0, num_returns=2, depth=0, actor_task_spec={actor_id=d98cd858706ccd55efbf343201000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff01000000, actor_counter=0}, serialized_runtime_env={"uris": {"workingDirUri": "gcs://_ray_pkg_a12f4a1fd024b787.zip"}, "workingDir": "gcs://_ray_pkg_a12f4a1fd024b787.zip"}, runtime_env_uris=working_dir|gcs://_ray_pkg_a12f4a1fd024b787.zip, runtime_env_eager_install=0
[2022-07-06 09:25:09,119 I 231 523] direct_actor_task_submitter.cc:278: Failing tasks waiting for death info, size=0, actor_id=d98cd858706ccd55efbf343201000000
[2022-07-06 09:25:09,146 W 231 521] gcs_server_address_updater.cc:63: [1] Failed to get the gcs server address from raylet 1 times in a row. If it keeps failing to obtain the address, the worker might crash. Connection status GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
[2022-07-06 09:25:10,146 I 231 523] raylet_client.cc:328: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
[2022-07-06 09:25:10,249 I 231 523] raylet_client.cc:328: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 
[2022-07-06 09:25:10,256 I 231 231] core_worker.cc:527: Disconnecting to the raylet.
[2022-07-06 09:25:10,256 I 231 231] raylet_client.cc:150: RayletClient::Disconnect, exit_type=INTENDED_EXIT, has creation_task_exception_pb_bytes=0
[2022-07-06 09:25:10,256 W 231 231] raylet_client.cc:173: IOError: Broken pipe [RayletClient] Failed to disconnect from raylet. This means the raylet the worker is connected is probably already dead.
[2022-07-06 09:25:10,256 I 231 231] core_worker.cc:473: Shutting down a core worker.
[2022-07-06 09:25:10,257 I 231 231] core_worker.cc:490: Disconnecting a GCS client.
[2022-07-06 09:25:10,257 I 231 231] core_worker.cc:494: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-07-06 09:25:10,257 I 231 523] core_worker.cc:615: Core worker main io service stopped.
[2022-07-06 09:25:10,257 I 231 231] core_worker.cc:512: Core worker ready to be deallocated.
[2022-07-06 09:25:10,257 I 231 231] core_worker_process.cc:293: Removed worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff
[2022-07-06 09:25:10,257 I 231 231] core_worker.cc:464: Core worker is destructed
[2022-07-06 09:25:10,509 I 231 231] core_worker_process.cc:153: Destructing CoreWorkerProcessImpl. pid: 231
[2022-07-06 09:25:10,509 I 231 231] io_service_pool.cc:47: IOServicePool is stopped.

And there is no runtime_env log.

I think that when using k8s with Docker, there is no need to set runtime_env; it is an anti-pattern.

@architkulkarni, @Chen_Shen, @GuyangSong, @yic, @Dmitri. Hi, the problem is solved. Here is the solution. The root cause was that my project's code (modules) was not on the PYTHONPATH. I added the following line to my Dockerfile; /home/me/myproject/src contains the components module of my project (see this post: Ray k8s cluster, cannot run new task when previous task failed - #47 by GoingMyWay).

ENV PYTHONPATH "${PYTHONPATH}:/home/me/myproject/src"
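To confirm at runtime that the Dockerfile ENV line took effect, a quick sanity check can be run with `kubectl exec` inside any pod. This is a minimal sketch; the `/home/me/myproject/src` path and the `components` module name come from my project, so substitute your own:

```python
import importlib.util
import sys

def check_import(module_name: str) -> bool:
    """Return True if `module_name` is importable with the current sys.path."""
    return importlib.util.find_spec(module_name) is not None

# The PYTHONPATH entry set in the Dockerfile should show up in sys.path
# (hypothetical path from my Dockerfile; substitute your project root):
print("/home/me/myproject/src" in sys.path)
print(check_import("components"))
```

Running this on every worker pod (not just the head) is essentially the check @architkulkarni suggested earlier in the thread: if either line prints False on some node, imports will fail there.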

In my code, I simply initialize Ray without setting runtime_env:

ray.init("auto", ignore_reinit_error=True, include_dashboard=False)

Run the project in the head node pod:

$ python main.py 

After a few minutes, the job was still running, so I killed it with Ctrl+C. Then I ran the above command again to resubmit the job. The cluster can be reused and it works.

For a Ray cluster on k8s with Docker, since all the code has already been baked into the image on every node, there is no need to set runtime_env. The key issue is that, although the code is present on all nodes and Ray can run it on the first submission, the second run fails because the right PYTHONPATH is missing (thanks @architkulkarni for pointing this out: Ray k8s cluster, cannot run new task when previous task failed - #51 by architkulkarni).
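The mechanism behind the fix can be reproduced without a cluster: Ray workers are child Python processes, and any directory listed in the inherited PYTHONPATH is prepended to their sys.path. A minimal sketch (the /home/me/myproject/src path is just the example from my Dockerfile and need not exist):

```python
import os
import subprocess
import sys

# Simulate what the Dockerfile ENV line does: a child Python process
# (like a Ray worker) inherits PYTHONPATH and adds it to sys.path.
env = dict(os.environ)
env["PYTHONPATH"] = "/home/me/myproject/src"  # hypothetical project root

out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=env, capture_output=True, text=True, check=True,
).stdout
print("/home/me/myproject/src" in out)
```

This is also why the first run can succeed while the second fails: a runtime_env working_dir only covers the job that set it, whereas a PYTHONPATH baked into the image covers every worker on every run.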

Thanks for the help over the past ~30 days.