Hi, I launched a Ray k8s cluster and was debugging my code on it. However, when my task failed and I restarted it, the cluster could no longer be used and I had to launch a new one. The reported errors are:
2022-06-10 12:41:05,238 WARNING worker.py:1326 -- Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 602, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 624, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 456, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RaySystemError: System error: No module named 'components'
traceback: Traceback (most recent call last):
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/serialization.py", line 309, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/serialization.py", line 215, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata_fields)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/serialization.py", line 174, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/serialization.py", line 164, in _deserialize_pickle5_data
obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'components'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 774, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 595, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 715, in ray._raylet.execute_task
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/function_manager.py", line 544, in temporary_actor_method
raise RuntimeError(
RuntimeError: The actor with name ReplayBuffer failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:
Traceback (most recent call last):
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/function_manager.py", line 603, in _load_actor_class_from_gcs
actor_class = pickle.loads(pickled_class)
ModuleNotFoundError: No module named 'components'
Any suggestions for solving this problem?
UPDATE: 26 June 2022, problem unsolved. Changed category from Ray Core to Ray Clusters and Kubernetes.
By the way, this error frustrates me because if my task fails, I need to launch a new cluster, which costs a lot of time. Could you make this easier to use, since there is an auto option in ray.init?
@GoingMyWay I read your question again and it feels very weird:
However, when my task failed and I restarted it, the cluster could no longer be used and I had to launch a new one.
Did you restart your job in the same directory? The working dir of the driver is added automatically.
By the way, this error frustrates me because if my task fails, I need to launch a new cluster, which will cost a lot of time.
I don’t think you need to relaunch the cluster to rerun the job either way. For example, if ray.init with the working dir works, it should be fine to just rerun the task.
If this still happens, do you mind trying to get a minimal reproducible script and submitting an issue for this?
Did you restart your job in the same directory? The working dir of the driver is added automatically.
Yes. I restarted my job in the same directory.
If this still happens, do you mind trying to get a minimal reproducible script and submitting an issue for this?
I think the reason may be that the Ray cluster operator was created by the cluster administrator. I can give you the ray_cluster.yaml file that was used to create a new k8s Ray cluster. I will try my best to put together reproducible code, but I think the code is nothing special. I use ray.init("auto") to initialize Ray.
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  # The maximum number of worker nodes to launch in addition to the head node.
  maxWorkers: 100
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then the autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: rayHead
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
  - name: rayHead
    minWorkers: 0
    maxWorkers: 0
    rayResources: {}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-head-
      spec:
        imagePullSecrets:
        - name: gitlab-cr-pull-secret
        - name: regcred
        priorityClassName: high
        restartPolicy: Never
        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which will cause slowdowns if it is not a shared-memory volume.
        volumes:
        - name: workspace-vol
          hostPath:
            path: /mnt/home/%USER/Projects/work_dir
            type: Directory
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: "the.image:tag"
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          env:
          - name: RAY_gcs_server_rpc_server_thread_num
            value: "1"
          ports:
          - containerPort: 6379  # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
          - containerPort: 10001 # Used by Ray Client
          - containerPort: 8265  # Used by Ray Dashboard
          - containerPort: 8000  # Used by Ray Serve
          volumeMounts:
          - name: workspace-vol
            mountPath: /home/me/app/
            readOnly: false
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 10
              memory: 100Gi
              nvidia.com/gpu: 1
            limits:
              cpu: 10
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 50Gi
              nvidia.com/gpu: 1
        nodeSelector: {}
        tolerations: []
  - name: rayWorker
    minWorkers: 2
    maxWorkers: 2
    rayResources: {}
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-worker-
      spec:
        imagePullSecrets:
        - name: gitlab-cr-pull-secret
        - name: regcred
        priorityClassName: high
        restartPolicy: Never
        # This volume allocates shared memory for Ray to use for its plasma
        # object store. If you do not provide this, Ray will fall back to
        # /tmp, which will cause slowdowns if it is not a shared-memory volume.
        volumes:
        - name: workspace-vol
          hostPath:
            path: /mnt/home/%USER/Projects/work_dir
            type: Directory
        - name: dshm
          emptyDir:
            medium: Memory
        containers:
        - name: ray-node
          imagePullPolicy: Always
          image: "the.image:tag"
          # Do not change this command - it keeps the pod alive until it is
          # explicitly killed.
          command: ["/bin/bash", "-c", "--"]
          args: ["trap : TERM INT; sleep infinity & wait;"]
          env:
          - name: RAY_gcs_server_rpc_server_thread_num
            value: "1"
          ports:
          - containerPort: 6379  # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
          - containerPort: 10001 # Used by Ray Client
          - containerPort: 8265  # Used by Ray Dashboard
          - containerPort: 8000  # Used by Ray Serve
          volumeMounts:
          - name: workspace-vol
            mountPath: /home/me/app/
            readOnly: false
          - mountPath: /dev/shm
            name: dshm
          resources:
            requests:
              cpu: 33
              memory: 50Gi
              nvidia.com/gpu: 0
            limits:
              cpu: 33
              # The maximum memory that this pod is allowed to use. The
              # limit will be detected by ray and split to use 10% for
              # redis, 30% for the shared memory object store, and the
              # rest for application memory. If this limit is not set and
              # the object store size is not set manually, ray will
              # allocate a very large object store in each pod that may
              # cause problems for other pods.
              memory: 100Gi
              nvidia.com/gpu: 0
        nodeSelector: {}
        tolerations: []
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
Thanks for the details here. So the cluster is started by the admin, and you log in to one of the workers and call ray.init, right?
I notice the actual error is
ModuleNotFoundError: No module named 'components'
Can you verify that you have this module in all worker nodes? Also, if it’s a local module, can you try py_modules in the runtime env?
One reason I can think of: the first time you ran it, the Ray Python worker started on the node where your local module is, and the next time you ran it (without restarting the cluster), it got scheduled onto another worker. The root cause might be that the two workers run in different environments, which in the end makes you see different results.
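As a sketch of the py_modules suggestion: pass the local package through the runtime env so it is uploaded and importable on every worker. The "./components" path and the "auto" address are assumptions about this project's layout, not something confirmed in the thread.

```python
# Sketch: ship the project-local 'components' package to all workers
# via runtime_env, so actors like ReplayBuffer can be unpickled there.
# "./components" is a guess at where the package lives on the driver.
runtime_env = {
    "py_modules": ["./components"],  # uploaded once, added to sys.path on each worker
}

# On the driver (requires a running cluster, so left commented here):
# import ray
# ray.init(address="auto", runtime_env=runtime_env)
```

An alternative with the same effect is `"working_dir": "."`, which uploads the whole project directory instead of a single package.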
Hi @yic, thanks for the response. Note that the admin deployed the operator and created a namespace. Then I can use kubectl -n ray_cluster apply -f ray_cluster.yaml to create a new cluster for myself. After that, I can log in to the head node and run the code.
Can you verify that you have this module in all worker nodes? Also, if it’s a local module, can you try py_modules in the runtime env?
It seems components is a private module of my project. I will try to use the runtime env and report the results to you.
However, it seems Ray cannot really exclude .git, and it reported the following error:
2022-06-15 15:24:00,301 INFO packaging.py:363 -- Creating a file package for local directory '/home/me/app'.
2022-06-15 15:24:00,656 WARNING packaging.py:259 -- File /home/me/app/.git/objects/pack/pack-363f95fdf8dc7f3144d8a4daa0695d4dd75ef07e.pack is very large (42.68MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/objects/pack/pack-363f95fdf8dc7f3144d8a4daa0695d4dd75ef07e.pack']})`
2022-06-15 15:24:01,745 WARNING packaging.py:259 -- File /home/me/app/.git/modules/third_party/ray/objects/pack/pack-de70ab7af10a6927b56eed9da619bcaad23c7814.pack is very large (158.96MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/modules/third_party/ray/objects/pack/pack-de70ab7af10a6927b56eed9da619bcaad23c7814.pack']})`
2022-06-15 15:24:02,050 WARNING packaging.py:259 -- File /home/me/app/.git/modules/third_party/meltingpot/objects/pack/pack-1c8ed26605bd47ade6c6d14b4311af921bbb6255.pack is very large (190.81MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/modules/third_party/meltingpot/objects/pack/pack-1c8ed26605bd47ade6c6d14b4311af921bbb6255.pack']})`
[ERROR 15:24:11] pymarl Failed after 0:00:21!
Traceback (most recent calls WITHOUT Sacred internals):
File "/home/me/app/epymarl/src/main.py", line 65, in my_main
run_train_meltingpot(_run, config, _log)
File "/home/me/app/epymarl/src/run_meltingpot.py", line 60, in run
run_sequential(args=args, logger=logger)
File "/home/me/app/epymarl/src/run_meltingpot.py", line 104, in run_sequential
ray.init("auto", runtime_env=runtime_env)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 977, in init
connect(
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1517, in connect
runtime_env = upload_working_dir_if_needed(
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/working_dir.py", line 64, in upload_working_dir_if_needed
upload_package_if_needed(
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 411, in upload_package_if_needed
upload_package_to_gcs(pkg_uri, package_file.read_bytes())
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 343, in upload_package_to_gcs
_store_package_in_gcs(pkg_uri, pkg_bytes)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 218, in _store_package_in_gcs
raise RuntimeError(
RuntimeError: Package size (532.13MiB) exceeds the maximum size of 100.00MiB. You can exclude large files using the 'excludes' option to the runtime_env.
You can see that excludes does not seem to work. I tried adding that big file to excludes, and it still reported the same error.
Also, if you think we should modify the default behavior, please feel free to leave a comment here or in the issue. It seems we can’t support both absolute paths and gitignore syntax, because they have conflicting meanings for paths that start with /. So we need to pick a reasonable default, or find a compromise somehow…
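Given that behavior, a sketch of how the excludes might be written with gitignore-style relative patterns instead of the absolute paths from the warning messages (the working_dir value is an assumption based on the paths in this thread):

```python
# 'excludes' entries are matched with gitignore-style syntax relative to
# 'working_dir'. A relative ".git" pattern therefore drops the whole
# directory (including the large pack files), whereas an absolute path
# like "/home/me/app/.git/..." would be anchored at working_dir's root
# and match nothing -- which would explain the behavior seen above.
runtime_env = {
    "working_dir": "/home/me/app",  # assumption: the project root
    "excludes": [".git"],           # skips all .git contents at once
}

# import ray
# ray.init(address="auto", runtime_env=runtime_env)
```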
Hi @architkulkarni, thanks. I think the runtime env is not what I want. I use Docker and created the cluster with it, so the environment is already defined by the Docker image. The runtime env seems to be an environment-setup mechanism; I would need to set many things to complete the runtime env setup, which contradicts what I have already done with Docker. Following your suggestion, it returns the following error:
(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please
run `pip install "ray[default]"`.
[ERROR 18:32:18] pymarl Failed after 0:00:02!
Traceback (most recent calls WITHOUT Sacred internals):
File "/home/me/app/epymarl/src/main.py", line 65, in my_main
run_train_meltingpot(_run, config, _log)
File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
run_sequential(args=args, logger=logger)
File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
File "/home/me/app/epymarl/src/run_meltingpot.py", line 423, in create_buffer
assert ray.get(buffer.ready.remote())
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.
I think it was asking me to install the requirements, but I don’t think I need to, since I am using Docker.
What I want to figure out is why a created k8s Ray pod cluster cannot be reused.
The following shows how I use the k8s ray pod cluster:
1. The admin created a new chart and a Ray operator.
2. I use the YAML file to create a Ray pod cluster.
3. I log in to the head node and run the code (it works fine for debugging purposes).
4. I kill the current program and then re-run the code. However, the cluster cannot be reused.
5. I have to create a new cluster to run my job, which costs more time and patience.
If you are already using Docker, it may be faster to bake all your dependencies into the Docker image. If you want to set up dependencies dynamically at runtime, you can use runtime_env. If you use them together, my guess is that, due to the order of operations, the runtime_env specifications will override the ones in the Docker container.
I saw this: (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please run `pip install "ray[default]"`.
Is ray[default] installed on all nodes of the cluster?
If that still doesn’t work, to understand the RuntimeEnvSetupError, do you mind pasting the dashboard_agent.log file and sharing what Ray version you’re using? By default these logs are located at /tmp/ray/session_latest/logs on the head node of the cluster.
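As a quick sketch of how to check this on each node: verify that the extras pulled in by ray[default] are importable. The module names below are an assumption about what the dashboard/runtime_env agent needs; `pip install "ray[default]"` remains the authoritative fix.

```python
# Rough per-node check for extras that ray[default] is expected to install.
# Run this inside each head/worker container; an empty list suggests the
# agent's dependencies are present. Module names are a partial guess.
import importlib.util

candidates = ["aiohttp", "aiohttp_cors", "opencensus"]
missing = [m for m in candidates if importlib.util.find_spec(m) is None]
print("possibly missing extras:", missing)
```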
Hi @architkulkarni, here is the output of dashboard_agent.log. BTW, why can’t I reuse the cluster? Do you have any best practices? Trial and error really takes time; I think there must be a smarter way to solve this problem.
2022-06-18 23:43:06,723 INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:06,724 INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:06,725 INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:06,998 INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:06,998 INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:06,998 INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:07,000 INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:07,001 ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
loop.run_until_complete(agent.run())
File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
modules = self._load_modules()
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
c = cls(self)
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
self._metrics_agent = MetricsAgent(
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
prometheus_exporter.new_stats_exporter(
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
exporter = PrometheusStatsExporter(
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
self.serve_http()
File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
start_http_server(
File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
TmpServer.address_family, addr = _get_best_family(addr, port)
File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
infos = socket.getaddrinfo(address, port)
File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
[The same ReporterAgent startup traceback ending in socket.gaierror: [Errno -2] Name or service not known repeats several more times as the dashboard agent keeps restarting; repeats omitted.]
Hi @GoingMyWay, thanks for pasting the log. Sorry for the frustration with the trial and error; hopefully we can get it working soon. You should be able to reuse the cluster once we figure out this problem, but my guess is that for this particular kind of failure the cluster unfortunately needs to be restarted.
I haven’t seen socket.gaierror: [Errno -2] Name or service not known before and I’m not sure how to debug it – it looks like it might be some kind of failure of cluster nodes to communicate with each other over the network. @sangcho or @GuyangSong have you seen this before or do you have any ideas on how to debug it?