Ray k8s cluster, cannot run new task when previous task failed

Hi @GoingMyWay, it seems like an environment issue. Do you want to add a runtime env to your job (working_dir, to be more specific)?

Otherwise, you need to make sure your k8s environment is the same as your local environment.

Hi, I will try setting the runtime env. Currently, I use ray.init("auto") to launch ray.

By the way, this error frustrates me because if my task fails, I need to launch a new cluster, which will cost a lot of time. Will you make this easy to use, since there is an auto option in ray.init?

@GoingMyWay I read your question again and it seems very weird to me:

However, when my task failed and I restarted the task again, the cluster cannot be used and I had to launch a new cluster.

Did you restart your job in the same directory? For the working dir of the driver, it’ll be added automatically.

By the way, this error frustrates me because if my task fails, I need to launch a new cluster, which will cost a lot of time.

I don’t think you need to relaunch the cluster to rerun the job in either case. For example, if ray.init with a working dir works, it should be fine to just rerun the task.

If this still happens, do you mind trying to get a minimal reproducible script and submitting an issue for this?

  1. Did you restart your job in the same directory? For the working dir of the driver, it’ll be added automatically.

Yes. I restarted my job in the same directory.

  1. If this still happens, do you mind trying to get a minimal reproducible script and submitting an issue for this?

I think the reason may be that the ray cluster operator was created by the cluster administrator. I can give you the ray_cluster.yaml file that I used to create a new k8s ray cluster. I will try my best to provide reproducible code, although I do not think there is anything unusual in the code. I use ray.init("auto") to initialize ray.

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
spec:
  # The maximum number of workers nodes to launch in addition to the head node.
  maxWorkers: 100
  # The autoscaler will scale up the cluster faster with higher upscaling speed.
  # E.g., if the task requires adding more nodes then autoscaler will gradually
  # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
  # This number should be > 0.
  upscalingSpeed: 1.0
  # If a node is idle for this many minutes, it will be removed.
  idleTimeoutMinutes: 5
  # Specify the pod type for the ray head node (as configured below).
  headPodType: rayHead
  # Specify the allowed pod types for this ray cluster and the resources they provide.
  podTypes:
    - name: rayHead
      minWorkers: 0
      maxWorkers: 0
      rayResources: {}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: ray-head-
        spec:
          imagePullSecrets:
            - name: gitlab-cr-pull-secret
            - name: regcred
          priorityClassName: high
          restartPolicy: Never
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp which cause slowdowns if is not a shared memory volume.
          volumes:
            - name: workspace-vol
              hostPath:
                path: /mnt/home/%USER/Projects/work_dir
                type: Directory 
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: "the.image:tag"
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve

              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp which cause slowdowns if is not a shared memory volume.
              volumeMounts:
                - name: workspace-vol
                  mountPath: /home/me/app/
                  readOnly: false
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 10
                  memory: 100Gi
                  nvidia.com/gpu: 1
                limits:
                  cpu: 10
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 50Gi
                  nvidia.com/gpu: 1
          nodeSelector: {}
          tolerations: []
    - name: rayWorker
      minWorkers: 2
      maxWorkers: 2
      rayResources: {}
      podConfig:
        apiVersion: v1
        kind: Pod
        metadata:
          generateName: ray-worker-
        spec:
          imagePullSecrets:
            - name: gitlab-cr-pull-secret
            - name: regcred
          priorityClassName: high
          restartPolicy: Never
          # This volume allocates shared memory for Ray to use for its plasma
          # object store. If you do not provide this, Ray will fall back to
          # /tmp which cause slowdowns if is not a shared memory volume.
          volumes:
            - name: workspace-vol
              hostPath:
                path: /mnt/home/%USER/Projects/work_dir
                type: Directory
            - name: dshm
              emptyDir:
                medium: Memory
          containers:
            - name: ray-node
              imagePullPolicy: Always
              image: "the.image:tag"
              # Do not change this command - it keeps the pod alive until it is
              # explicitly killed.
              command: ["/bin/bash", "-c", "--"]
              args: ["trap : TERM INT; sleep infinity & wait;"]
              env:
                - name: RAY_gcs_server_rpc_server_thread_num
                  value: "1"
              ports:
                - containerPort: 6379 # Redis port for Ray <= 1.10.0. GCS server port for Ray >= 1.11.0.
                - containerPort: 10001 # Used by Ray Client
                - containerPort: 8265 # Used by Ray Dashboard
                - containerPort: 8000 # Used by Ray Serve

              # This volume allocates shared memory for Ray to use for its plasma
              # object store. If you do not provide this, Ray will fall back to
              # /tmp which cause slowdowns if is not a shared memory volume.
              volumeMounts:
                - name: workspace-vol
                  mountPath: /home/me/app/
                  readOnly: false
                - mountPath: /dev/shm
                  name: dshm
              resources:
                requests:
                  cpu: 33
                  memory: 50Gi
                  nvidia.com/gpu: 0
                limits:
                  cpu: 33
                  # The maximum memory that this pod is allowed to use. The
                  # limit will be detected by ray and split to use 10% for
                  # redis, 30% for the shared memory object store, and the
                  # rest for application memory. If this limit is not set and
                  # the object store size is not set manually, ray will
                  # allocate a very large object store in each pod that may
                  # cause problems for other pods.
                  memory: 100Gi
                  nvidia.com/gpu: 0
          nodeSelector: {}
          tolerations: []
          
  # Commands to start Ray on the head node. You don't need to change this.
  # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0
  # Commands to start Ray on worker nodes. You don't need to change this.
  workerStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

I use the following command to create a cluster.

kubectl -n ray_cluster apply -f ray_cluster.yaml

Thanks for the details here. So the cluster is started by the admin, and you log in to one of the workers and call ray.init, right?

I notice the actual error is

ModuleNotFoundError: No module named 'components'                                                                              

Can you verify that you have this module in all worker nodes? Also, if it’s a local module, can you try py_modules in the runtime env?
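For reference, a minimal sketch of the py_modules suggestion. The location of the components package (/home/me/app/epymarl/src/components) is a guess based on paths that appear elsewhere in this thread, and build_runtime_env is a hypothetical helper name. The ray.init call is left as a comment because it needs a live cluster.

```python
def build_runtime_env(module_dir):
    """Return a runtime_env dict that ships one local package to the workers."""
    # py_modules takes a list of local package directories; Ray uploads each
    # one so that it can be imported by name on the workers.
    return {"py_modules": [module_dir]}


# Assumed location of the private `components` package; adjust to the real path.
runtime_env = build_runtime_env("/home/me/app/epymarl/src/components")
print(runtime_env)

# On the cluster, the dict would be passed when connecting, for example:
#   import ray
#   ray.init("auto", runtime_env=runtime_env)
```

Shipping only the package directory keeps the upload small compared with sending the whole project as working_dir.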

One possible explanation is that the first time you ran it, the Ray Python worker started on the node where your local module is, and the next time you ran it (without restarting the cluster), the task got scheduled on another worker. The root cause might be that the two workers run in different environments, which is why you see different results.
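One way to test this hypothesis is to ask each node whether it can find the module. The sketch below is an illustration, not an official diagnostic: probe_module and probe_cluster are hypothetical names, and because Ray decides where tasks run, launching several tasks per node is only a best-effort way to reach every node.

```python
import importlib.util
import socket


def probe_module(name="components"):
    """Report whether a top-level module can be found on the current host."""
    found = importlib.util.find_spec(name) is not None
    return {"host": socket.gethostname(), "module": name, "found": found}


def probe_cluster(name="components", tasks_per_node=4):
    """Run probe_module in Ray tasks and return one result per host reached."""
    import ray  # imported lazily so probe_module also works without Ray

    ray.init("auto")
    alive = [node for node in ray.nodes() if node["Alive"]]
    remote_probe = ray.remote(probe_module)
    # Ray picks the node for each task, so this may not cover every node.
    refs = [remote_probe.remote(name) for _ in range(tasks_per_node * len(alive))]
    return list({result["host"]: result for result in ray.get(refs)}.values())


# Local check that needs no cluster.
print(probe_module("json"))
```

On the head node, calling probe_cluster() and looking for entries with "found": False would show which hosts are missing the module.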

Hi @yic, thanks for the response. Note that the admin deployed the cluster and created a namespace. Then, I can use kubectl -n ray_cluster apply -f ray_cluster.yaml to create a new cluster for me to use. After that, I can log in to the head node and run the code.

Can you verify that you have this module in all worker nodes? Also, if it’s a local module, can you try py_modules in the runtime env?

It seems that components is a private module of my project. I will try the runtime env and report the results to you.

Hi, @yic, to be more clear. I log in to the head node and then go to the project directory, and then run the code.

ssh head-node
cd /path/to/the/project
python main.py 

/path/to/the/project contains the code, and k8s mounts my code to these pods.

Hi, @yic, I tried to create a runtime env like this:

runtime_env = {
    "working_dir": "/home/me/app",
    "excludes": [
        "/home/me/app/.git/",
        "/home/me/app/epymarl/data/",
        "/home/me/app/ray_results/",
        "/home/me/app/third_party/",
    ],
}
ray.init("auto", runtime_env=runtime_env)

However, it seems Ray did not actually exclude .git, and it reported the following error:

2022-06-15 15:24:00,301 INFO packaging.py:363 -- Creating a file package for local directory '/home/me/app'.
2022-06-15 15:24:00,656 WARNING packaging.py:259 -- File /home/me/app/.git/objects/pack/pack-363f95fdf8dc7f3144d8a4daa0695d4dd75ef07e.pack is very large (42.68MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/objects/pack/pack-363f95fdf8dc7f3144d8a4daa0695d4dd75ef07e.pack']})`
2022-06-15 15:24:01,745 WARNING packaging.py:259 -- File /home/me/app/.git/modules/third_party/ray/objects/pack/pack-de70ab7af10a6927b56eed9da619bcaad23c7814.pack is very large (158.96MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/modules/third_party/ray/objects/pack/pack-de70ab7af10a6927b56eed9da619bcaad23c7814.pack']})`
2022-06-15 15:24:02,050 WARNING packaging.py:259 -- File /home/me/app/.git/modules/third_party/meltingpot/objects/pack/pack-1c8ed26605bd47ade6c6d14b4311af921bbb6255.pack is very large (190.81MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/modules/third_party/meltingpot/objects/pack/pack-1c8ed26605bd47ade6c6d14b4311af921bbb6255.pack']})`
[ERROR 15:24:11] pymarl Failed after 0:00:21!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 65, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 60, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 104, in run_sequential
    ray.init("auto", runtime_env=runtime_env)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 977, in init
    connect(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1517, in connect
    runtime_env = upload_working_dir_if_needed(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/working_dir.py", line 64, in upload_working_dir_if_needed
    upload_package_if_needed(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 411, in upload_package_if_needed
    upload_package_to_gcs(pkg_uri, package_file.read_bytes())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 343, in upload_package_to_gcs
    _store_package_in_gcs(pkg_uri, pkg_bytes)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 218, in _store_package_in_gcs
    raise RuntimeError(
RuntimeError: Package size (532.13MiB) exceeds the maximum size of 100.00MiB. You can exclude large files using the 'excludes' option to the runtime_env.

As you can see, excludes does not seem to work.

I tried adding that big file to excludes, and it still reported the same error. It seems excludes does not work.

@architkulkarni could you take a look at this?

Oh, I think it’s related to this issue: [runtime env] `zip_directory` `excludes` parameter doesn't work with absolute paths · Issue #23473 · ray-project/ray · GitHub. I still need to follow up there. @GoingMyWay, can you try writing your excludes relative to the working_dir? So it would just be [/.git/, /epymarl/data, ...].
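A small sketch of that conversion, assuming the leading-slash, gitignore-style form suggested above. to_relative_excludes is a hypothetical helper, not part of Ray; the paths are the ones from the earlier post.

```python
from pathlib import PurePosixPath


def to_relative_excludes(working_dir, absolute_excludes):
    """Rewrite absolute exclude paths as patterns anchored at working_dir."""
    root = PurePosixPath(working_dir)
    patterns = []
    for path in absolute_excludes:
        # relative_to raises ValueError if the path is outside working_dir.
        relative = PurePosixPath(path).relative_to(root)
        pattern = "/" + relative.as_posix()
        if path.endswith("/"):
            pattern += "/"  # keep the trailing slash that marks a directory
        patterns.append(pattern)
    return patterns


working_dir = "/home/me/app"
runtime_env = {
    "working_dir": working_dir,
    "excludes": to_relative_excludes(
        working_dir,
        [
            "/home/me/app/.git/",
            "/home/me/app/epymarl/data/",
            "/home/me/app/ray_results/",
            "/home/me/app/third_party/",
        ],
    ),
}
print(runtime_env["excludes"])
```

The resulting dict can be passed to ray.init("auto", runtime_env=runtime_env) as before.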

Also, if you think we should modify the default behavior, please feel free to leave a comment here or in the issue. It seems like we can’t support both absolute paths and also support gitignore syntax, because they have conflicting meanings for paths that start with /. So we need to pick a reasonable default, or find a compromise somehow…

Hi, @architkulkarni, thanks. I think the runtime env is not what I want. I built the cluster from a docker image, so the environment is already defined by that image. The runtime env seems to be a separate environment setup mechanism, and I would need to configure many things to complete it, which duplicates what I have already done with docker. Following your suggestion, I now get the following error:

(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please run `pip install "ray[default]"`.
[ERROR 18:32:18] pymarl Failed after 0:00:02!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 65, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
    buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 423, in create_buffer
    assert ray.get(buffer.ready.remote())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

I think it was asking me to set the requirements. I think I do not need to set it as I am using docker.

What I want to figure out is why an already created k8s ray pod cluster cannot be reused.

The following shows how I use the k8s ray pod cluster:

1. The admin created a new chart and created a ray operator

2. I use the YAML file to create a ray pod cluster

3. I log in to the head node and run the code (it works fine for debugging purpose)

4. I kill the current programme and then re-run the code. However, the cluster cannot be reused

5. I have to create a new cluster and run my job, which costs more time and patience.

If you are already using docker, it may be faster to bake in all your dependencies in the Docker image. If you want to set up the dependencies dynamically at runtime, you can use runtime_env. If you use them together, my guess is that due to the order of operations, the runtime_env specifications will override the ones in the Docker container.

I saw this: (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please run pip install "ray[default]"

Is ray[default] installed on all nodes of the cluster?

If that still doesn’t work, to understand the RuntimeEnvSetupError, do you mind pasting the dashboard_agent.log file and sharing what Ray version you’re using? By default these logs are located at /tmp/ray/session_latest/logs on the head node of the cluster.
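To keep the pasted log short, something like the following sketch could pull out only the error lines. The path is the default location mentioned above; error_lines is a hypothetical helper, and matching on the string "ERROR" is an assumption about the log format.

```python
from pathlib import Path

# Default log directory mentioned above, plus the dashboard agent log file.
LOG_PATH = Path("/tmp/ray/session_latest/logs/dashboard_agent.log")


def error_lines(log_path, marker="ERROR"):
    """Return the lines of a log file that contain the marker, without newlines."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if marker in line]


for line in error_lines(LOG_PATH):
    print(line)
```

The installed Ray version can be printed with python -c "import ray; print(ray.__version__)".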

Yes. In the docker image, I installed this dependency.

Hi, @architkulkarni, here is the output of dashboard_agent.log. BTW, why can’t I reuse the cluster? Do you have any best practices? Trial and error really takes time. I assume there are smarter ways to solve this problem.

2022-06-18 23:43:06,723	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:06,724	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:06,725	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:06,998	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:06,998	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:06,998	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:07,000	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:07,001	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:09,752	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:09,753	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:09,754	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:10,009	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:10,009	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:10,009	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:10,011	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:10,012	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:14,746	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:14,747	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:14,748	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:15,001	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:15,002	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:15,002	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:15,003	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:15,004	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:23,747	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:23,748	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:23,749	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:23,982	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:23,983	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:23,983	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:23,985	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:23,985	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:40,661	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:40,661	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:40,663	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:40,915	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:40,915	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:40,915	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:40,917	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:40,918	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:44:13,648	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:44:13,648	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:44:13,649	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:44:13,902	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:44:13,902	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:44:13,902	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:44:13,904	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>

Hi @GoingMyWay thanks for pasting the log. Sorry for the frustration with the trial-and-error, hopefully we can get it working soon. You should be able to reuse the cluster once we figure out this problem, but my guess is for this particular kind of failure the cluster unfortunately needs to be restarted.

I haven’t seen socket.gaierror: [Errno -2] Name or service not known before and I’m not sure how to debug it – it looks like it might be some kind of failure of cluster nodes to communicate with each other over the network. @sangcho or @GuyangSong have you seen this before or do you have any ideas on how to debug it?
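The traceback ends in socket.getaddrinfo, so one thing that can be checked directly on a pod is whether name resolution works there. The log does not show which address was passed to getaddrinfo, so testing the pod's own hostname below is an assumption; can_resolve is a hypothetical helper.

```python
import socket


def can_resolve(host, port=0):
    """Return True if socket.getaddrinfo can resolve the host, else False."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False


# Check the loopback address and this machine's own hostname.
for name in ("127.0.0.1", socket.gethostname()):
    print(name, "resolves" if can_resolve(name) else "does NOT resolve")
```

If the hostname does not resolve inside the pod, that would be consistent with the "Name or service not known" error, although it would not prove that it is the cause.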

Dear @architkulkarni, thanks for the understanding. If you need more context, please let me know.

It seems the socket error comes from metrics_agent, which is not in the critical path of tasks. I don’t think it is the root cause of the task failure.

Hey @GuyangSong, is there anything I can do to help you diagnose this?