Ray k8s cluster, cannot run new task when previous task failed

Hi @yic, to be more clear: I log in to the head node, go to the project directory, and then run the code:

ssh head-node
cd /path/to/the/project
python main.py 

/path/to/the/project contains the code, and k8s mounts my code into these pods.

Hi @yic, I tried to create the runtime env like this:

    runtime_env = {
        "working_dir": "/home/me/app",
        "excludes": [
            "/home/me/app/.git/",
            "/home/me/app/epymarl/data/",
            "/home/me/app/ray_results/",
            "/home/me/app/third_party/",
        ],
    }
    ray.init("auto", runtime_env=runtime_env)

However, it seems Ray cannot actually exclude .git, and it reported the following error:

2022-06-15 15:24:00,301 INFO packaging.py:363 -- Creating a file package for local directory '/home/me/app'.
2022-06-15 15:24:00,656 WARNING packaging.py:259 -- File /home/me/app/.git/objects/pack/pack-363f95fdf8dc7f3144d8a4daa0695d4dd75ef07e.pack is very large (42.68MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/objects/pack/pack-363f95fdf8dc7f3144d8a4daa0695d4dd75ef07e.pack']})`
2022-06-15 15:24:01,745 WARNING packaging.py:259 -- File /home/me/app/.git/modules/third_party/ray/objects/pack/pack-de70ab7af10a6927b56eed9da619bcaad23c7814.pack is very large (158.96MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/modules/third_party/ray/objects/pack/pack-de70ab7af10a6927b56eed9da619bcaad23c7814.pack']})`
2022-06-15 15:24:02,050 WARNING packaging.py:259 -- File /home/me/app/.git/modules/third_party/meltingpot/objects/pack/pack-1c8ed26605bd47ade6c6d14b4311af921bbb6255.pack is very large (190.81MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/modules/third_party/meltingpot/objects/pack/pack-1c8ed26605bd47ade6c6d14b4311af921bbb6255.pack']})`
[ERROR 15:24:11] pymarl Failed after 0:00:21!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 65, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 60, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 104, in run_sequential
    ray.init("auto", runtime_env=runtime_env)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 977, in init
    connect(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1517, in connect
    runtime_env = upload_working_dir_if_needed(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/working_dir.py", line 64, in upload_working_dir_if_needed
    upload_package_if_needed(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 411, in upload_package_if_needed
    upload_package_to_gcs(pkg_uri, package_file.read_bytes())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 343, in upload_package_to_gcs
    _store_package_in_gcs(pkg_uri, pkg_bytes)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 218, in _store_package_in_gcs
    raise RuntimeError(
RuntimeError: Package size (532.13MiB) exceeds the maximum size of 100.00MiB. You can exclude large files using the 'excludes' option to the runtime_env.

As you can see, excludes does not seem to work.

I also tried adding that big file to excludes, and it still reported the same error, so excludes really does not seem to take effect.

@architkulkarni could you take a look at this?

Oh, I think it’s related to this issue: [runtime env] `zip_directory` `excludes` parameter doesn't work with absolute paths · Issue #23473 · ray-project/ray · GitHub. I still need to follow up there. @GoingMyWay, can you try writing your excludes relative to the working_dir? So it would just be `['/.git/', '/epymarl/data', ...]`.
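In other words, something like this (just a sketch, keeping your working_dir as /home/me/app and writing the excludes as gitignore-style patterns relative to it):

    import ray

    runtime_env = {
        "working_dir": "/home/me/app",
        # gitignore-style patterns, interpreted relative to working_dir
        "excludes": [
            "/.git/",
            "/epymarl/data/",
            "/ray_results/",
            "/third_party/",
        ],
    }
    ray.init("auto", runtime_env=runtime_env)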

Also, if you think we should modify the default behavior, please feel free to leave a comment here or in the issue. It seems like we can’t support both absolute paths and also support gitignore syntax, because they have conflicting meanings for paths that start with /. So we need to pick a reasonable default, or find a compromise somehow…

Hi @architkulkarni, thanks. I think the runtime env is not what I want. I use Docker and created the cluster from a Docker image, so the environment already exists. The runtime env looks like a mechanism for setting up the environment dynamically, and I would need to configure many things to complete that setup, which duplicates what I have already done with Docker. Following your suggestion, it returns the following error:

(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please 
run `pip install "ray[default]"`.
[ERROR 18:32:18] pymarl Failed after 0:00:02!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 65, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
    buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 423, in create_buffer
    assert ray.get(buffer.ready.remote())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

I think it was asking me to install the runtime_env requirements, but I should not need to do that since I am using Docker.

What I really want to figure out is why a k8s Ray pod cluster, once created, cannot be reused.

The following shows how I use the k8s ray pod cluster:

1. The admin created a new chart and deployed a Ray operator.

2. I use a YAML file to create a Ray pod cluster.

3. I log in to the head node and run the code (this works fine for debugging purposes).

4. I kill the current program and then re-run the code; however, the cluster cannot be reused.

5. I have to create a new cluster to run my job, which costs extra time and patience.

If you are already using Docker, it may be faster to bake all your dependencies into the Docker image. If you want to set up the dependencies dynamically at runtime, you can use runtime_env. If you use them together, my guess is that, due to the order of operations, the runtime_env specifications will override the ones in the Docker container.

I saw this: (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please
run `pip install "ray[default]"`

Is ray[default] installed on all nodes of the cluster?
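One rough way to check from the head node is to run a small probe task across the cluster. This is just a sketch (not an official Ray utility); it looks for a few packages that the ray[default] extra installs but the minimal ray wheel does not, such as aiohttp:

    import ray

    ray.init("auto")

    @ray.remote
    def check_default_extras():
        import importlib.util
        import socket
        # These come with `pip install "ray[default]"` but not with the minimal
        # `ray` wheel, so anything missing suggests a minimal install on that node.
        missing = [m for m in ("aiohttp", "aiohttp_cors", "opencensus")
                   if importlib.util.find_spec(m) is None]
        return socket.gethostname(), tuple(missing)

    # Launch more probes than nodes so they are likely (not guaranteed) to land
    # on every node, then de-duplicate by host.
    results = ray.get([check_default_extras.remote() for _ in range(20)])
    for host, missing in sorted(set(results)):
        print(host, "missing:", list(missing) or "nothing")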

If that still doesn’t work, to understand the RuntimeEnvSetupError, do you mind pasting the dashboard_agent.log file and sharing what Ray version you’re using? By default these logs are located at /tmp/ray/session_latest/logs on the head node of the cluster.

Yes, I installed this dependency in the Docker image.

Hi @architkulkarni, here is the output of dashboard_agent.log. BTW, why can't I reuse the cluster? Do you have any best practices? Trial and error really takes time, and I think there must be a smarter way to solve this problem.

2022-06-18 23:43:06,723	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:06,724	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:06,725	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:06,998	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:06,998	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:06,998	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:07,000	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:07,001	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:09,752	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:09,753	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:09,754	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:10,009	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:10,009	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:10,009	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:10,011	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:10,012	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:14,746	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:14,747	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:14,748	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:15,001	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:15,002	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:15,002	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:15,003	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:15,004	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:23,747	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:23,748	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:23,749	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:23,982	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:23,983	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:23,983	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:23,985	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:23,985	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:40,661	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:40,661	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:40,663	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:40,915	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:40,915	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:40,915	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:40,917	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:40,918	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:44:13,648	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:44:13,648	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:44:13,649	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:44:13,902	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:44:13,902	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:44:13,902	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:44:13,904	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>

Hi @GoingMyWay thanks for pasting the log. Sorry for the frustration with the trial-and-error, hopefully we can get it working soon. You should be able to reuse the cluster once we figure out this problem, but my guess is for this particular kind of failure the cluster unfortunately needs to be restarted.

I haven’t seen socket.gaierror: [Errno -2] Name or service not known before and I’m not sure how to debug it – it looks like it might be some kind of failure of cluster nodes to communicate with each other over the network. @sangcho or @GuyangSong have you seen this before or do you have any ideas on how to debug it?
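For reference, the call at the bottom of each traceback can be reproduced by hand. A minimal sketch (assuming, as the traceback suggests, that the metrics agent is handed a hostname or address that DNS inside the pod cannot resolve; the port here is arbitrary):

    import socket

    # prometheus_client calls socket.getaddrinfo(address, port) before binding its
    # HTTP server; it raises socket.gaierror: [Errno -2] if the name doesn't resolve.
    host = socket.gethostname()
    print(host, socket.getaddrinfo(host, 9090))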

Dear @architkulkarni, thanks for understanding. If you need more context, please let me know.

It seems the socket error is from metrics_agent, which is not in the critical path of tasks. I don’t think it is the root cause of the task failure.

Hey @GuyangSong, is there anything I can do to help you diagnose this?

(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please 
run `pip install "ray[default]"`.

Does this error message still appear in your case?

If it appears, can you paste the command line of the raylet process from `ps -ef | grep raylet`?

By the way, you should check the node shown in the prefix of the error log (raylet, ip=172.24.56.163), not the node that main.py runs on.

Hey @GuyangSong, there is no such error now, but I still cannot reuse the cluster.

Sorry, I cannot see it. Currently the error is the same as in my previous post: Ray k8s cluster, cannot run new task when previous task failed

Do you have any idea what is wrong with it?

Have you set the runtime_env? Is the components module located in your working_dir?
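For example, a quick check on the head node would be something like this (a sketch; it assumes the working_dir is still /home/me/app and that components is a top-level package there):

    import pathlib

    # The uploaded working_dir is what the workers unpack and import from, so a
    # top-level `components` package has to live directly under it.
    working_dir = pathlib.Path("/home/me/app")
    print((working_dir / "components" / "__init__.py").exists())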

@GuyangSong, for the first run I did not set it. Then I set it and ran the code:

(pid=gcs_server) [2022-06-23 21:02:30,624 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 512e4dd976cf969e81ae8b479ad888a40cae2f8a7c89aa76a023f104 for actor 4b60b9fcc
bcd40a5601000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,633 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 7bc55c1eecaa08f9
fa80dbd901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,642 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor 5497df4a81fac901
e1be7ec401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,652 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e6aa9f3bfb8cea4d
b7d08b8401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,668 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 6e15381ff4d31a63
3c77974d01000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,684 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor bb6379f2a6cb30db
f408263901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
[INFO 21:02:30] run_meltingpot Buffer size: 600
[ERROR 21:02:30] pymarl Failed after 0:00:03!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 66, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
    buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 509, in create_buffer
    assert ray.get(buffer.ready.remote())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

(raylet) [2022-06-23 21:02:30,763 C 109 109] (raylet) dependency_manager.cc:208:  Check failed: task_entry != queued_task_requests_.end() Can't remove dependencies of tasks that are not queued.
(raylet) *** StackTrace Information ***
(raylet)     ray::SpdLogMessage::Flush()
(raylet)     ray::RayLog::~RayLog()
(raylet)     ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet)     ray::raylet::ClusterTaskManager::PoppedWorkerHandler()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     boost::asio::detail::wait_handler<>::do_complete()
(raylet)     boost::asio::detail::scheduler::do_run_one()
(raylet)     boost::asio::detail::scheduler::run()
(raylet)     boost::asio::io_context::run()
(raylet)     main
(raylet)     __libc_start_main
(raylet) 
(pid=gcs_server) [2022-06-23 21:02:30,709 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e23a3d4127997687
6bdce53201000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,734 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 167498de1e2bf62a2035943e1f85515f74c77677c92fdddc217ae725 for actor 57c6f66434f69b96
3200f29d01000000(ReplayBufferwithQueue.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED

Then I also launched a new cluster and ran the code, and I got the same error.

components is a module in my project's code, and I use Docker to mount my code into the containers.

Hey @GuyangSong, you can see this post for more information: Ray k8s cluster, cannot run new task when previous task failed - #14 by GoingMyWay