Warnings with TuneSearchCV

(pid=29935) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.

2021-05-04 12:10:16,076 WARNING worker.py:1115 -- This worker was asked to execute a function that it does not have registered. You may have to restart Ray.

(pid=29923) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.

(pid=29944) 2021-05-04 12:10:46,840 INFO logger.py:688 -- Removed the following hyperparameter values when logging to tensorboard: {'class_weight': None}

(pid=29944) 2021-05-04 12:10:46,854 WARNING util.py:161 -- Processing trial results took 0.741 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.

2021-05-04 12:10:58,557 WARNING worker.py:1115 -- The actor or task with ID c7ef5cdfe30c4545ffffffffffffffffffffffff01000000 cannot be scheduled right now. It requires {CPU_group_0_5974b84dbdc7cf3e0649b10f79418307: 0.010100}, {CPU_group_5974b84dbdc7cf3e0649b10f79418307: 0.010100} for placement, but this node only has remaining {12.480000/24.000000 CPU, 163.135506 GiB/163.135506 GiB memory, 73.906599 GiB/73.906599 GiB object_store_memory, 0.000000/0.480000 CPU_group_0_56c7a3bea2b30e16ee87ff9d7726ea8f, 0.000000/0.480000 CPU_group_66fddc0ffbc8e04201d99f2595d8cf76, 0.000000/0.480000 CPU_group_0_77195cbdf047b2a21c2969555d181992, 0.000000/0.480000 CPU_group_0_8cec3160ef168630f4ba5febb5614704, 0.000000/0.480000 CPU_group_0_24071bb6040fba0fe7fba95836b2368c, 0.480000/0.480000 CPU_group_0_5974b84dbdc7cf3e0649b10f79418307, 0.000000/0.480000 CPU_group_a6c717f85692a87adc78e39cebadb034, 0.000000/0.480000 CPU_group_90a68ed06183f8e1ebe39cbb46f1e04c, 0.000000/0.480000 CPU_group_490a2308ad38eb314b896c6c1192af5c, 0.000000/0.480000 CPU_group_0_11dc3e47d27e53bc63ae9c1b2a521f54, 0.000000/0.480000 CPU_group_8c2b2beff9830b618f84d89ddf417252, 0.000000/0.480000 CPU_group_f22db95399fc8085fc23f597d0497a6f, 0.000000/0.480000 CPU_group_0_490a2308ad38eb314b896c6c1192af5c, 0.000000/0.480000 CPU_group_24071bb6040fba0fe7fba95836b2368c, 0.000000/0.480000 CPU_group_0_8c2b2beff9830b618f84d89ddf417252, 0.000000/0.480000 CPU_group_bda9bc6a9f8c77ce21b5613b741796f4, 0.000000/0.480000 CPU_group_87cf9529dbfd2cb8ba40fbf9ac069fd9, 0.000000/0.480000 CPU_group_0_7dc11190f27d62c6641917121d9ea70e, 0.000000/0.480000 CPU_group_41a13077937f03ec3786e423d5a222ac, 0.000000/0.480000 CPU_group_0_48548e87ab92c889f76764bbc497a782, 0.000000/0.480000 CPU_group_d3a4e8463d770240ce6c870b81ddae57, 0.000000/0.480000 CPU_group_77195cbdf047b2a21c2969555d181992, 0.000000/0.480000 CPU_group_15eb412769a23468b757411a0de4bd20, 0.000000/0.480000 CPU_group_11dc3e47d27e53bc63ae9c1b2a521f54, 0.000000/0.480000 CPU_group_0_0d5cd62ae0866f2889cbdf32129a6287, 1.000000/1.000000 node:172.20.201.40, 0.480000/0.480000 CPU_group_5974b84dbdc7cf3e0649b10f79418307, 0.000000/0.480000 CPU_group_56c7a3bea2b30e16ee87ff9d7726ea8f, 0.000000/0.480000 CPU_group_edc249257cb7d5435230ee8c53acd020, 0.000000/0.480000 CPU_group_0_87cf9529dbfd2cb8ba40fbf9ac069fd9, 0.000000/0.480000 CPU_group_0d5cd62ae0866f2889cbdf32129a6287, 0.000000/0.480000 CPU_group_0_f22db95399fc8085fc23f597d0497a6f, 0.000000/0.480000 CPU_group_0_edc249257cb7d5435230ee8c53acd020, 0.000000/0.480000 CPU_group_97a4e872eabf35ed9d7a079b6a18ef98, 0.000000/0.480000 CPU_group_0_90a68ed06183f8e1ebe39cbb46f1e04c, 0.000000/0.480000 CPU_group_bd48c7e6a3a67dabbeab324ae8f45ab0, 0.000000/0.480000 CPU_group_48548e87ab92c889f76764bbc497a782, 0.000000/0.480000 CPU_group_0_41a13077937f03ec3786e423d5a222ac, 0.000000/0.480000 CPU_group_0_15eb412769a23468b757411a0de4bd20, 0.000000/0.480000 CPU_group_0_fe1885176868d2a68b91ff1f489b612f, 0.000000/0.480000 CPU_group_0_d3a4e8463d770240ce6c870b81ddae57, 0.000000/0.480000 CPU_group_8cec3160ef168630f4ba5febb5614704, 0.000000/0.480000 CPU_group_0_97a4e872eabf35ed9d7a079b6a18ef98, 0.000000/0.480000 CPU_group_0_bda9bc6a9f8c77ce21b5613b741796f4, 0.000000/0.480000 CPU_group_0_bd48c7e6a3a67dabbeab324ae8f45ab0, 0.000000/0.480000 CPU_group_fe1885176868d2a68b91ff1f489b612f, 0.000000/0.480000 CPU_group_7dc11190f27d62c6641917121d9ea70e, 0.000000/0.480000 CPU_group_0_a6c717f85692a87adc78e39cebadb034, 0.000000/0.480000 CPU_group_0_66fddc0ffbc8e04201d99f2595d8cf76}

. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

(raylet) /home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.

(raylet)   warnings.warn(
(raylet) [2021-05-04 13:42:10,343 E 29850 29850] logging.cc:435:     @     0x555555778387 ray::raylet::ClusterTaskManager::DispatchScheduledTasksToWorkers()
(raylet) [2021-05-04 13:42:14,091 E 29850 29850] logging.cc:435:     @     0x55555577b4aa ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks()
(pid=29943) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
(raylet) [2021-05-04 13:42:17,607 E 29850 29850] logging.cc:435:     @     0x55555570845b _ZZN3ray6raylet10WorkerPool28MonitorStartingWorkerProcessERKNS_7ProcessERKNS_3rpc8LanguageENS5_10WorkerTypeEENKUlN5boost6system10error_codeEE_clESC_.isra.0
(raylet) [2021-05-04 13:42:19,370 E 29850 29850] logging.cc:435:     @     0x555555708d7f _ZN5boost4asio6detail17executor_functionINS1_7binder1IZN3ray6raylet10WorkerPool28MonitorStartingWorkerProcessERKNS4_7ProcessERKNS4_3rpc8LanguageENSA_10WorkerTypeEEUlNS_6system10error_codeEE_SG_EESaIvEE11do_completeEPNS1_22executor_function_baseEb
(raylet) [2021-05-04 13:42:24,869 E 29850 29850] logging.cc:435:     @     0x5555556a4ef0 boost::asio::io_context::executor_type::dispatch<>()
(raylet) [2021-05-04 13:42:27,147 E 29850 29850] logging.cc:435:     @     0x555555708ab8 _ZN5boost4asio6detail12wait_handlerIZN3ray6raylet10WorkerPool28MonitorStartingWorkerProcessERKNS3_7ProcessERKNS3_3rpc8LanguageENS9_10WorkerTypeEEUlNS_6system10error_codeEE_NS1_18io_object_executorINS0_8executorEEEE11do_completeEPvPNS1_19scheduler_operationERKSF_m
(raylet) terminate called after throwing an instance of 'std::system_error'
(raylet)   what():  Resource temporarily unavailable
(raylet) terminate called after throwing an instance of 'std::system_error'
(raylet)   what():  terminate called after throwing an instance of 'Resource temporarily unavailablestd::system_error
(raylet) '
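
A note on the first warning above: "Maximum number of iteration reached before convergence" appears to be a scikit-learn ConvergenceWarning from the underlying estimator, and the warning's own advice is to raise that estimator's max_iter. A minimal sketch, assuming an SGDClassifier-style model; substitute whatever estimator is actually being tuned:

from sklearn.linear_model import SGDClassifier

# Hedged sketch: give the estimator more iterations to converge before it
# hits the cap. SGDClassifier is an assumption here; its default max_iter
# is 1000.
model = SGDClassifier(max_iter=5000)

Note that this is separate from the max_iters argument of TuneSearchCV below, which relates to Tune's early stopping rather than to the estimator's solver.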

I'm not sure if TuneSearchCV is copying the dataset every time. Is it worth switching to tune.run with parameters? Here's my implementation -

from tune_sklearn import TuneSearchCV
from sklearn.model_selection import StratifiedKFold

clf = TuneSearchCV(
    model,
    param_distributions=config,
    n_trials=500,            # number of parameter settings sampled
    early_stopping=False,
    max_iters=1,
    search_optimization="bayesian",
    n_jobs=50,               # number of trials run in parallel
    refit=True,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    verbose=0,
    # loggers="tensorboard",
    random_state=42,
    local_dir="./ray_results",
)
clf.fit(X_train, Y_train)
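
On the tune.run question, here is a minimal sketch of the alternative, assuming a scikit-learn-style setup. tune.with_parameters puts X_train/Y_train into the Ray object store once, so each trial receives a reference to the same data rather than a fresh copy. build_model is a hypothetical helper that constructs an estimator from a sampled config:

from ray import tune
from sklearn.model_selection import StratifiedKFold, cross_val_score

def train_model(config, X_train=None, Y_train=None):
    model = build_model(config)  # hypothetical: estimator built from sampled params
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    score = cross_val_score(model, X_train, Y_train, cv=cv).mean()
    tune.report(mean_cv_score=score)  # hand the metric back to Tune

analysis = tune.run(
    tune.with_parameters(train_model, X_train=X_train, Y_train=Y_train),
    config=config,       # search space (may need converting to tune.* sampling primitives)
    num_samples=500,     # counterpart of n_trials above
    local_dir="./ray_results",
)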

Just FYI that I’m using slurm to submit jobs in our cluster. Here’s the resources requested -

Nodes - 1
Tasks - 1
cpus - 5
Memory - 150G
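
As a side note, n_jobs=50 in the snippet above is well beyond the 5 CPUs requested here. If Ray is started implicitly inside the Slurm job, a hedged sketch of pinning it to the allocation (SLURM_CPUS_PER_TASK is the variable Slurm sets for --cpus-per-task):

import os
import ray

# Hedged sketch: cap Ray at the CPUs the Slurm allocation actually grants,
# so trials don't oversubscribe the node. The fallback of "1" is an
# assumption for runs outside of Slurm.
ray.init(num_cpus=int(os.environ.get("SLURM_CPUS_PER_TASK", "1")))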

Please help!
Thanks in advance!

It died with the following error -

2021-05-05 04:06:36,411 WARNING worker.py:1115 -- The node with node id: bd1fdd835700b5be56f27466807c5c9e757545bd176fb12b59d159fe and ip: 172.20.201.72 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.

Can someone please help me with this?
Thank you!

Hmm, it actually seems to fail here:

(raylet) [2021-05-04 13:42:10,343 E 29850 29850] logging.cc:435:     @     0x555555778387 ray::raylet::ClusterTaskManager::DispatchScheduledTasksToWorkers()
(raylet) [2021-05-04 13:42:14,091 E 29850 29850] logging.cc:435:     @     0x55555577b4aa ray::raylet::ClusterTaskManager::ScheduleAndDispatchTasks()
(raylet) [2021-05-04 13:42:17,607 E 29850 29850] logging.cc:435:     @     0x55555570845b _ZZN3ray6raylet10WorkerPool28MonitorStartingWorkerProcessERKNS_7ProcessERKNS_3rpc8LanguageENS5_10WorkerTypeEENKUlN5boost6system10error_codeEE_clESC_.isra.0
(raylet) [2021-05-04 13:42:19,370 E 29850 29850] logging.cc:435:     @     0x555555708d7f _ZN5boost4asio6detail17executor_functionINS1_7binder1IZN3ray6raylet10WorkerPool28MonitorStartingWorkerProcessERKNS4_7ProcessERKNS4_3rpc8LanguageENSA_10WorkerTypeEEUlNS_6system10error_codeEE_SG_EESaIvEE11do_completeEPNS1_22executor_function_baseEb
(raylet) [2021-05-04 13:42:24,869 E 29850 29850] logging.cc:435:     @     0x5555556a4ef0 boost::asio::io_context::executor_type::dispatch<>()
(raylet) [2021-05-04 13:42:27,147 E 29850 29850] logging.cc:435:     @     0x555555708ab8 _ZN5boost4asio6detail12wait_handlerIZN3ray6raylet10WorkerPool28MonitorStartingWorkerProcessERKNS3_7ProcessERKNS3_3rpc8LanguageENS9_10WorkerTypeEEUlNS_6system10error_codeEE_NS1_18io_object_executorINS0_8executorEEEE11do_completeEPvPNS1_19scheduler_operationERKSF_m
(raylet) terminate called after throwing an instance of 'std::system_error'
(raylet)   what():  Resource temporarily unavailable
(raylet) terminate called after throwing an instance of 'std::system_error'
(raylet)   what():  terminate called after throwing an instance of 'Resource temporarily unavailablestd::system_error
(raylet) '

@tkmamidi could you post a copy of /tmp/ray/session_latest/logs?

Thanks for the reply. There are a bunch of logs. Which one am I specifically looking at?


Just FYI that I’m running multiple tuning scripts (multiple classifiers) and the above warnings are a mixture from all of those. So, I’m looking for explanations for each of those warnings.

Thanks in advance!

@tkmamidi if you could do:

tail -n 50 ./*

and post that output, that’d be much appreciated!

==> ./dashboard_agent.log <==
2021-05-04 18:57:32,175 INFO agent.py:72 -- Parent pid is 4512
2021-05-04 18:57:32,175 INFO agent.py:76 -- Dashboard agent grpc address: 172.20.201.73:41167
2021-05-04 18:57:32,178 INFO utils.py:202 -- Get all modules by type: DashboardAgentModule
2021-05-04 18:57:32,644 INFO agent.py:90 -- Loading DashboardAgentModule: <class 'ray.new_dashboard.modules.log.log_agent.LogAgent'>
2021-05-04 18:57:32,644 INFO agent.py:90 -- Loading DashboardAgentModule: <class 'ray.new_dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2021-05-04 18:57:32,647 INFO agent.py:95 -- Loaded 2 modules.
2021-05-04 18:57:32,648 INFO agent.py:163 -- Dashboard agent http address: 172.20.201.73:39199
2021-05-04 18:57:32,648 INFO agent.py:171 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/scratch/tmamidi/session_2021-05-04_18-57-29_208237_4454/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/scratch/tmamidi/session_2021-05-04_18-57-29_208237_4454/logs')>>
2021-05-04 18:57:32,648 INFO agent.py:171 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/scratch/tmamidi/session_2021-05-04_18-57-29_208237_4454/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x2aaac1f31af0>>
2021-05-04 18:57:32,649 INFO agent.py:172 -- Registered 2 routes.
2021-05-04 18:58:02,834 ERROR reporter_agent.py:531 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 520, in _perform_iteration
    formatted_status_string = await aioredis_client.hget(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/aioredis/pool.py", line 257, in _wait_execute
    conn = await self.acquire(command, args)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/aioredis/pool.py", line 324, in acquire
    await self._fill_free(override_min=True)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/aioredis/pool.py", line 383, in _fill_free
    conn = await self._create_new_connection(self._address)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/aioredis/connection.py", line 111, in create_connection
    reader, writer = await asyncio.wait_for(open_connection(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/asyncio/tasks.py", line 455, in wait_for
    return await fut
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/aioredis/stream.py", line 23, in open_connection
    transport, _ = await get_event_loop().create_connection(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/asyncio/base_events.py", line 1025, in create_connection
    raise exceptions[0]
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/asyncio/base_events.py", line 1010, in create_connection
    sock = await self._connect_sock(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/asyncio/base_events.py", line 924, in _connect_sock
    await self.sock_connect(sock, address)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/asyncio/selector_events.py", line 494, in sock_connect
    return await fut
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/asyncio/selector_events.py", line 526, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.20.201.73', 6379)

==> ./dashboard.log <==
2021-05-04 18:57:30,535 INFO head.py:135 -- Loaded 6 modules.
2021-05-04 18:57:30,537 INFO head.py:213 -- Dashboard head http address: 127.0.0.1:8265
2021-05-04 18:57:30,537 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /> -> <function Dashboard.get_index at 0x2aaabb6be0d0>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /favicon.ico> -> <function Dashboard.get_favicon at 0x2aaabb6be1f0>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <StaticResource  /static -> PosixPath('/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/new_dashboard/client/build/static')> -> <bound method StaticResource._handle of <StaticResource  /static -> PosixPath('/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/new_dashboard/client/build/static')>>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /jobs> -> <function JobHead.get_all_jobs[cache ttl=2, max_size=128] at 0x2aaabb6bedc0>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <DynamicResource  /jobs/{job_id}> -> <function JobHead.get_job[cache ttl=2, max_size=128] at 0x2aaabb6bef70>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /log_index> -> <function LogHead.get_log_index at 0x2aaabbce5670>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /log_proxy> -> <function LogHead.get_log_from_proxy at 0x2aaabbce5790>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /logical/actor_groups> -> <function LogicalViewHead.get_actor_groups at 0x2aaabbcfc790>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /logical/actors> -> <function LogicalViewHead.get_all_actors[cache ttl=2, max_size=128] at 0x2aaabbcfc8b0>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /logical/kill_actor> -> <function LogicalViewHead.kill_actor at 0x2aaabbcfca60>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /api/launch_profiling> -> <function ReportHead.launch_profiling at 0x2aaabbd4dc10>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /api/ray_config> -> <function ReportHead.get_ray_config at 0x2aaabbd4dd30>
2021-05-04 18:57:30,538 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /api/cluster_status> -> <function ReportHead.get_cluster_status at 0x2aaabbd4de50>
2021-05-04 18:57:30,539 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /nodes> -> <function StatsCollector.get_all_nodes[cache ttl=2, max_size=128] at 0x2aaac0c0d3a0>
2021-05-04 18:57:30,539 INFO head.py:227 -- <ResourceRoute [GET] <DynamicResource  /nodes/{node_id}> -> <function StatsCollector.get_node[cache ttl=2, max_size=128] at 0x2aaac0c0d550>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /memory/memory_table> -> <function StatsCollector.get_memory_table at 0x2aaac0c0d700>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /memory/set_fetch> -> <function StatsCollector.set_fetch_memory_info at 0x2aaac0c0d820>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /node_logs> -> <function StatsCollector.get_logs at 0x2aaac0c0d940>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /node_errors> -> <function StatsCollector.get_errors at 0x2aaac0c0da60>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /tune/info> -> <function TuneController.tune_info at 0x2aaac28ab700>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /tune/availability> -> <function TuneController.get_availability at 0x2aaac28ab820>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /tune/set_experiment> -> <function TuneController.set_tune_experiment at 0x2aaac28ab940>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <PlainResource  /tune/enable_tensorboard> -> <function TuneController.enable_tensorboard at 0x2aaac28aba60>
2021-05-04 18:57:30,540 INFO head.py:227 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/scratch/tmamidi/session_2021-05-04_18-57-29_208237_4454/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/scratch/tmamidi/session_2021-05-04_18-57-29_208237_4454/logs')>>
2021-05-04 18:57:30,540 INFO head.py:228 -- Registered 24 routes.
2021-05-04 18:57:30,541 INFO datacenter.py:65 -- Purge data.
2021-05-04 18:57:30,542 INFO reporter_head.py:138 -- Subscribed to RAY_REPORTER:*
2021-05-04 18:57:30,542 INFO job_head.py:73 -- Subscribed to JOB:*
2021-05-04 18:57:30,542 INFO job_head.py:78 -- Getting all job info from GCS.
2021-05-04 18:57:30,542 INFO stats_collector_head.py:176 -- Subscribed to ACTOR:*
2021-05-04 18:57:30,543 INFO stats_collector_head.py:186 -- Getting all actor info from GCS.
2021-05-04 18:57:30,543 INFO stats_collector_head.py:284 -- Subscribed to <_Sender name:b'RAY_LOG_CHANNEL', is_pattern:False, receiver:<Receiver is_active:True, senders:1, qsize:0>>
2021-05-04 18:57:30,543 INFO stats_collector_head.py:307 -- Subscribed to b'ERROR_INFO:*'
2021-05-04 18:57:30,544 INFO stats_collector_head.py:212 -- Received 0 actor info from GCS.
2021-05-04 18:57:30,544 INFO job_head.py:89 -- Received 0 job info from GCS.
2021-05-04 18:57:30,941 INFO stats_collector_head.py:296 -- Received a log for 172.20.201.73 and autoscaler
2021-05-04 18:57:32,153 INFO stats_collector_head.py:296 -- Received a log for 172.20.201.73 and raylet
2021-05-04 18:58:02,591 ERROR stats_collector_head.py:276 -- Error updating node stats of 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f.
Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/new_dashboard/modules/stats_collector/stats_collector_head.py", line 269, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1620172682.591427245","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5419,"referenced_errors":[{"created":"@1620172682.591424226","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"
>

==> ./gcs_server.err <==

==> ./gcs_server.out <==
[2021-05-04 18:57:29,487 I 4468 4468] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-04 18:57:29,488 I 4468 4468] gcs_redis_failure_detector.cc:30: Starting redis failure detector.
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:44: Loading job table data.
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:56: Loading node table data.
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:68: Loading object table data.
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:81: Loading cluster resources table data.
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:108: Loading actor table data.
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:94: Loading placement group table data.
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:48: Finished loading job table data, size = 0
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:60: Finished loading node table data, size = 0
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:73: Finished loading object table data, size = 0
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:85: Finished loading cluster resources table data, size = 0
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:112: Finished loading actor table data, size = 0
[2021-05-04 18:57:29,488 I 4468 4468] gcs_init_data.cc:99: Finished loading placement group table data, size = 0
[2021-05-04 18:57:29,488 I 4468 4468] gcs_heartbeat_manager.cc:30: GcsHeartbeatManager start, num_heartbeats_timeout=300
[2021-05-04 18:57:29,506 I 4468 4468] grpc_server.cc:71: GcsServer server started, listening on port 42496.
[2021-05-04 18:57:29,514 I 4468 4468] gcs_server.cc:276: Gcs server address = 172.20.201.73:42496
[2021-05-04 18:57:29,514 I 4468 4468] gcs_server.cc:280: Finished setting gcs server address: 172.20.201.73:42496
[2021-05-04 18:57:29,514 I 4468 4468] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 0, UnregisterNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-04 18:57:30,543 I 4468 4468] gcs_job_manager.cc:93: Getting all job info.
[2021-05-04 18:57:30,544 I 4468 4468] gcs_job_manager.cc:99: Finished getting all job info.
[2021-05-04 18:57:31,637 I 4468 4468] gcs_node_manager.cc:34: Registering node info, node id = 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f, address = 172.20.201.73
[2021-05-04 18:57:31,638 I 4468 4468] gcs_node_manager.cc:39: Finished registering node info, node id = 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f, address = 172.20.201.73
[2021-05-04 18:57:31,640 I 4468 4468] gcs_job_manager.cc:93: Getting all job info.
[2021-05-04 18:57:31,640 I 4468 4468] gcs_job_manager.cc:99: Finished getting all job info.
[2021-05-04 18:58:02,251 I 4468 4468] gcs_node_manager.cc:55: Unregistering node info, node id = 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f
[2021-05-04 18:58:02,252 I 4468 4468] gcs_node_manager.cc:136: Removing node, node id = 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f
[2021-05-04 18:58:02,252 I 4468 4468] gcs_placement_group_manager.cc:532: Node 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f failed, rescheduling the placement groups on the dead node.
[2021-05-04 18:58:02,252 I 4468 4468] gcs_actor_manager.cc:606: Node 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f failed, reconstructing actors.
[2021-05-04 18:58:02,252 I 4468 4468] gcs_node_manager.cc:72: Finished unregistering node info, node id = 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f
[2021-05-04 18:58:02,465 I 4468 4468] gcs_server_main.cc:111: GCS server received SIGTERM, shutting down...
[2021-05-04 18:58:02,465 I 4468 4468] gcs_server.cc:135: Stopping GCS server.
[2021-05-04 18:58:02,470 I 4468 4468] gcs_server.cc:142: GCS server stopped.
[2021-05-04 18:58:02,470 I 4468 4468] io_service_pool.cc:47: IOServicePool is stopped.

==> ./log_monitor.log <==
2021-05-04 18:57:30,939 INFO log_monitor.py:162 -- Beginning to track file raylet.err
2021-05-04 18:57:30,939 INFO log_monitor.py:162 -- Beginning to track file gcs_server.err
2021-05-04 18:57:30,939 INFO log_monitor.py:162 -- Beginning to track file monitor.log

==> ./monitor.err <==
    return self._handle_failure(f"Terminated with signal {sig}\n" +
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 264, in _handle_failure
    _internal_kv_put(DEBUG_AUTOSCALING_ERROR, message, overwrite=True)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 56, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 1202, in get_connection
    connection.connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.20.201.73:6379. Connection refused.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 376, in <module>
    monitor.run()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 286, in run
    self._handle_failure(traceback.format_exc())
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 264, in _handle_failure
    _internal_kv_put(DEBUG_AUTOSCALING_ERROR, message, overwrite=True)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 56, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.20.201.73:6379. Connection refused.

==> ./monitor.log <==
2021-05-04 18:57:29,913 INFO monitor.py:122 -- Monitor: Started
2021-05-04 18:58:02,646 ERROR monitor.py:253 -- Error in monitor loop
NoneType: None
2021-05-04 18:58:02,649 ERROR monitor.py:253 -- Error in monitor loop
Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 1198, in get_connection
    if connection.can_read():
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 734, in can_read
    return self._parser.can_read(timeout)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 416, in can_read
    return self.read_from_socket(timeout=timeout,
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 429, in read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 284, in run
    self._run()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 202, in _run
    time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 272, in _signal_handler
    return self._handle_failure(f"Terminated with signal {sig}\n" +
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 264, in _handle_failure
    _internal_kv_put(DEBUG_AUTOSCALING_ERROR, message, overwrite=True)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 56, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 1202, in get_connection
    connection.connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.20.201.73:6379. Connection refused.

==> ./monitor.out <==

==> ./old <==
tail: error reading ‘./old’: Is a directory

==> ./plasma_store.err <==
[2021-05-04 18:58:03,388 E 4511 4511] logging.cc:435: *** Aborted at 1620172683 (unix time) try "date -d @1620172683" if you are using GNU date ***
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435: PC: @                0x0 (unknown)
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435: *** SIGTERM (@0x2d2800001166) received by PID 4511 (TID 0x2aaaaaaf9b00) from PID 4454; stack trace: ***
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435:     @     0x5555555f677f google::(anonymous namespace)::FailureSignalHandler()
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435:     @     0x2aaaaacde630 (unknown)
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435:     @     0x2aaaaacdde80 __nanosleep_nocancel
[2021-05-04 18:58:03,390 E 4511 4511] logging.cc:435:     @     0x55555557049b main
[2021-05-04 18:58:03,390 E 4511 4511] logging.cc:435:     @     0x2aaaab7ab555 __libc_start_main
[2021-05-04 18:58:03,390 E 4511 4511] logging.cc:435:     @     0x555555572d35 (unknown)

==> ./plasma_store.out <==
[2021-05-04 18:57:30,594 I 4511 4511] store_exec.cc:81: The Plasma Store is started with the '-z' flag, and it will run idle as a placeholder.
[2021-05-04 18:58:03,388 E 4511 4511] logging.cc:435: *** Aborted at 1620172683 (unix time) try "date -d @1620172683" if you are using GNU date ***
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435: PC: @                0x0 (unknown)
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435: *** SIGTERM (@0x2d2800001166) received by PID 4511 (TID 0x2aaaaaaf9b00) from PID 4454; stack trace: ***
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435:     @     0x5555555f677f google::(anonymous namespace)::FailureSignalHandler()
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435:     @     0x2aaaaacde630 (unknown)
[2021-05-04 18:58:03,389 E 4511 4511] logging.cc:435:     @     0x2aaaaacdde80 __nanosleep_nocancel
[2021-05-04 18:58:03,390 E 4511 4511] logging.cc:435:     @     0x55555557049b main
[2021-05-04 18:58:03,390 E 4511 4511] logging.cc:435:     @     0x2aaaab7ab555 __libc_start_main
[2021-05-04 18:58:03,390 E 4511 4511] logging.cc:435:     @     0x555555572d35 (unknown)

==> ./ray_client_server.err <==
/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
INFO:ray.util.client.server.server:Starting Ray Client server on 0.0.0.0:10001

==> ./ray_client_server.out <==

==> ./raylet.err <==
/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(

==> ./raylet.out <==
[2021-05-04 18:57:30,605 I 4512 4512] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-04 18:57:30,614 I 4512 4512] store_runner.cc:29: Allowing the Plasma store to use up to 79.4475GB of memory.
[2021-05-04 18:57:30,614 I 4512 4512] store_runner.cc:42: Starting object store with directory /dev/shm and huge page support disabled
[2021-05-04 18:57:31,615 I 4512 4512] grpc_server.cc:71: ObjectManager server started, listening on port 36740.
[2021-05-04 18:57:31,632 I 4512 4512] node_manager.cc:230: Initializing NodeManager with ID 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f
[2021-05-04 18:57:31,632 I 4512 4512] grpc_server.cc:71: NodeManager server started, listening on port 44205.
[2021-05-04 18:57:31,636 I 4512 4560] agent_manager.cc:76: Monitor agent process with pid 4559, register timeout 30000ms.
[2021-05-04 18:57:31,638 I 4512 4512] raylet.cc:146: Raylet of id, 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f started. Raylet consists of node_manager and object_manager. node_manager address: 172.20.201.73:44205 object_manager address: 172.20.201.73:36740 hostname: 172.20.201.73
[2021-05-04 18:57:31,640 I 4512 4512] service_based_accessor.cc:579: Received notification for node id = 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f, IsAlive = 1
[2021-05-04 18:57:32,650 I 4512 4512] agent_manager.cc:32: HandleRegisterAgent, ip: 172.20.201.73, port: 41167, pid: 4559
[2021-05-04 18:58:02,251 I 4512 4512] main.cc:254: Raylet received SIGTERM, shutting down...
[2021-05-04 18:58:02,251 I 4512 4512] service_based_accessor.cc:403: Unregistering node info, node id = 9cb1db2c63275282ec371760325fabdfc4e107cd1ada9ed38ef2406f
[2021-05-04 18:58:02,251 I 4512 4512] io_service_pool.cc:47: IOServicePool is stopped.

==> ./redis.err <==

==> ./redis.out <==
4458:C 04 May 2021 18:57:29.263 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
4458:C 04 May 2021 18:57:29.263 # Redis version=6.0.10, bits=64, commit=00000000, modified=0, pid=4458, just started
4458:C 04 May 2021 18:57:29.263 # Configuration loaded
4458:M 04 May 2021 18:57:29.264 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
4458:M 04 May 2021 18:57:29.264 # Server initialized
4458:M 04 May 2021 18:57:29.264 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
4458:M 04 May 2021 18:57:29.264 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
4458:signal-handler (1620172682) Received SIGTERM scheduling shutdown...
4458:M 04 May 2021 18:58:02.615 # User requested shutdown...
4458:M 04 May 2021 18:58:02.615 # Redis is now ready to exit, bye bye...

==> ./redis-shard_0.err <==

==> ./redis-shard_0.out <==
4463:C 04 May 2021 18:57:29.371 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
4463:C 04 May 2021 18:57:29.371 # Redis version=6.0.10, bits=64, commit=00000000, modified=0, pid=4463, just started
4463:C 04 May 2021 18:57:29.371 # Configuration loaded
4463:M 04 May 2021 18:57:29.372 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
4463:M 04 May 2021 18:57:29.372 # Server initialized
4463:M 04 May 2021 18:57:29.372 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
4463:M 04 May 2021 18:57:29.372 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
4463:signal-handler (1620172682) Received SIGTERM scheduling shutdown...
4463:M 04 May 2021 18:58:02.644 # User requested shutdown...
4463:M 04 May 2021 18:58:02.644 # Redis is now ready to exit, bye bye...

It seems like these logs don't have the terminate called after throwing an instance of 'std::system_error' error.

Can you clear /tmp/ray, run your workload again, and then run the same tail command and post the output again?

Sure. I’ll run it again and post the output.

For the output I posted earlier, I got the heartbeat error mentioned in my comment above. I could use some help with debugging that as well.

Thanks for all the help; I really appreciate it!

==> ./dashboard.log <==
2021-05-06 11:10:06,525 INFO stats_collector_head.py:186 -- Getting all actor info from GCS.
2021-05-06 11:10:06,526 INFO stats_collector_head.py:284 -- Subscribed to <_Sender name:b'RAY_LOG_CHANNEL', is_pattern:False, receiver:<Receiver is_active:True, senders:1, qsize:0>>
2021-05-06 11:10:06,526 INFO stats_collector_head.py:307 -- Subscribed to b'ERROR_INFO:*'
2021-05-06 11:10:06,526 INFO stats_collector_head.py:212 -- Received 0 actor info from GCS.
2021-05-06 11:10:06,527 INFO job_head.py:89 -- Received 0 job info from GCS.
2021-05-06 11:10:06,963 INFO stats_collector_head.py:296 -- Received a log for 172.20.201.74 and autoscaler
2021-05-06 11:10:08,124 INFO stats_collector_head.py:296 -- Received a log for 172.20.201.74 and raylet
2021-05-06 11:20:06,529 INFO datacenter.py:65 -- Purge data.
2021-05-06 11:30:06,535 INFO datacenter.py:65 -- Purge data.
2021-05-06 11:40:06,537 INFO datacenter.py:65 -- Purge data.
2021-05-06 11:50:06,539 INFO datacenter.py:65 -- Purge data.
2021-05-06 12:00:06,541 INFO datacenter.py:65 -- Purge data.
2021-05-06 12:10:06,545 INFO datacenter.py:65 -- Purge data.
2021-05-06 12:20:06,552 INFO datacenter.py:65 -- Purge data.
2021-05-06 12:30:06,557 INFO datacenter.py:65 -- Purge data.
2021-05-06 12:40:06,562 INFO datacenter.py:65 -- Purge data.
2021-05-06 12:50:06,569 INFO datacenter.py:65 -- Purge data.
2021-05-06 13:00:06,571 INFO datacenter.py:65 -- Purge data.
2021-05-06 13:10:06,572 INFO datacenter.py:65 -- Purge data.
2021-05-06 13:20:06,573 INFO datacenter.py:65 -- Purge data.
2021-05-06 13:30:06,576 INFO datacenter.py:65 -- Purge data.
2021-05-06 13:40:06,578 INFO datacenter.py:65 -- Purge data.
2021-05-06 13:50:06,588 INFO datacenter.py:65 -- Purge data.
2021-05-06 14:00:06,594 INFO datacenter.py:65 -- Purge data.
2021-05-06 14:10:06,602 INFO datacenter.py:65 -- Purge data.
2021-05-06 14:20:06,602 INFO datacenter.py:65 -- Purge data.
2021-05-06 14:30:06,604 INFO datacenter.py:65 -- Purge data.
2021-05-06 14:40:06,605 INFO datacenter.py:65 -- Purge data.
2021-05-06 14:50:06,606 INFO datacenter.py:65 -- Purge data.
2021-05-06 15:00:06,610 INFO datacenter.py:65 -- Purge data.
2021-05-06 15:10:06,619 INFO datacenter.py:65 -- Purge data.
2021-05-06 15:20:06,623 INFO datacenter.py:65 -- Purge data.
2021-05-06 15:30:06,630 INFO datacenter.py:65 -- Purge data.
2021-05-06 15:40:06,634 INFO datacenter.py:65 -- Purge data.
2021-05-06 15:50:06,642 INFO datacenter.py:65 -- Purge data.
2021-05-06 16:00:06,646 INFO datacenter.py:65 -- Purge data.
2021-05-06 16:10:06,647 INFO datacenter.py:65 -- Purge data.
2021-05-06 16:20:06,652 INFO datacenter.py:65 -- Purge data.
2021-05-06 16:30:06,657 INFO datacenter.py:65 -- Purge data.
2021-05-06 16:33:23,681 ERROR stats_collector_head.py:276 -- Error updating node stats of 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb.
Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/new_dashboard/modules/stats_collector/stats_collector_head.py", line 269, in _update_node_stats
    reply = await stub.GetNodeStats(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1620336803.681321755","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5419,"referenced_errors":[{"created":"@1620336803.681318637","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"
>

==> ./gcs_server.err <==

==> ./gcs_server.out <==
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-06 16:29:06,552 I 23823 23823] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 1, UnregisterNode request count: 0, GetAllNodeInfo request count: 7618, GetInternalConfig request count: 1}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-06 16:30:06,556 I 23823 23823] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 1, UnregisterNode request count: 0, GetAllNodeInfo request count: 7642, GetInternalConfig request count: 1}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-06 16:31:06,556 I 23823 23823] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 1, UnregisterNode request count: 0, GetAllNodeInfo request count: 7666, GetInternalConfig request count: 1}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-06 16:32:06,556 I 23823 23823] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 1, UnregisterNode request count: 0, GetAllNodeInfo request count: 7690, GetInternalConfig request count: 1}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-06 16:33:06,556 I 23823 23823] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 1, UnregisterNode request count: 0, GetAllNodeInfo request count: 7714, GetInternalConfig request count: 1}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-06 16:33:22,931 I 23823 23823] gcs_node_manager.cc:55: Unregistering node info, node id = 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb
[2021-05-06 16:33:22,931 I 23823 23823] gcs_node_manager.cc:136: Removing node, node id = 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb
[2021-05-06 16:33:22,931 I 23823 23823] gcs_placement_group_manager.cc:532: Node 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb failed, rescheduling the placement groups on the dead node.
[2021-05-06 16:33:22,931 I 23823 23823] gcs_actor_manager.cc:606: Node 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb failed, reconstructing actors.
[2021-05-06 16:33:22,931 I 23823 23823] gcs_node_manager.cc:72: Finished unregistering node info, node id = 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb
[2021-05-06 16:33:23,044 I 23823 23823] gcs_server_main.cc:111: GCS server received SIGTERM, shutting down...
[2021-05-06 16:33:23,044 I 23823 23823] gcs_server.cc:135: Stopping GCS server.
[2021-05-06 16:33:23,049 I 23823 23823] gcs_server.cc:142: GCS server stopped.
[2021-05-06 16:33:23,049 I 23823 23823] io_service_pool.cc:47: IOServicePool is stopped.

==> ./log_monitor.log <==
2021-05-06 11:10:06,960 INFO log_monitor.py:162 -- Beginning to track file raylet.err
2021-05-06 11:10:06,961 INFO log_monitor.py:162 -- Beginning to track file gcs_server.err
2021-05-06 11:10:06,961 INFO log_monitor.py:162 -- Beginning to track file monitor.log

==> ./monitor.err <==
    return self._handle_failure(f"Terminated with signal {sig}\n" +
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 264, in _handle_failure
    _internal_kv_put(DEBUG_AUTOSCALING_ERROR, message, overwrite=True)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 56, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 1202, in get_connection
    connection.connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.20.201.74:6379. Connection refused.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 376, in <module>
    monitor.run()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 286, in run
    self._handle_failure(traceback.format_exc())
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 264, in _handle_failure
    _internal_kv_put(DEBUG_AUTOSCALING_ERROR, message, overwrite=True)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 56, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.20.201.74:6379. Connection refused.

==> ./monitor.log <==
2021-05-06 11:10:05,831 INFO monitor.py:122 -- Monitor: Started
2021-05-06 16:33:23,337 ERROR monitor.py:253 -- Error in monitor loop
NoneType: None
2021-05-06 16:33:23,340 ERROR monitor.py:253 -- Error in monitor loop
Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 1198, in get_connection
    if connection.can_read():
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 734, in can_read
    return self._parser.can_read(timeout)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 416, in can_read
    return self.read_from_socket(timeout=timeout,
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 429, in read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 284, in run
    self._run()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 202, in _run
    time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 272, in _signal_handler
    return self._handle_failure(f"Terminated with signal {sig}\n" +
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/monitor.py", line 264, in _handle_failure
    _internal_kv_put(DEBUG_AUTOSCALING_ERROR, message, overwrite=True)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 56, in _internal_kv_put
    updated = ray.worker.global_worker.redis_client.hset(
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 1202, in get_connection
    connection.connect()
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 172.20.201.74:6379. Connection refused.

==> ./monitor.out <==

==> ./old <==
tail: error reading ‘./old’: Is a directory

==> ./plasma_store.err <==
[2021-05-06 16:33:24,128 E 23867 23867] logging.cc:435: *** Aborted at 1620336804 (unix time) try "date -d @1620336804" if you are using GNU date ***
[2021-05-06 16:33:24,129 E 23867 23867] logging.cc:435: PC: @                0x0 (unknown)
[2021-05-06 16:33:24,129 E 23867 23867] logging.cc:435: *** SIGTERM (@0x2d2800005cff) received by PID 23867 (TID 0x2aaaaaaf9b00) from PID 23807; stack trace: ***
[2021-05-06 16:33:24,134 E 23867 23867] logging.cc:435:     @     0x5555555f677f google::(anonymous namespace)::FailureSignalHandler()
[2021-05-06 16:33:24,134 E 23867 23867] logging.cc:435:     @     0x2aaaaacde630 (unknown)
[2021-05-06 16:33:24,134 E 23867 23867] logging.cc:435:     @     0x2aaaaacdde80 __nanosleep_nocancel
[2021-05-06 16:33:24,135 E 23867 23867] logging.cc:435:     @     0x55555557049b main
[2021-05-06 16:33:24,135 E 23867 23867] logging.cc:435:     @     0x2aaaab7ab555 __libc_start_main
[2021-05-06 16:33:24,136 E 23867 23867] logging.cc:435:     @     0x555555572d35 (unknown)

==> ./plasma_store.out <==
[2021-05-06 11:10:06,619 I 23867 23867] store_exec.cc:81: The Plasma Store is started with the '-z' flag, and it will run idle as a placeholder.
[2021-05-06 16:33:24,128 E 23867 23867] logging.cc:435: *** Aborted at 1620336804 (unix time) try "date -d @1620336804" if you are using GNU date ***
[2021-05-06 16:33:24,129 E 23867 23867] logging.cc:435: PC: @                0x0 (unknown)
[2021-05-06 16:33:24,129 E 23867 23867] logging.cc:435: *** SIGTERM (@0x2d2800005cff) received by PID 23867 (TID 0x2aaaaaaf9b00) from PID 23807; stack trace: ***
[2021-05-06 16:33:24,134 E 23867 23867] logging.cc:435:     @     0x5555555f677f google::(anonymous namespace)::FailureSignalHandler()
[2021-05-06 16:33:24,134 E 23867 23867] logging.cc:435:     @     0x2aaaaacde630 (unknown)
[2021-05-06 16:33:24,134 E 23867 23867] logging.cc:435:     @     0x2aaaaacdde80 __nanosleep_nocancel
[2021-05-06 16:33:24,135 E 23867 23867] logging.cc:435:     @     0x55555557049b main
[2021-05-06 16:33:24,135 E 23867 23867] logging.cc:435:     @     0x2aaaab7ab555 __libc_start_main
[2021-05-06 16:33:24,136 E 23867 23867] logging.cc:435:     @     0x555555572d35 (unknown)

==> ./ray_client_server.err <==
/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
INFO:ray.util.client.server.server:Starting Ray Client server on 0.0.0.0:10001

==> ./ray_client_server.out <==

==> ./raylet.err <==
/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(

==> ./raylet.out <==
[2021-05-06 11:10:06,629 I 23868 23868] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-06 11:10:06,638 I 23868 23868] store_runner.cc:29: Allowing the Plasma store to use up to 79.4956GB of memory.
[2021-05-06 11:10:06,638 I 23868 23868] store_runner.cc:42: Starting object store with directory /dev/shm and huge page support disabled
[2021-05-06 11:10:07,639 I 23868 23868] grpc_server.cc:71: ObjectManager server started, listening on port 34945.
[2021-05-06 11:10:07,656 I 23868 23868] node_manager.cc:230: Initializing NodeManager with ID 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb
[2021-05-06 11:10:07,656 I 23868 23868] grpc_server.cc:71: NodeManager server started, listening on port 45147.
[2021-05-06 11:10:07,660 I 23868 23917] agent_manager.cc:76: Monitor agent process with pid 23916, register timeout 30000ms.
[2021-05-06 11:10:07,661 I 23868 23868] raylet.cc:146: Raylet of id, 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb started. Raylet consists of node_manager and object_manager. node_manager address: 172.20.201.74:45147 object_manager address: 172.20.201.74:34945 hostname: 172.20.201.74
[2021-05-06 11:10:07,664 I 23868 23868] service_based_accessor.cc:579: Received notification for node id = 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb, IsAlive = 1
[2021-05-06 11:10:08,562 I 23868 23868] agent_manager.cc:32: HandleRegisterAgent, ip: 172.20.201.74, port: 61122, pid: 23916
[2021-05-06 11:20:07,684 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 11:30:07,771 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 11:40:07,847 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 11:50:07,885 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 12:00:07,969 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 12:10:08,002 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 12:20:08,084 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 12:30:08,182 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 12:40:08,217 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 12:50:08,223 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 13:00:08,285 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 13:10:08,330 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 13:20:08,429 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 13:30:08,458 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 13:40:08,555 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 13:50:08,603 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 14:00:08,605 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 14:10:08,687 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 14:20:08,742 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 14:30:08,796 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 14:40:08,855 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 14:50:08,929 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 15:00:08,934 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 15:10:09,009 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 15:20:09,056 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 15:30:09,071 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 15:40:09,109 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 15:50:09,119 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 16:00:09,132 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 16:10:09,202 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 16:20:09,301 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 16:30:09,370 I 23868 23868] node_manager.cc:541: Sending Python GC request to 0 local workers to clean up Python cyclic references.
[2021-05-06 16:33:22,931 I 23868 23868] main.cc:254: Raylet received SIGTERM, shutting down...
[2021-05-06 16:33:22,931 I 23868 23868] service_based_accessor.cc:403: Unregistering node info, node id = 2b0737044534403624d8ebdf5244d07d18da84381b03741115f514bb
[2021-05-06 16:33:22,931 I 23868 23868] io_service_pool.cc:47: IOServicePool is stopped.

==> ./redis.err <==

==> ./redis.out <==
23813:C 06 May 2021 11:10:05.180 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
23813:C 06 May 2021 11:10:05.181 # Redis version=6.0.10, bits=64, commit=00000000, modified=0, pid=23813, just started
23813:C 06 May 2021 11:10:05.181 # Configuration loaded
23813:M 06 May 2021 11:10:05.181 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
23813:M 06 May 2021 11:10:05.182 # Server initialized
23813:M 06 May 2021 11:10:05.182 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
23813:M 06 May 2021 11:10:05.182 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
23813:signal-handler (1620336803) Received SIGTERM scheduling shutdown...
23813:M 06 May 2021 16:33:23.174 # User requested shutdown...
23813:M 06 May 2021 16:33:23.174 # Redis is now ready to exit, bye bye...

==> ./redis-shard_0.err <==

==> ./redis-shard_0.out <==
23818:C 06 May 2021 11:10:05.288 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
23818:C 06 May 2021 11:10:05.288 # Redis version=6.0.10, bits=64, commit=00000000, modified=0, pid=23818, just started
23818:C 06 May 2021 11:10:05.288 # Configuration loaded
23818:M 06 May 2021 11:10:05.289 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
23818:M 06 May 2021 11:10:05.289 # Server initialized
23818:M 06 May 2021 11:10:05.289 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
23818:M 06 May 2021 11:10:05.289 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
23818:signal-handler (1620336803) Received SIGTERM scheduling shutdown...
23818:M 06 May 2021 16:33:23.319 # User requested shutdown...
23818:M 06 May 2021 16:33:23.319 # Redis is now ready to exit, bye bye...

Hmm, so this fails after a couple of hours?

Maybe there's some form of a leak somewhere. Could you also post the output of dmesg?

Yes. It ran for more than 5 hours and failed. Here are some warnings I think are interesting:

(pid=24146) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.

(pid=24155) 2021-05-06 16:01:03,185 WARNING util.py:161 -- The `process_trial_result` operation took 18.735 s, which may be a performance bottleneck.

(pid=24155) 2021-05-06 16:01:03,185 WARNING util.py:161 -- Processing trial results took 18.735 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.

(pid=24155) 2021-05-06 16:01:03,185 WARNING util.py:161 -- The `process_trial` operation took 18.735 s, which may be a performance bottleneck.

(pid=24147) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.

(pid=24155) 2021-05-06 16:01:45,078 WARNING util.py:161 -- The `start_trial` operation took 41.409 s, which may be a performance bottleneck.

2021-05-06 16:02:05,587 WARNING worker.py:1115 -- The actor or task with ID aff58006d215ee60ffffffffffffffffffffffff01000000 cannot be scheduled right now. It requires {CPU_group_0_6e869a94547e08a612bcec7b159f1bc3: 0.010100}, {CPU_group_6e869a94547e08a612bcec7b159f1bc3: 0.010100} for placement, but this node only has remaining {12.000000/24.000000 CPU, 163.125250 GiB/163.125250 GiB memory, 73.902204 GiB/73.902204 GiB object_store_memory, 0.000000/0.480000 CPU_group_0_1157f952dbe5ec7908f5583c3659cf0f, 0.000000/0.480000 CPU_group_0a6c783ad66c540800cccc51b6e2766a, 0.000000/0.480000 CPU_group_0_d696c00770d940d2573693321be667d6, 0.000000/0.480000 CPU_group_0bea306e19ff67f1e87e6c2bf6c62a26, 0.480000/0.480000 CPU_group_0_6e869a94547e08a612bcec7b159f1bc3, 0.000000/0.480000 CPU_group_0_b596623ac2734505a735051f73654232, 0.000000/0.480000 CPU_group_b596623ac2734505a735051f73654232, 0.000000/0.480000 CPU_group_0_0a6c783ad66c540800cccc51b6e2766a, 0.000000/0.480000 CPU_group_0_0bea306e19ff67f1e87e6c2bf6c62a26, 0.000000/0.480000 CPU_group_3b1ea37902ae7c3cca9cfdd605f80263, 0.000000/0.480000 CPU_group_abc6258ea78e1a09735252feafc8e4f0, 0.000000/0.480000 CPU_group_0_29fa629da8e034003476106099746f77, 0.000000/0.480000 CPU_group_29fa629da8e034003476106099746f77, 0.000000/0.480000 CPU_group_0_211a34599e6915df78b832e59f2bd1fe, 0.000000/0.480000 CPU_group_e853e851c6b94e9e1dc151c765dfb2a7, 0.000000/0.480000 CPU_group_d43b552792a1e743aa9c53d8afbc8cec, 0.000000/0.480000 CPU_group_0_d43b552792a1e743aa9c53d8afbc8cec, 0.000000/0.480000 CPU_group_1067e87b1045860fa4a73044ab897e55, 0.000000/0.480000 CPU_group_77412c5e8aa194439d31c4a551b9c563, 0.000000/0.480000 CPU_group_0_bc1e88ee8a5bc323772d54a364c5fcc1, 1.000000/1.000000 node:172.20.201.74, 0.000000/0.480000 CPU_group_0_c8a85dd2b89d3dc0483609617dc8de34, 0.000000/0.480000 CPU_group_0_3b1ea37902ae7c3cca9cfdd605f80263, 0.000000/0.480000 CPU_group_0_1067e87b1045860fa4a73044ab897e55, 0.000000/0.480000 CPU_group_9b7f04fc175476f374f36354c225c7c6, 0.000000/0.480000 CPU_group_0_9b7f04fc175476f374f36354c225c7c6, 0.000000/0.480000 CPU_group_0_292dcefe45c4b8053a9ea5c079a647cd, 0.000000/0.480000 CPU_group_b82dd4fd7167332ee3fbfe2a8f14153e, 0.000000/0.480000 CPU_group_0_e853e851c6b94e9e1dc151c765dfb2a7, 0.000000/0.480000 CPU_group_0_4e0c108f8623cbb79f624db35d276e70, 0.000000/0.480000 CPU_group_e7c94681ac7665a3377ffb9c3373c9c7, 0.000000/0.480000 CPU_group_0_77412c5e8aa194439d31c4a551b9c563, 0.000000/0.480000 CPU_group_0_a8d79e94e777b89ded5f6d4ff6c904dd, 0.000000/0.480000 CPU_group_a8d79e94e777b89ded5f6d4ff6c904dd, 0.000000/0.480000 CPU_group_0_b82dd4fd7167332ee3fbfe2a8f14153e, 0.000000/0.480000 CPU_group_e4a241ca391cf0b081e5b9364876ad4f, 0.000000/0.480000 CPU_group_d696c00770d940d2573693321be667d6, 0.480000/0.480000 CPU_group_6e869a94547e08a612bcec7b159f1bc3, 0.000000/0.480000 CPU_group_292dcefe45c4b8053a9ea5c079a647cd, 0.000000/0.480000 CPU_group_4e0c108f8623cbb79f624db35d276e70, 0.000000/0.480000 CPU_group_1157f952dbe5ec7908f5583c3659cf0f, 0.000000/0.480000 CPU_group_479a4b3cdfac138b9b22e90a9b5e1ed5, 0.000000/0.480000 CPU_group_bc1e88ee8a5bc323772d54a364c5fcc1, 0.000000/0.480000 CPU_group_a8714f3ac49834572bd6a7d9148d86b8, 0.000000/0.480000 CPU_group_0_a8714f3ac49834572bd6a7d9148d86b8, 0.000000/0.480000 CPU_group_0_abc6258ea78e1a09735252feafc8e4f0, 0.000000/0.480000 CPU_group_c8a85dd2b89d3dc0483609617dc8de34, 0.000000/0.480000 CPU_group_0_479a4b3cdfac138b9b22e90a9b5e1ed5, 0.000000/0.480000 CPU_group_211a34599e6915df78b832e59f2bd1fe, 0.000000/0.480000 
CPU_group_0_e4a241ca391cf0b081e5b9364876ad4f, 0.000000/0.480000 CPU_group_0_e7c94681ac7665a3377ffb9c3373c9c7}. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
(raylet) [2021-05-06 16:18:29,701 C 24066 24066] worker_pool.cc:356: Failed to start worker with return value system:11: Resource temporarily unavailable
(raylet) [2021-05-06 16:18:29,702 E 24066 24066] logging.cc:435: *** Aborted at 1620335909 (unix time) try "date -d @1620335909" if you are using GNU date ***
(raylet) [2021-05-06 16:18:29,702 E 24066 24066] logging.cc:435: PC: @                0x0 (unknown)
(pid=24147) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
(pid=24155) 2021-05-06 16:19:00,416 WARNING util.py:161 -- The `start_trial` operation took 65.720 s, which may be a performance bottleneck.
(pid=24136) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
(raylet) [2021-05-06 16:20:45,475 E 24066 24066] logging.cc:435: *** SIGABRT (@0x2d2800005e02) received by PID 24066 (TID 0x2aaaaaafa240) from PID 24066; stack trace: ***
(pid=24141) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
(pid=24148) Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
(raylet) terminate called after throwing an instance of 'std::system_error'
(raylet)   what():  Resource temporarily unavailable
(raylet) terminate called after throwing an instance of 'std::system_error'
(raylet)   what():  Resource temporarily unavailable
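
For reference, both recurring warnings above can be tackled from the TuneSearchCV side. A minimal sketch, assuming a LogisticRegression-style estimator and a made-up search space, since the actual model code isn't shown in this thread:

from sklearn.linear_model import LogisticRegression
from tune_sklearn import TuneSearchCV

# Raising max_iter on the estimator itself addresses the repeated sklearn
# convergence warning (the default of 100 is often too low).
model = LogisticRegression(max_iter=1000)

search = TuneSearchCV(
    model,
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    n_trials=20,
    n_jobs=4,  # fewer concurrent trials -> less placement-group contention
)
# search.fit(X_train, y_train)  # X_train / y_train stand in for the user's data

Capping n_jobs corresponds to the "consider creating fewer actors" advice in the scheduling warning above.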

Here's part of the dmesg output:

[24378377.083743] type=2404 audit(1620356811.436:20483585): pid=4119 uid=0 auid=10375 ses=806504 msg='op=destroy kind=session fp=? direction=from-server spid=4188 suid=10375 rport=58114 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378379.595559] type=2407 audit(1620356813.948:20483586): pid=4115 uid=0 auid=10375 ses=806503 msg='op=start direction=from-server cipher=aes256-gcm@openssh.com ksize=256 mac=<implicit> pfs=curve25519-sha256@libssh.org spid=4175 suid=10375 rport=58113 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378379.595614] type=2407 audit(1620356813.948:20483587): pid=4115 uid=0 auid=10375 ses=806503 msg='op=start direction=from-client cipher=aes256-gcm@openssh.com ksize=256 mac=<implicit> pfs=curve25519-sha256@libssh.org spid=4175 suid=10375 rport=58113 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378379.660541] type=2404 audit(1620356814.013:20483588): pid=4115 uid=0 auid=10375 ses=806503 msg='op=destroy kind=session fp=? direction=from-client spid=4175 suid=10375 rport=58113 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378380.176949] type=2404 audit(1620356814.530:20483589): pid=4115 uid=0 auid=10375 ses=806503 msg='op=destroy kind=session fp=? direction=from-server spid=4175 suid=10375 rport=58113 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378394.902944] type=2407 audit(1620356829.257:20483590): pid=45747 uid=0 auid=11342 ses=803554 msg='op=start direction=from-server cipher=aes256-gcm@openssh.com ksize=256 mac=<implicit> pfs=curve25519-sha256@libssh.org spid=45765 suid=11342 rport=42103 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.148.194 terminal=? res=success'
[24378394.902992] type=2407 audit(1620356829.257:20483591): pid=45747 uid=0 auid=11342 ses=803554 msg='op=start direction=from-client cipher=aes256-gcm@openssh.com ksize=256 mac=<implicit> pfs=curve25519-sha256@libssh.org spid=45765 suid=11342 rport=42103 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.148.194 terminal=? res=success'
[24378394.911793] type=2404 audit(1620356829.266:20483592): pid=45747 uid=0 auid=11342 ses=803554 msg='op=destroy kind=session fp=? direction=from-client spid=45765 suid=11342 rport=42103 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.148.194 terminal=? res=success'
[24378394.922216] type=2404 audit(1620356829.276:20483593): pid=45747 uid=0 auid=11342 ses=803554 msg='op=destroy kind=session fp=? direction=from-server spid=45765 suid=11342 rport=42103 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.148.194 terminal=? res=success'
[24378398.432542] type=2407 audit(1620356832.787:20483594): pid=4115 uid=0 auid=10375 ses=806503 msg='op=start direction=from-server cipher=aes256-gcm@openssh.com ksize=256 mac=<implicit> pfs=curve25519-sha256@libssh.org spid=4175 suid=10375 rport=58113 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378398.432592] type=2407 audit(1620356832.787:20483595): pid=4115 uid=0 auid=10375 ses=806503 msg='op=start direction=from-client cipher=aes256-gcm@openssh.com ksize=256 mac=<implicit> pfs=curve25519-sha256@libssh.org spid=4175 suid=10375 rport=58113 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378398.495902] type=2404 audit(1620356832.850:20483596): pid=4115 uid=0 auid=10375 ses=806503 msg='op=destroy kind=session fp=? direction=from-client spid=4175 suid=10375 rport=58113 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378399.028577] type=2404 audit(1620356833.383:20483597): pid=4115 uid=0 auid=10375 ses=806503 msg='op=destroy kind=session fp=? direction=from-server spid=4175 suid=10375 rport=58113 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378399.744670] type=2407 audit(1620356834.099:20483598): pid=4119 uid=0 auid=10375 ses=806504 msg='op=start direction=from-server cipher=aes256-gcm@openssh.com ksize=256 mac=<implicit> pfs=curve25519-sha256@libssh.org spid=4188 suid=10375 rport=58114 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378399.744723] type=2407 audit(1620356834.099:20483599): pid=4119 uid=0 auid=10375 ses=806504 msg='op=start direction=from-client cipher=aes256-gcm@openssh.com ksize=256 mac=<implicit> pfs=curve25519-sha256@libssh.org spid=4188 suid=10375 rport=58114 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'
[24378400.327677] audit_printk_skb: 2 callbacks suppressed
[24378400.327680] type=2404 audit(1620356834.682:20483601): pid=4119 uid=0 auid=10375 ses=806504 msg='op=destroy kind=session fp=? direction=from-server spid=4188 suid=10375 rport=58114 laddr=10.111.161.26 lport=22  exe="/usr/sbin/sshd" hostname=? addr=138.26.17.71 terminal=? res=success'

Thanks for helping me out!

Hmm, I suspect this may be files or threads being leaked… @Alex or @sangcho, could you take a look when you get the chance?
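
A hedged way to check that hypothesis on the node is to sample file-descriptor and thread counts for the Ray processes and compare them against the per-user process limit; the raylet abort above ("Resource temporarily unavailable", i.e. EAGAIN) usually means such a limit was exhausted. A sketch, assuming psutil is installed (resource is stdlib):

import resource
import psutil

# Per-user process/thread limit; worker startup hits EAGAIN when exhausted.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"max user processes (soft/hard): {soft}/{hard}")

# FD and thread counts per Ray process; rerun periodically to spot growth.
for proc in psutil.process_iter():
    try:
        if "ray" in " ".join(proc.cmdline()) or "raylet" in proc.name():
            print(f"pid={proc.pid} {proc.name()}: "
                  f"fds={proc.num_fds()} threads={proc.num_threads()}")
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue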

Which version of Ray are you using? Also, did you observe the memory usage while it was running for 3-4 hours? (Was it increasing?)

I'm using Ray 1.3.0.

Sorry, I couldn't check the memory usage while it was running, since I'm on a cluster.
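
Since watching a cluster node interactively isn't an option, one workaround for the next run is a small background logger that appends the combined RSS of the Ray processes to a file every minute, so the trend can be read back after a crash. A sketch, assuming psutil is available (the log path is made up):

import time
import psutil

def ray_rss_gb():
    # Combined resident set size of all processes with "ray" in their cmdline.
    total = 0
    for proc in psutil.process_iter():
        try:
            if "ray" in " ".join(proc.cmdline()):
                total += proc.memory_info().rss
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return total / 1e9

with open("ray_mem_log.txt", "a") as log:  # hypothetical log path
    while True:
        log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {ray_rss_gb():.2f} GB\n")
        log.flush()
        time.sleep(60)

If the logged RSS climbs steadily over the hours before the failure, that would support the leak hypothesis.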