Log monitor failing

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I just had a Ray Tune run fail with the following error. Any ideas what might cause this? The only thing I noticed is that it may have happened at exactly the same time as a new worker connecting to the cluster. It's possible that two workers were starting up at the same time on the same physical machine, and also possible that one or more workers were starting up on the same physical machine as the Ray head node. This is unfortunately something I cannot control. Is there anything I can do to avoid this error? (One idea I've had is sketched after the error output below.)

The log monitor on node ... failed with the following error:
Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/ray/_private/log_monitor.py", line 520, in <module>
    log_monitor.run()
  File ".../lib/python3.9/site-packages/ray/_private/log_monitor.py", line 436, in run
    anything_published = self.check_log_files_and_publish_updates()
  File ".../lib/python3.9/site-packages/ray/_private/log_monitor.py", line 339, in check_log_files_and_publish_updates
    file_info.reopen_if_necessary()
  File ".../lib/python3.9/site-packages/ray/_private/log_monitor.py", line 79, in reopen_if_necessary
    new_inode = os.stat(self.filename).st_ino
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2022-10-24_13-45-31_177186_217110/logs/worker-b3d672e65bce090c0274fb735a884d98d101f97ff36d5de1495c9cd5-01000000-200902.out'

Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2022-10-24 18:15:50,522  WARNING dataclient.py:395 -- Encountered connection issues in the data channel. Attempting to reconnect.
2022-10-24 18:16:20,549  WARNING dataclient.py:402 -- Failed to reconnect the data channel

Traceback (most recent call last):
  File ".../main.py", line 519, in <module>
    main(args, args.num_cpus, group=args.experiment_group, name=args.experiment_name, ray_local_mode=args.ray_local_mode)
  File ".../main.py", line 495, in main
    tune.run(experiments, callbacks=callbacks, raise_on_failed_trial=False)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 420, in run
    while ray.wait([remote_future], timeout=0.2)[1]:
  File ".../lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File ".../lib/python3.9/site-packages/ray/util/client/api.py", line 61, in wait
    return self.worker.wait(*args, **kwargs)
  File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 531, in wait
    resp = self._call_stub("WaitObject", req, metadata=self.metadata)
  File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 307, in _call_stub
    raise ConnectionError("Client is shutting down.")
ConnectionError: Client is shutting down.
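
The only idea I've had so far is to give each job its own Ray temp directory, so two Ray instances on one machine don't share /tmp/ray. Something like the sketch below, but this is untested and assumes --temp-dir / _temp_dir behave as documented:

# Untested workaround sketch: give each Slurm job its own Ray temp dir so two
# Ray instances on the same physical machine don't share /tmp/ray.
# Assumes `ray start --temp-dir` and `ray.init(_temp_dir=...)` behave as
# documented; the SLURM_JOB_ID-based path is just an example.
import os
import subprocess

temp_dir = f"/tmp/ray_{os.environ.get('SLURM_JOB_ID', 'local')}"

# When starting the head node from the shell:
subprocess.run(["ray", "start", "--head", f"--temp-dir={temp_dir}"], check=True)

# Or, when starting Ray from Python instead of the CLI:
# import ray
# ray.init(_temp_dir=temp_dir)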

Hm, let me see if the observability folks have any insight here.

cc @sangcho @rickyyx

@mgerstgrasser Thanks for the question and the observation!

A couple of questions:

  1. Are you able to gather the newly joined worker's worker id?
  2. Could you also check the core worker log file for that particular worker, /tmp/ray/session_2022-10-24_13-45-31_177186_217110/logs/python-core-worker-b3d672e65bce090c0274fb735a884d98d101f97ff36d5de1495c9cd5_200902.* in the log directory? (A quick way to list those files is sketched right after this list.) I'm wondering why the worker log file was gone in the original error.
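
In case it's useful, here is a rough way to list the per-worker core worker logs in that session's log directory. This is just a sketch based on the filename pattern from your error, not an official API; adjust the session path for your run:

# Sketch: list core worker log files for a given Ray session directory and
# pull out the worker id and pid embedded in each filename
# (python-core-worker-<worker_id>_<pid>.log). Path taken from the error above.
import glob
import os
import re

log_dir = "/tmp/ray/session_2022-10-24_13-45-31_177186_217110/logs"
pattern = re.compile(r"python-core-worker-([0-9a-f]+)_(\d+)\.log$")

for path in sorted(glob.glob(os.path.join(log_dir, "python-core-worker-*"))):
    match = pattern.search(os.path.basename(path))
    if match:
        worker_id, pid = match.groups()
        print(f"worker_id={worker_id} pid={pid} -> {path}")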
  1. Where would I find that?
  2. Pasting the contents of /tmp/ray/session_2022-10-24_13-45-31_177186_217110/logs/python-core-worker-b3d672e65bce090c0274fb735a884d98d101f97ff36d5de1495c9cd5_200902.log below.

As for why: I wonder if there's some issue with two separate Ray instances starting on the same machine at the same time, or even just two running at the same time. E.g., does anything write to /tmp/ray/session_latest/? If there are two Ray instances, the second one starting up would repoint that link, so for the first instance it would then point to the wrong directory. Just a theory, but I feel like I've had a few (seemingly) separate issues pop up on our Slurm cluster, and the main difference I can think of compared to running on AWS is that multiple jobs can run on the same physical machine.
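
For what it's worth, a quick check along these lines could tell whether session_latest on that node points somewhere other than the session from the error. This is only a diagnostic sketch, with the paths copied from the traceback above:

# Diagnostic sketch only: see which session /tmp/ray/session_latest resolves
# to on the node, and whether the worker log file from the error still exists.
# Paths are copied from the traceback above.
import os

expected_session = "/tmp/ray/session_2022-10-24_13-45-31_177186_217110"
missing_log = os.path.join(
    expected_session,
    "logs",
    "worker-b3d672e65bce090c0274fb735a884d98d101f97ff36d5de1495c9cd5"
    "-01000000-200902.out",
)

print("session_latest ->", os.path.realpath("/tmp/ray/session_latest"))
print("expected session:", expected_session)
print("missing worker log still exists:", os.path.exists(missing_log))

The core worker log for that worker is below: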

[2022-10-24 18:11:30,439 I 200902 200902] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 200902
[2022-10-24 18:11:30,443 I 200902 200902] grpc_server.cc:105: worker server started, listening on port 11980.
[2022-10-24 18:11:30,447 I 200902 200902] core_worker.cc:185: Initializing worker at address: 10.31.143.135:11980, worker ID b3d672e65bce090c0274fb735a884d98d101f97ff36d5de1495c9cd5, raylet 2a56d398c08a21530fbdb76959d264932f0cab1b3d0df5d0bd9a7037
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = b7c8be5811db6367b03e50a78813492f2763393311d6328d38ea20ce, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = ebf20e2dee56623a5f387e39928d1ce00e1f391c05edb239b7ecc174, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = e4262b109266027df84a96a9bf2a6e949fd19a675ebd8e7531d6eb23, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = a0aa52228a9483d3f593e6f50da95918f3f3f507a19588b2e5aa523c, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 84a166705db6e9ab1e92186a327d23337b440ce803d4c4951f7368c2, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = bd21f2f933925edf87a95e912efea830d434468b4fbdec7e50925099, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = dcdc67122accc266554ed35848f662264cb73f74db6d00be14596f9a, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 57c8a5c20c97719eea9da886974c4480306f1133afbe066b3063ed46, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = fba9388101ab6cf948a7597adc4e72d84b44b63fa9aeb459d75081e9, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 6c92add3574d88ee4cb67f85d984babff29a8facb14981f1d0dbefb2, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = cfcd747dfb48a314d0e0e8644a02504d90ef74a15d6c692a4638d88b, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 9679dc9995cc0c7dab5c25ce55acad078f8ac50b368c420997ebe0ed, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 2a56d398c08a21530fbdb76959d264932f0cab1b3d0df5d0bd9a7037, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] core_worker.cc:476: Event stats:


Global stats: 11 total (5 active)
Queueing time: mean = 101.253 us, max = 974.529 us, min = 8.089 us, total = 1.114 ms
Execution time:  mean = 32.921 us, total = 362.129 us
Event stats:
	PeriodicalRunner.RunFnPeriodically - 4 total (1 active, 1 running), CPU time: mean = 7.850 us, total = 31.399 us
	UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 101.959 us, total = 101.959 us
	CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 183.898 us, total = 183.898 us
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 44.873 us, total = 44.873 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s


[2022-10-24 18:11:30,449 I 200902 200902] core_worker.cc:521: Adjusted worker niceness to 15
[2022-10-24 18:11:30,450 I 200902 200902] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-10-24 18:11:32,771 I 200902 200902] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 97a5c506632456b6ac28fa3101000000
[2022-10-24 18:11:32,775 I 200902 200954] actor_manager.cc:214: received notification on actor, state: PENDING_CREATION, actor_id: 97a5c506632456b6ac28fa3101000000, ip address: 10.31.133.85, port: 16600, worker_id: NIL_ID, raylet_id: b7c8be5811db6367b03e50a78813492f2763393311d6328d38ea20ce, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-10-24 18:11:42,318 I 200902 200954] actor_manager.cc:214: received notification on actor, state: ALIVE, actor_id: 97a5c506632456b6ac28fa3101000000, ip address: 10.31.133.85, port: 16615, worker_id: 0b3f3adc815ab0d2f3655927085368e1feeb6f4d51c355b38bacccb3, raylet_id: b7c8be5811db6367b03e50a78813492f2763393311d6328d38ea20ce, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-10-24 18:11:42,319 I 200902 200954] direct_actor_task_submitter.cc:229: Connecting to actor 97a5c506632456b6ac28fa3101000000 at worker 0b3f3adc815ab0d2f3655927085368e1feeb6f4d51c355b38bacccb3
[2022-10-24 18:11:42,462 I 200902 200954] actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: 97a5c506632456b6ac28fa3101000000, ip address: 10.31.133.85, port: 16615, worker_id: 0b3f3adc815ab0d2f3655927085368e1feeb6f4d51c355b38bacccb3, raylet_id: b7c8be5811db6367b03e50a78813492f2763393311d6328d38ea20ce, num_restarts: 0, death context type=ActorDiedErrorContext
[2022-10-24 18:11:42,462 I 200902 200954] direct_actor_task_submitter.cc:278: Failing pending tasks for actor 97a5c506632456b6ac28fa3101000000 because the actor is already dead.
[2022-10-24 18:11:43,579 I 200902 200954] core_worker.cc:606: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_SYSTEM_EXIT
[2022-10-24 18:11:43,579 I 200902 200902] core_worker.cc:593: Disconnecting to the raylet.
[2022-10-24 18:11:43,579 I 200902 200902] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_SYSTEM_EXIT, exit_detail=Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time., has creation_task_exception_pb_bytes=0
[2022-10-24 18:11:43,579 I 200902 200902] core_worker.cc:540: Shutting down a core worker.
[2022-10-24 18:11:43,579 I 200902 200902] core_worker.cc:564: Disconnecting a GCS client.
[2022-10-24 18:11:43,579 I 200902 200902] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-10-24 18:11:43,579 I 200902 200954] core_worker.cc:691: Core worker main io service stopped.
[2022-10-24 18:11:43,580 I 200902 200902] core_worker.cc:577: Core worker ready to be deallocated.
[2022-10-24 18:11:43,580 I 200902 200902] core_worker_process.cc:240: Task execution loop terminated. Removing the global worker.
[2022-10-24 18:11:43,580 I 200902 200902] core_worker.cc:531: Core worker is destructed
[2022-10-24 18:11:43,718 I 200902 200902] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 200902
[2022-10-24 18:11:43,722 I 200902 200902] io_service_pool.cc:47: IOServicePool is stopped.

Hey @mgerstgrasser - sorry for the late reply; this somehow slipped through. I think this is very likely what happened: essentially a race on the /session_latest/logs folder. We are revisiting this issue right now as well, and will post a tracking issue ASAP.

Tracking issue: [core][cluster] log monitor errors when os stat fails on worker log · Issue #31119 · ray-project/ray · GitHub
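
For context, the crash comes from the unguarded os.stat in reopen_if_necessary. A guard of roughly the following shape would let the log monitor skip a vanished file instead of dying. This is only a sketch, not the actual patch being tracked in that issue, and the attribute names are illustrative:

# Sketch only (not the actual Ray patch): the traceback above shows an
# unguarded os.stat on a log file that another Ray session on the same
# machine can delete out from under the log monitor.
import os


class LogFileInfoSketch:
    def __init__(self, filename):
        self.filename = filename
        self.file_handle = None
        self.file_position = 0

    def reopen_if_necessary(self):
        try:
            new_inode = os.stat(self.filename).st_ino
        except FileNotFoundError:
            # File vanished between discovery and stat; skip this file
            # instead of crashing the whole log monitor.
            return
        open_inode = None
        if self.file_handle and not self.file_handle.closed:
            open_inode = os.fstat(self.file_handle.fileno()).st_ino
        if new_inode != open_inode:
            # The file was rotated or recreated; reopen and restore position.
            self.file_handle = open(self.filename, "rb")
            self.file_handle.seek(self.file_position)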
