- Where would I find that?
- Pasting the contents of
/tmp/ray/session_2022-10-24_13-45-31_177186_217110/logs/python-core-worker-b3d672e65bce090c0274fb735a884d98d101f97ff36d5de1495c9cd5_200902.log
below.
As for why, I wonder if there's some issue with two separate Ray instances starting on the same machine at the same time, or even just two running at the same time. E.g., is anything trying to write to /tmp/ray/session_latest/? If there are two Ray instances, the second one starting up will repoint that symlink at the wrong session directory. Just a theory, but I feel like I've had a few (seemingly) separate issues pop up on our Slurm cluster, and the main difference I can think of compared to running on AWS is that multiple jobs can land on the same physical machine there. Something like the sketch below is what I'd try as a workaround.
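Untested sketch of the workaround I have in mind: give each Slurm job its own Ray temp dir so concurrent jobs on one node don't share /tmp/ray and its session_latest symlink. `_temp_dir` is the underscore-prefixed override that `ray.init` accepts; `ray start` has the equivalent `--temp-dir` flag.

```python
import os
import ray

# SLURM_JOB_ID is set by Slurm inside a job; fall back to the PID otherwise.
job_id = os.environ.get("SLURM_JOB_ID", str(os.getpid()))

# With a per-job temp dir, each Ray instance keeps its session directories
# and its own session_latest symlink under a path no other job touches.
ray.init(_temp_dir=f"/tmp/ray_{job_id}")
```

If that makes the weirdness go away, it would at least confirm the two instances were stepping on each other's session state.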
[2022-10-24 18:11:30,439 I 200902 200902] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 200902
[2022-10-24 18:11:30,443 I 200902 200902] grpc_server.cc:105: worker server started, listening on port 11980.
[2022-10-24 18:11:30,447 I 200902 200902] core_worker.cc:185: Initializing worker at address: 10.31.143.135:11980, worker ID b3d672e65bce090c0274fb735a884d98d101f97ff36d5de1495c9cd5, raylet 2a56d398c08a21530fbdb76959d264932f0cab1b3d0df5d0bd9a7037
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = b7c8be5811db6367b03e50a78813492f2763393311d6328d38ea20ce, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = ebf20e2dee56623a5f387e39928d1ce00e1f391c05edb239b7ecc174, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = e4262b109266027df84a96a9bf2a6e949fd19a675ebd8e7531d6eb23, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = a0aa52228a9483d3f593e6f50da95918f3f3f507a19588b2e5aa523c, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 84a166705db6e9ab1e92186a327d23337b440ce803d4c4951f7368c2, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = bd21f2f933925edf87a95e912efea830d434468b4fbdec7e50925099, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = dcdc67122accc266554ed35848f662264cb73f74db6d00be14596f9a, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 57c8a5c20c97719eea9da886974c4480306f1133afbe066b3063ed46, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = fba9388101ab6cf948a7597adc4e72d84b44b63fa9aeb459d75081e9, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 6c92add3574d88ee4cb67f85d984babff29a8facb14981f1d0dbefb2, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = cfcd747dfb48a314d0e0e8644a02504d90ef74a15d6c692a4638d88b, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 9679dc9995cc0c7dab5c25ce55acad078f8ac50b368c420997ebe0ed, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] accessor.cc:608: Received notification for node id = 2a56d398c08a21530fbdb76959d264932f0cab1b3d0df5d0bd9a7037, IsAlive = 1
[2022-10-24 18:11:30,449 I 200902 200954] core_worker.cc:476: Event stats:
Global stats: 11 total (5 active)
Queueing time: mean = 101.253 us, max = 974.529 us, min = 8.089 us, total = 1.114 ms
Execution time: mean = 32.921 us, total = 362.129 us
Event stats:
PeriodicalRunner.RunFnPeriodically - 4 total (1 active, 1 running), CPU time: mean = 7.850 us, total = 31.399 us
UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 101.959 us, total = 101.959 us
CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 183.898 us, total = 183.898 us
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 44.873 us, total = 44.873 us
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[2022-10-24 18:11:30,449 I 200902 200902] core_worker.cc:521: Adjusted worker niceness to 15
[2022-10-24 18:11:30,450 I 200902 200902] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-10-24 18:11:32,771 I 200902 200902] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 97a5c506632456b6ac28fa3101000000
[2022-10-24 18:11:32,775 I 200902 200954] actor_manager.cc:214: received notification on actor, state: PENDING_CREATION, actor_id: 97a5c506632456b6ac28fa3101000000, ip address: 10.31.133.85, port: 16600, worker_id: NIL_ID, raylet_id: b7c8be5811db6367b03e50a78813492f2763393311d6328d38ea20ce, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-10-24 18:11:42,318 I 200902 200954] actor_manager.cc:214: received notification on actor, state: ALIVE, actor_id: 97a5c506632456b6ac28fa3101000000, ip address: 10.31.133.85, port: 16615, worker_id: 0b3f3adc815ab0d2f3655927085368e1feeb6f4d51c355b38bacccb3, raylet_id: b7c8be5811db6367b03e50a78813492f2763393311d6328d38ea20ce, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2022-10-24 18:11:42,319 I 200902 200954] direct_actor_task_submitter.cc:229: Connecting to actor 97a5c506632456b6ac28fa3101000000 at worker 0b3f3adc815ab0d2f3655927085368e1feeb6f4d51c355b38bacccb3
[2022-10-24 18:11:42,462 I 200902 200954] actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: 97a5c506632456b6ac28fa3101000000, ip address: 10.31.133.85, port: 16615, worker_id: 0b3f3adc815ab0d2f3655927085368e1feeb6f4d51c355b38bacccb3, raylet_id: b7c8be5811db6367b03e50a78813492f2763393311d6328d38ea20ce, num_restarts: 0, death context type=ActorDiedErrorContext
[2022-10-24 18:11:42,462 I 200902 200954] direct_actor_task_submitter.cc:278: Failing pending tasks for actor 97a5c506632456b6ac28fa3101000000 because the actor is already dead.
[2022-10-24 18:11:43,579 I 200902 200954] core_worker.cc:606: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=INTENDED_SYSTEM_EXIT
[2022-10-24 18:11:43,579 I 200902 200902] core_worker.cc:593: Disconnecting to the raylet.
[2022-10-24 18:11:43,579 I 200902 200902] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_SYSTEM_EXIT, exit_detail=Worker exits because it was idle (it doesn't have objects it owns while no task or actor has been scheduled) for a long time., has creation_task_exception_pb_bytes=0
[2022-10-24 18:11:43,579 I 200902 200902] core_worker.cc:540: Shutting down a core worker.
[2022-10-24 18:11:43,579 I 200902 200902] core_worker.cc:564: Disconnecting a GCS client.
[2022-10-24 18:11:43,579 I 200902 200902] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-10-24 18:11:43,579 I 200902 200954] core_worker.cc:691: Core worker main io service stopped.
[2022-10-24 18:11:43,580 I 200902 200902] core_worker.cc:577: Core worker ready to be deallocated.
[2022-10-24 18:11:43,580 I 200902 200902] core_worker_process.cc:240: Task execution loop terminated. Removing the global worker.
[2022-10-24 18:11:43,580 I 200902 200902] core_worker.cc:531: Core worker is destructed
[2022-10-24 18:11:43,718 I 200902 200902] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 200902
[2022-10-24 18:11:43,722 I 200902 200902] io_service_pool.cc:47: IOServicePool is stopped.