(raylet) Some workers of the worker process(68497) have not registered within the timeout. The process is still alive, probably it's hanging during start

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Can anyone help me? I ran ray start --head, and it gave me:
image

I ran ray status, it gave me:
image

Then I ran a Python script, and it gave me the error shown at the top of this post. It seems like a Ray problem. Maybe it is related to the number of CPUs? I am using a Slurm cluster and launch the job with the command below:

srun -p caif_dev --gres=gpu:1 -n1 bash test_install.sh

and the test_install.sh is:

ray start --head
ray status
XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/platform/dep/cuda11.2-cudnn8.1.1" python /mnt/cache/zhangyuchang/alpa-project/alpa/tests/test_install.py
ray stop
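
To check whether the CPU count is a factor: in the output posted further down, ray status reports 128 CPUs although the job was launched with -n1. A minimal sketch, assuming a Linux node and that Slurm exports SLURM_CPUS_ON_NODE inside the job step, prints the counts the process actually sees. The function name cpu_counts is hypothetical; the script would be called from test_install.sh before ray start.

```python
import os


def cpu_counts() -> dict:
    """Collect the CPU counts visible to the current process."""
    total = os.cpu_count() or 1  # logical CPUs on the machine
    # CPUs this process may actually run on (Linux only); fall back to total.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    # Assumed to be set by Slurm inside a job step; None outside Slurm.
    slurm = os.environ.get("SLURM_CPUS_ON_NODE")
    return {"total": total, "usable": usable, "slurm_cpus_on_node": slurm}


if __name__ == "__main__":
    for name, value in cpu_counts().items():
        print(f"{name}: {value}")
```

If the usable count turns out to be much smaller than 128, it may be worth trying ray start --head --num-cpus=<n> with that number, so that Ray does not start more workers than the allocation can run. This is only a guess at the cause, not a confirmed fix.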

Hi @zyc-bit, can you check if the mentioned process (68497) is still alive, and get its stack trace with py-spy? The process might have crashed for some reason. You can also look in /tmp/ray/session_latest and try to find the log file with name containing 68497. If there is a log file, it may contain the reason why the worker is having troubles.
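
The two checks above can be sketched in a few lines of Python. This is a minimal sketch, assuming the default log location /tmp/ray/session_latest/logs and that it runs on the same node where ray start was executed (on a Slurm cluster /tmp is usually local to the compute node, so the login node will not show these files). The helper names pid_is_alive and find_worker_logs are hypothetical.

```python
import glob
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists on the local machine."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_worker_logs(pid: int, log_dir: str = "/tmp/ray/session_latest/logs") -> list:
    """List files in log_dir whose file name contains the PID."""
    return sorted(glob.glob(os.path.join(log_dir, f"*{pid}*")))


if __name__ == "__main__":
    pid = 68497  # PID taken from the raylet error message
    print("alive:", pid_is_alive(pid))
    for path in find_worker_logs(pid):
        print(path)
```

For the stack trace itself, py-spy dump --pid 68497 prints the current Python stack of a running process, while py-spy record samples over time and writes a profile to a file.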

Hi @Mingwei, thank you for your reply.
I looked in /tmp/ray/session_latest and found no file whose name contains 68497.

In /tmp/ray/session_latest there is not even a record of the run I did today. I'll post the output from today's run below.
I also tried py-spy as you suggested. I ran py-spy record --pid 68497, and it gave me:
image

I am posting my newest output below.

The error is reported on the lines beginning with (raylet), and after the run there is no record of this session under /tmp/ray/session_latest or /tmp/ray/session_xxx/, even though the output contains lines like:

2022-05-26 11:01:40,158 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --store_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.140.1.35 --maximum_startup_concurrency=128 --static_resource_list=node:10.140.1.35,1.0,CPU,128,GPU,1,memory,411523096781,object_store_memory,180652755763 "--python_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.140.1.35 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62967 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000" "--java_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py java -Dray.address=10.140.1.35:6379 -Dray.raylet.node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER -Dray.object-store.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store -Dray.raylet.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet -Dray.redis.password=5241590000000000 -Dray.node-ip=10.140.1.35 
-Dray.home=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/../.. -Dray.logging.dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs -Dray.session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 -cp /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/jars/* RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER io.ray.runtime.runner.worker.DefaultWorker" --cpp_worker_command= --native_library_path=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --log_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --resource_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --metrics-agent-port=62967 --metrics_export_port=64128 --object_store_memory=180652755763 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.140.1.35:6379 "--agent_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.140.1.35 --metrics-export-port=64128 --dashboard-agent-port=62967 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --runtime-env-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --log-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379"` (via SIGTERM)
(0524alpa) 
# zhangyuchang @ SH-IDC1-10-140-0-32 in /mnt/cache/zhangyuchang/alpa-project-new/alpa on git:main x [10:51:01] 
$ TF_CPP_MIN_LOG_LEVEL=0 XLA_FLAGS="--xla_gpu_cuda_data_dir=/mnt/cache/share/platform/dep/cuda11.2-cudnn8.1.1" srun -p caif_dev --gres=gpu:1 --ntasks-per-node=1 -n1 bash test_install.sh
phoenix-srun: Job 2047274 scheduled successfully!

Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
2022-05-26 10:51:51,826 INFO services.py:1462 -- View the Ray dashboard at http://127.0.0.1:8265
2022-05-26 10:51:28,020 INFO scripts.py:697 -- Local node IP: 10.140.1.35
2022-05-26 10:51:52,889 SUCC scripts.py:739 -- --------------------
2022-05-26 10:51:52,895 SUCC scripts.py:740 -- Ray runtime started.
2022-05-26 10:51:52,895 SUCC scripts.py:741 -- --------------------
2022-05-26 10:51:52,895 INFO scripts.py:743 -- Next steps
2022-05-26 10:51:52,895 INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2022-05-26 10:51:52,895 INFO scripts.py:749 --   ray start --address='10.140.1.35:6379'
2022-05-26 10:51:52,895 INFO scripts.py:752 -- Alternatively, use the following Python code:
2022-05-26 10:51:52,895 INFO scripts.py:754 -- import ray
2022-05-26 10:51:52,895 INFO scripts.py:767 -- ray.init(address='auto')
2022-05-26 10:51:52,895 INFO scripts.py:771 -- To connect to this Ray runtime from outside of the cluster, for example to
2022-05-26 10:51:52,895 INFO scripts.py:775 -- connect to a remote cluster from your laptop directly, use the following
2022-05-26 10:51:52,895 INFO scripts.py:778 -- Python code:
2022-05-26 10:51:52,895 INFO scripts.py:780 -- import ray
2022-05-26 10:51:52,896 INFO scripts.py:786 -- ray.init(address='ray://<head_node_ip_address>:10001')
2022-05-26 10:51:52,896 INFO scripts.py:792 -- If connection fails, check your firewall settings and network configuration.
2022-05-26 10:51:52,896 INFO scripts.py:798 -- To terminate the Ray runtime, run
2022-05-26 10:51:52,896 INFO scripts.py:799 --   ray stop
succeed===========
======== Autoscaler status: 2022-05-26 10:52:00.703027 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_0ca17cc9a4cbd848e7533201da8225eaf45b5e75aa1e695568e6338e
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.0/1.0 GPU
 0.00/383.261 GiB memory
 0.00/168.246 GiB object_store_memory

Demands:
 (no resource demands)
now running python script
2022-05-26 10:52:13.094558: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-05-26 10:52:22.084142: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-05-26 10:53:37.499963: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:174] XLA service 0x555df8aeb380 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2022-05-26 10:53:37.500012: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:182]   StreamExecutor device (0): Interpreter, <undefined>
2022-05-26 10:53:37.571849: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/tfrt_cpu_pjrt_client.cc:176] TfrtCpuClient created.
2022-05-26 10:53:38.277168: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:174] XLA service 0x555df907f1c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-05-26 10:53:38.277227: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:182]   StreamExecutor device (0): A100-SXM-80GB, Compute Capability 8.0
2022-05-26 10:53:38.278839: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/gpu_device.cc:341] Using platform allocator.
2022-05-26 10:53:38.297082: I external/org_tensorflow/tensorflow/stream_executor/tpu/tpu_platform_interface.cc:74] No TPU platform found.
.2022-05-26 10:56:08.136490: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/service.cc:369] Jax service listening on 10.140.1.35:20020

2022-05-26 11:01:14.346240: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/service.cc:381] Jax service shutting down
2022-05-26 11:01:14.354351: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/service.cc:381] Jax service shutting down
(pid=107432) 2022-05-26 10:54:44.976064: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(pid=107432) 2022-05-26 10:54:51.285440: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(raylet) [2022-05-26 10:56:40,707 E 104827 104827] (raylet) worker_pool.cc:518: Some workers of the worker process(109411) have not registered within the timeout. The process is still alive, probably it's hanging during start.
(raylet) [2022-05-26 10:57:11,579 E 104827 104827] (raylet) worker_pool.cc:518: Some workers of the worker process(109736) have not registered within the timeout. The process is still alive, probably it's hanging during start.
(raylet) [2022-05-26 10:57:43,272 E 104827 104827] (raylet) worker_pool.cc:518: Some workers of the worker process(110164) have not registered within the timeout. The process is still alive, probably it's hanging during start.
(pid=110494) 2022-05-26 10:57:53.987877: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(pid=110494) 2022-05-26 10:57:57.747614: I external/org_tensorflow/tensorflow/core/util/util.cc:168] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(MeshHostWorker pid=110494) 2022-05-26 11:01:07.060563: I external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:166] Connect failed() with status: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=110494) 2022-05-26 11:01:07.094402: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/distributed/client.cc:177] Connect() failed after 1 retries in 0; most recent failure status: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=110494) 2022-05-26 11:01:09,339     ERROR worker.py:449 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::MeshHostWorker.__init__() (pid=110494, ip=10.140.1.35, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f88b8089890>)
(MeshHostWorker pid=110494)   File "/mnt/cache/zhangyuchang/alpa-project-new/alpa/alpa/device_mesh.py", line 96, in __init__
(MeshHostWorker pid=110494)     self.distributed_client.connect()
(MeshHostWorker pid=110494) RuntimeError: DEADLINE_EXCEEDED: Connect() timed out after 0 with 1 attempts. Most recent failure was: DEADLINE_EXCEEDED: Deadline Exceeded
(MeshHostWorker pid=110494) E0526 11:01:12.421620596  110604 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E
======================================================================


FAILED (errors=1)
2022-05-26 11:01:40,110 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --store_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.140.1.35 --maximum_startup_concurrency=128 --static_resource_list=node:10.140.1.35,1.0,CPU,128,GPU,1,memory,411523096781,object_store_memory,180652755763 "--python_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.140.1.35 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62967 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000" "--java_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py java -Dray.address=10.140.1.35:6379 -Dray.raylet.node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER -Dray.object-store.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store -Dray.raylet.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet -Dray.redis.password=5241590000000000 -Dray.node-ip=10.140.1.35 
-Dray.home=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/../.. -Dray.logging.dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs -Dray.session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 -cp /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/jars/* RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER io.ray.runtime.runner.worker.DefaultWorker" --cpp_worker_command= --native_library_path=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --log_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --resource_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --metrics-agent-port=62967 --metrics_export_port=64128 --object_store_memory=180652755763 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.140.1.35:6379 "--agent_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.140.1.35 --metrics-export-port=64128 --dashboard-agent-port=62967 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --runtime-env-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --log-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379"` (via SIGTERM)
2022-05-26 11:01:40,111 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --config_list=eyJvYmplY3Rfc3BpbGxpbmdfY29uZmlnIjogIntcInR5cGVcIjogXCJmaWxlc3lzdGVtXCIsIFwicGFyYW1zXCI6IHtcImRpcmVjdG9yeV9wYXRoXCI6IFwiL3RtcC9yYXkvc2Vzc2lvbl8yMDIyLTA1LTI2XzEwLTUxLTI4XzA3ODk1M18xMDQzNjFcIn19IiwgImlzX2V4dGVybmFsX3N0b3JhZ2VfdHlwZV9mcyI6IHRydWV9 --gcs_server_port=6379 --metrics-agent-port=62967 --node-ip-address=10.140.1.35 --redis_password=5241590000000000` (via SIGTERM)
2022-05-26 11:01:40,115 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379 --redis-password=5241590000000000 --monitor-ip=10.140.1.35` (via SIGTERM)
2022-05-26 11:01:40,116 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/_private/log_monitor.py --logs-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --gcs-address=10.140.1.35:6379 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5` (via SIGTERM)
2022-05-26 11:01:40,120 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -m ray.util.client.server --address=10.140.1.35:6379 --host=0.0.0.0 --port=10001 --mode=proxy --redis-password=5241590000000000 --metrics-agent-port=62967` (via SIGTERM)
2022-05-26 11:01:40,129 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --store_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.140.1.35 --maximum_startup_concurrency=128 --static_resource_list=node:10.140.1.35,1.0,CPU,128,GPU,1,memory,411523096781,object_store_memory,180652755763 "--python_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.140.1.35 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62967 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000" "--java_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py java -Dray.address=10.140.1.35:6379 -Dray.raylet.node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER -Dray.object-store.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store -Dray.raylet.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet -Dray.redis.password=5241590000000000 -Dray.node-ip=10.140.1.35 
-Dray.home=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/../.. -Dray.logging.dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs -Dray.session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 -cp /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/jars/* RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER io.ray.runtime.runner.worker.DefaultWorker" --cpp_worker_command= --native_library_path=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --log_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --resource_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --metrics-agent-port=62967 --metrics_export_port=64128 --object_store_memory=180652755763 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.140.1.35:6379 "--agent_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.140.1.35 --metrics-export-port=64128 --dashboard-agent-port=62967 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --runtime-env-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --log-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379"` (via SIGTERM)
2022-05-26 11:01:40,134 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --store_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.140.1.35 --maximum_startup_concurrency=128 --static_resource_list=node:10.140.1.35,1.0,CPU,128,GPU,1,memory,411523096781,object_store_memory,180652755763 "--python_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.140.1.35 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62967 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000" "--java_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py java -Dray.address=10.140.1.35:6379 -Dray.raylet.node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER -Dray.object-store.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store -Dray.raylet.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet -Dray.redis.password=5241590000000000 -Dray.node-ip=10.140.1.35 
-Dray.home=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/../.. -Dray.logging.dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs -Dray.session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 -cp /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/jars/* RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER io.ray.runtime.runner.worker.DefaultWorker" --cpp_worker_command= --native_library_path=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --log_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --resource_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --metrics-agent-port=62967 --metrics_export_port=64128 --object_store_memory=180652755763 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.140.1.35:6379 "--agent_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.140.1.35 --metrics-export-port=64128 --dashboard-agent-port=62967 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --runtime-env-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --log-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379"` (via SIGTERM)
2022-05-26 11:01:40,139 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --store_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.140.1.35 --maximum_startup_concurrency=128 --static_resource_list=node:10.140.1.35,1.0,CPU,128,GPU,1,memory,411523096781,object_store_memory,180652755763 "--python_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.140.1.35 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62967 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000" "--java_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py java -Dray.address=10.140.1.35:6379 -Dray.raylet.node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER -Dray.object-store.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store -Dray.raylet.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet -Dray.redis.password=5241590000000000 -Dray.node-ip=10.140.1.35 
-Dray.home=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/../.. -Dray.logging.dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs -Dray.session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 -cp /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/jars/* RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER io.ray.runtime.runner.worker.DefaultWorker" --cpp_worker_command= --native_library_path=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --log_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --resource_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --metrics-agent-port=62967 --metrics_export_port=64128 --object_store_memory=180652755763 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.140.1.35:6379 "--agent_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.140.1.35 --metrics-export-port=64128 --dashboard-agent-port=62967 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --runtime-env-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --log-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379"` (via SIGTERM)
2022-05-26 11:01:40,145 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/_private/log_monitor.py --logs-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --gcs-address=10.140.1.35:6379 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5` (via SIGTERM)
2022-05-26 11:01:40,153 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/dashboard/dashboard.py --host=localhost --port=8265 --port-retries=0 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379` (via SIGTERM)
2022-05-26 11:01:40,158 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --store_socket_name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.140.1.35 --maximum_startup_concurrency=128 --static_resource_list=node:10.140.1.35,1.0,CPU,128,GPU,1,memory,411523096781,object_store_memory,180652755763 "--python_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.140.1.35 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=62967 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000" "--java_worker_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/workers/setup_worker.py java -Dray.address=10.140.1.35:6379 -Dray.raylet.node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER -Dray.object-store.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store -Dray.raylet.socket-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet -Dray.redis.password=5241590000000000 -Dray.node-ip=10.140.1.35 
-Dray.home=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/../.. -Dray.logging.dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs -Dray.session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 -cp /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/jars/* RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER io.ray.runtime.runner.worker.DefaultWorker" --cpp_worker_command= --native_library_path=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --log_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --resource_dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --metrics-agent-port=62967 --metrics_export_port=64128 --object_store_memory=180652755763 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=10.140.1.35:6379 "--agent_command=/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.140.1.35 --metrics-export-port=64128 --dashboard-agent-port=62967 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --runtime-env-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --log-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379"` (via SIGTERM)
2022-05-26 11:01:40,158 VINFO scripts.py:988 -- Send termination request to `/mnt/lustre/zhangyuchang/.conda/envs/0524alpa/bin/python -u /mnt/lustre/zhangyuchang/.conda/envs/0524alpa/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.140.1.35 --metrics-export-port=64128 --dashboard-agent-port=62967 --listen-port=0 --node-manager-port=39616 --object-store-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361 --runtime-env-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/runtime_resources --log-dir=/tmp/ray/session_2022-05-26_10-51-28_078953_104361/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=10.140.1.35:6379` (via SIGTERM)
2022-05-26 11:01:46,143 SUCC scripts.py:1033 -- Stopped all 7 Ray processes.