Built an unusable wheel following the docs

Hello guys, have a nice day!

I have a problem building Ray Python wheels. I have tried every method I could think of to solve this problem before asking the community for help. I hope the community can give some suggestions or inspiration.

  1. I built my own Ray wheel on my Ubuntu laptop after modifying some source code in the dashboard module (the job-related functions) in order to add a feature: persisting job info in JobInfoStorageClient.

    I followed python/build-wheel-manylinux2014.sh to build the wheels. I found that the wheels I built are much smaller than the official release. My branch is based on releases/2.0.0. Here are the wheels I built on my laptop:

INFO     .whl/ray-2.0.0.10-cp310-cp310-manylinux2014_x86_64.whl (29.1 MB)                                                             
INFO     .whl/ray-2.0.0.10-cp36-cp36m-manylinux2014_x86_64.whl (29.0 MB)                                                              
INFO     .whl/ray-2.0.0.10-cp37-cp37m-manylinux2014_x86_64.whl (29.1 MB)                                                              
INFO     .whl/ray-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl (29.1 MB)                                                               
INFO     .whl/ray-2.0.0.10-cp39-cp39-manylinux2014_x86_64.whl (29.1 MB)                                                               
INFO     .whl/ray_cpp-2.0.0.10-cp310-cp310-manylinux2014_x86_64.whl (21.3 MB)                                                         
INFO     .whl/ray_cpp-2.0.0.10-cp36-cp36m-manylinux2014_x86_64.whl (21.7 MB)                                                          
INFO     .whl/ray_cpp-2.0.0.10-cp37-cp37m-manylinux2014_x86_64.whl (21.3 MB)                                                          
INFO     .whl/ray_cpp-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl (21.3 MB)                                                           
INFO     .whl/ray_cpp-2.0.0.10-cp39-cp39-manylinux2014_x86_64.whl (21.3 MB)   

However, the official release is much bigger than mine. For example, for Python 3.8 + Linux + Ray 2.0.0, pip reports: Downloading ray-2.0.0-cp38-cp38-manylinux2014_x86_64.whl (59.2 MB)

Is this normal?

  2. When I installed the wheel I built and started Ray with the command ray start --head --dashboard-host 0.0.0.0 --dashboard-port 8265 --block, strange things happened. After about 3 minutes, the process exited. Here is the stdout:
...
Some Ray subprocesses exited unexpectedly:
  dashboard [exit code=255]

Remaining processes will be killed.
  3. I found these logs in session_latest/logs/dashboard.log:
...
2022-12-20 21:59:21,240 INFO http_server_head.py:142 -- Registered 51 routes.
2022-12-20 21:59:21,242 INFO datacenter.py:70 -- Purge data.
2022-12-20 21:59:21,242 INFO event_utils.py:123 -- Monitor events logs modified after 1671542961.0622056 on /tmp/ray/session_2022-12-20_21-59-19_363980_201878/logs/events, the source types are ['GCS'].
2022-12-20 21:59:21,244 INFO usage_stats_head.py:102 -- Usage reporting is enabled.
2022-12-20 21:59:21,244 INFO actor_head.py:111 -- Getting all actor info from GCS.
2022-12-20 21:59:21,246 INFO actor_head.py:137 -- Received 0 actor info from GCS.
2022-12-20 21:59:32,244 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 21:59:48,245 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:04,248 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:20,252 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:36,255 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:52,257 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:08,260 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:24,263 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:40,267 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:56,270 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:02:12,273 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:02:12,273 ERROR head.py:138 -- Dashboard exiting because it received too many GCS RPC errors count: 11, threshold is 10.

And here is session_latest/logs/dashboard_agent.log:

...
2022-12-20 21:59:23,084	INFO event_agent.py:46 -- Report events to 10.9.2.41:34684
2022-12-20 21:59:23,084	INFO event_utils.py:123 -- Monitor events logs modified after 1671542961.9415762 on /tmp/ray/session_2022-12-20_21-59-19_363980_201878/logs/events, the source types are ['COMMON', 'CORE_WORKER', 'RAYLET'].
2022-12-20 22:02:13,087	ERROR reporter_agent.py:809 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 806, in _perform_iteration
    await publisher.publish_resource_usage(self._key, jsonify_asdict(stats))
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 452, in publish_resource_usage
    await self._stub.GcsPublish(req)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1671544933.087241918","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1671544933.087241207","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
2022-12-20 22:02:13,602	ERROR agent.py:217 -- Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] 	NodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms
    [state-dump] 	NodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms
    [state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms
    [state-dump] 	NodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us
    [state-dump] 	NodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms
    [state-dump] 	NodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms
    [state-dump] 	NodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms
    [state-dump] 	NodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms
    [state-dump] 	PeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us
    [state-dump] 	NodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms
    [state-dump] 	AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us
    [state-dump] 	JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us
    [state-dump] 	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 

2022-12-20 22:03:13,680	ERROR utils.py:224 -- Failed to publish error job_id: "\377\377\377\377"
type: "raylet_died"
error_message: "Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:\n    [state-dump] \tNodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms\n    [state-dump] \tNodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms\n    [state-dump] \tRayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms\n    [state-dump] \tNodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us\n    [state-dump] \tNodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms\n    [state-dump] \tNodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms\n    [state-dump] \tNodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms\n    [state-dump] \tNodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms\n    [state-dump] \tPeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms\n    [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us\n    [state-dump] \tNodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms\n    [state-dump] \tAgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us\n    [state-dump] \tJobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us\n    [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s\n    [state-dump] DebugString() time ms: 0\n    [state-dump] \n    [state-dump] \n"
timestamp: 1671544933.6033757
Traceback (most recent call last):
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/utils.py", line 222, in publish_error_to_driver
    gcs_publisher.publish_error(job_id.hex().encode(), error_data)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 169, in publish_error
    self._gcs_publish(req)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 191, in _gcs_publish
    raise TimeoutError(f"Failed to publish after retries: {req}")
TimeoutError: Failed to publish after retries: pub_messages {
  channel_type: RAY_ERROR_INFO_CHANNEL
  key_id: "ffffffff"
  error_info_message {
    job_id: "\377\377\377\377"
    type: "raylet_died"
    error_message: "Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:\n    [state-dump] \tNodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms\n    [state-dump] \tNodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms\n    [state-dump] \tRayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms\n    [state-dump] \tNodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us\n    [state-dump] \tNodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms\n    [state-dump] \tNodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms\n    [state-dump] \tNodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms\n    [state-dump] \tNodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms\n    [state-dump] \tPeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms\n    [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us\n    [state-dump] \tNodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms\n    [state-dump] \tAgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us\n    [state-dump] \tJobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us\n    [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s\n    [state-dump] DebugString() time ms: 0\n    [state-dump] \n    [state-dump] \n"
    timestamp: 1671544933.6033757
  }
}


  4. It looks like the wheel I built broke something in the GCS? To make sure this is not related to the source code I had changed, I followed the same wheel-build workflow and rebuilt directly from the original releases/2.0.0 branch. The same error occurred as above.

Am I doing something wrong? The official documentation only describes how to build and install from source after modifying the code; I can’t find the wheel build-and-release process. Maybe something is wrong in my wheel-build workflow. If you have any idea, please let me know! Thank you!

Have a nice day!

@AndreKuu,

For my understanding, you tried to build Ray using the instructions from this doc page?

https://docs.ray.io/en/latest/ray-contribute/development.html

Do you need to produce a wheel file or can you just build from source locally using pip install -e . --verbose from the python folder?

Can you also try starting Ray without the dashboard to rule it out as an issue?
ray start --include-dashboard=false

My sense is that this is not related to the dashboard, but that the wheel somehow did not build correctly.

Thank you for your response!
Yes, I want to produce a wheel file, not just build from source locally as described in https://docs.ray.io/en/latest/ray-contribute/development.html, but I cannot find documentation about that topic. I followed python/build-wheel-manylinux2014.sh in the repo to build the wheels.

I also think the problem may be in the build stage. From the logs, it seems there is a problem with the dashboard and the GCS health check, so I posted this in the dashboard category.

I tried disabling the dashboard, and the logs made me even more confused:

ray start --head --include-dashboard=false --block                     
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 10.9.2.41

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='10.9.2.41:6379'
  
  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto')
  
  To connect to this Ray runtime from outside of the cluster, for example to
  connect to a remote cluster from your laptop directly, use the following
  Python code:
    import ray
    ray.init(address='ray://<head_node_ip_address>:10001')
  
  To see the status of the cluster, use
    ray status
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection fails, check your firewall settings and network configuration.
  
  To terminate the Ray runtime, run
    ray stop

--block
  This command will now block forever until terminated by a signal.
  Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

Some Ray subprocesses exited unexpectedly:
  dashboard [exit code=255]

Remaining processes will be killed.

I still find the GCS health-check error in /tmp/ray/session_latest/logs/dashboard.log:

2022-12-21 14:40:49,823 ERROR head.py:165 -- Failed to check gcs health, client timed out.

Shouldn’t the dashboard be disabled by --include-dashboard=false?

Can you share the gcs_server.out and gcs_server.err?

The size difference should be okay. The official wheel is built using the manylinux base image instead of Ubuntu; the manylinux image uses an older version of Linux/CentOS.
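
If you want to double-check, the rough sketch below (plain Python zipfile, nothing official) lists the largest files inside each wheel so you can see where the roughly 30 MB difference comes from; the two paths are placeholders, point them at your wheel and the official one:

import zipfile

# Placeholder paths: adjust to your own wheel and the downloaded official wheel.
MY_WHEEL = ".whl/ray-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl"
OFFICIAL_WHEEL = "ray-2.0.0-cp38-cp38-manylinux2014_x86_64.whl"

def largest_members(path, top=10):
    # Return the `top` largest files inside the wheel as (size_bytes, name) pairs.
    with zipfile.ZipFile(path) as whl:
        sizes = [(info.file_size, info.filename) for info in whl.infolist()]
    return sorted(sizes, reverse=True)[:top]

for path in (MY_WHEEL, OFFICIAL_WHEEL):
    print(path)
    for size, name in largest_members(path):
        print(f"  {size / 1e6:8.1f} MB  {name}")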

Hi, I am so happy to get your help! One point I should add: I built the Linux wheels with this command:
docker run -e TRAVIS_COMMIT=ee0fedab01d8371097cd96dd1d223f8c4c380a99 --rm -w /ray -v "$(pwd)":/ray -ti quay.io/pypa/manylinux2014_x86_64 /ray/python/build-wheel-manylinux2014.sh
I only changed the TRAVIS_COMMIT environment variable to my latest commit. And I think the base image is also based on CentOS; here are the base image’s OS details:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

I didn’t find anything valuable in the gcs_server logs, so I did not post them before. gcs_server.err is empty. Here is gcs_server.out:

[2022-12-22 11:55:51,461 I 45363 45363] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-12-22 11:55:51,462 I 45363 45363] (gcs_server) gcs_server.cc:60: GCS storage type is memory
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:108: Loading actor task spec table data.
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:81: Loading placement group table data.
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:48: Finished loading job table data, size = 0
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:60: Finished loading node table data, size = 0
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:72: Finished loading cluster resources table data, size = 0
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:99: Finished loading actor table data, size = 0
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:112: Finished loading actor task spec table data, size = 0
[2022-12-22 11:55:51,463 I 45363 45363] (gcs_server) gcs_init_data.cc:86: Finished loading placement group table data, size = 0
[2022-12-22 11:55:51,467 I 45363 45363] (gcs_server) grpc_server.cc:120: GcsServer server started, listening on port 6379.
[2022-12-22 11:55:51,477 I 45363 45363] (gcs_server) gcs_server.cc:193: GcsNodeManager: 
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GrpcBasedResourceBroadcaster:
- Tracked nodes: 0
[2022-12-22 11:55:51,477 I 45363 45363] (gcs_server) gcs_server.cc:788: Event stats:


Global stats: 13 total (4 active)
Queueing time: mean = 2.233 ms, max = 14.455 ms, min = 2.414 us, total = 29.024 ms
Execution time:  mean = 1.117 ms, total = 14.522 ms
Event stats:
	GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 2.417 ms, total = 14.504 ms
	PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), CPU time: mean = 3.854 us, total = 15.417 us
	RaySyncer.deadline_timer.report_resource_report - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	RayletLoadPulled - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
	GcsInMemoryStore.Put - 1 total (0 active), CPU time: mean = 2.865 us, total = 2.865 us


[2022-12-22 11:55:53,472 I 45363 45363] (gcs_server) gcs_node_manager.cc:42: Registering node info, node id = cc6d749c7003983d080234c6f543dc314db6cd25216b2645acd0f82a, address = 10.9.2.41, node name = 10.9.2.41
[2022-12-22 11:55:53,472 I 45363 45363] (gcs_server) gcs_node_manager.cc:48: Finished registering node info, node id = cc6d749c7003983d080234c6f543dc314db6cd25216b2645acd0f82a, address = 10.9.2.41, node name = 10.9.2.41
[2022-12-22 11:55:53,472 I 45363 45363] (gcs_server) gcs_placement_group_manager.cc:760: A new node: cc6d749c7003983d080234c6f543dc314db6cd25216b2645acd0f82a registered, will try to reschedule all the infeasible placement groups.
[2022-12-22 11:55:53,482 I 45363 45363] (gcs_server) gcs_job_manager.cc:149: Getting all job info.
[2022-12-22 11:55:53,482 I 45363 45363] (gcs_server) gcs_job_manager.cc:155: Finished getting all job info.
[2022-12-22 11:56:51,477 I 45363 45363] (gcs_server) gcs_server.cc:193: GcsNodeManager: 
- RegisterNode request count: 1
- DrainNode request count: 0
- GetAllNodeInfo request count: 30
- GetInternalConfig request count: 1

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 1
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 12

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GrpcBasedResourceBroadcaster:
- Tracked nodes: 1
[2022-12-22 11:56:51,477 I 45363 45363] (gcs_server) gcs_server.cc:788: Event stats:


Global stats: 2844 total (4 active)
Queueing time: mean = 19.851 us, max = 14.455 ms, min = 531.000 ns, total = 56.457 ms
Execution time:  mean = 23.068 us, total = 65.607 ms
Event stats:
	RaySyncer.deadline_timer.report_resource_report - 600 total (1 active), CPU time: mean = 9.296 us, total = 5.577 ms
	ResourceUpdate - 572 total (0 active), CPU time: mean = 14.146 us, total = 8.091 ms
	NodeManagerService.grpc_client.RequestResourceReport - 572 total (0 active), CPU time: mean = 27.023 us, total = 15.457 ms
	NodeManagerService.grpc_client.UpdateResourceUsage - 571 total (0 active), CPU time: mean = 9.686 us, total = 5.531 ms
	GcsInMemoryStore.Put - 97 total (0 active), CPU time: mean = 36.890 us, total = 3.578 ms
	InternalKVGcsService.grpc_server.InternalKVPut - 95 total (0 active), CPU time: mean = 22.607 us, total = 2.148 ms
	GcsInMemoryStore.Get - 67 total (0 active), CPU time: mean = 31.403 us, total = 2.104 ms
	InternalKVGcsService.grpc_server.InternalKVGet - 66 total (0 active), CPU time: mean = 18.598 us, total = 1.227 ms
	RayletLoadPulled - 60 total (1 active), CPU time: mean = 46.787 us, total = 2.807 ms
	NodeManagerService.grpc_client.GetResourceLoad - 58 total (0 active), CPU time: mean = 8.993 us, total = 521.623 us
	NodeInfoGcsService.grpc_server.GetAllNodeInfo - 30 total (0 active), CPU time: mean = 40.488 us, total = 1.215 ms
	HealthCheck - 18 total (0 active), CPU time: mean = 1.747 us, total = 31.441 us
	NodeResourceInfoGcsService.grpc_server.GetAllResourceUsage - 12 total (0 active), CPU time: mean = 61.047 us, total = 732.568 us
	GcsInMemoryStore.GetAll - 7 total (0 active), CPU time: mean = 2.077 ms, total = 14.537 ms
	GCSServer.deadline_timer.debug_state_dump - 6 total (1 active), CPU time: mean = 246.355 us, total = 1.478 ms
	PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 74.406 us, total = 297.626 us
	ActorInfoGcsService.grpc_server.GetAllActorInfo - 1 total (0 active), CPU time: mean = 28.121 us, total = 28.121 us
	NodeInfoGcsService.grpc_server.RegisterNode - 1 total (0 active), CPU time: mean = 95.350 us, total = 95.350 us
	InternalKVGcsService.grpc_server.InternalKVDel - 1 total (0 active), CPU time: mean = 12.308 us, total = 12.308 us
	JobInfoGcsService.grpc_server.GetAllJobInfo - 1 total (0 active), CPU time: mean = 32.206 us, total = 32.206 us
	GCSServer.deadline_timer.debug_state_event_stats_print - 1 total (1 active, 1 running), CPU time: mean = 0.000 s, total = 0.000 s
	GcsInMemoryStore.Delete - 1 total (0 active), CPU time: mean = 18.186 us, total = 18.186 us
	NodeInfoGcsService.grpc_server.GetInternalConfig - 1 total (0 active), CPU time: mean = 14.942 us, total = 14.942 us
	GcsHealthCheckManager::AddNode - 1 total (0 active), CPU time: mean = 5.940 us, total = 5.940 us
	NodeInfoGcsService.grpc_server.CheckAlive - 1 total (0 active), CPU time: mean = 66.211 us, total = 66.211 us


[2022-12-22 11:56:53,356 I 45363 45363] (gcs_server) gcs_job_manager.cc:149: Getting all job info.
[2022-12-22 11:56:53,356 I 45363 45363] (gcs_server) gcs_job_manager.cc:155: Finished getting all job info.
[2022-12-22 11:57:51,478 I 45363 45363] (gcs_server) gcs_server.cc:193: GcsNodeManager: 
- RegisterNode request count: 1
- DrainNode request count: 0
- GetAllNodeInfo request count: 55
- GetInternalConfig request count: 1

GcsActorManager: 
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 1
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0

GcsResourceManager: 
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 24

GcsPlacementGroupManager: 
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GrpcBasedResourceBroadcaster:
- Tracked nodes: 1
[2022-12-22 11:57:51,478 I 45363 45363] (gcs_server) gcs_server.cc:788: Event stats:


Global stats: 5697 total (4 active)
Queueing time: mean = 14.802 us, max = 14.455 ms, min = 531.000 ns, total = 84.324 ms
Execution time:  mean = 20.536 us, total = 116.992 ms
Event stats:
	RaySyncer.deadline_timer.report_resource_report - 1200 total (1 active), CPU time: mean = 9.233 us, total = 11.080 ms
	NodeManagerService.grpc_client.RequestResourceReport - 1163 total (0 active), CPU time: mean = 27.449 us, total = 31.923 ms
	ResourceUpdate - 1163 total (0 active), CPU time: mean = 14.074 us, total = 16.368 ms
	NodeManagerService.grpc_client.UpdateResourceUsage - 1162 total (0 active), CPU time: mean = 9.476 us, total = 11.011 ms
	GcsInMemoryStore.Put - 183 total (0 active), CPU time: mean = 37.067 us, total = 6.783 ms
	InternalKVGcsService.grpc_server.InternalKVPut - 181 total (0 active), CPU time: mean = 24.013 us, total = 4.346 ms
	GcsInMemoryStore.Get - 126 total (0 active), CPU time: mean = 30.756 us, total = 3.875 ms
	InternalKVGcsService.grpc_server.InternalKVGet - 125 total (0 active), CPU time: mean = 18.665 us, total = 2.333 ms
	RayletLoadPulled - 120 total (1 active), CPU time: mean = 47.996 us, total = 5.759 ms
	NodeManagerService.grpc_client.GetResourceLoad - 118 total (0 active), CPU time: mean = 8.690 us, total = 1.025 ms
	NodeInfoGcsService.grpc_server.GetAllNodeInfo - 55 total (0 active), CPU time: mean = 40.147 us, total = 2.208 ms
	HealthCheck - 38 total (0 active), CPU time: mean = 1.694 us, total = 64.385 us
	NodeResourceInfoGcsService.grpc_server.GetAllResourceUsage - 24 total (0 active), CPU time: mean = 63.621 us, total = 1.527 ms
	GCSServer.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 268.578 us, total = 3.223 ms
	GcsInMemoryStore.GetAll - 8 total (0 active), CPU time: mean = 1.820 ms, total = 14.561 ms
	PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 74.406 us, total = 297.626 us
	GCSServer.deadline_timer.debug_state_event_stats_print - 2 total (1 active, 1 running), CPU time: mean = 113.425 us, total = 226.850 us
	GcsInMemoryStore.Keys - 2 total (0 active), CPU time: mean = 17.473 us, total = 34.946 us
	JobInfoGcsService.grpc_server.GetAllJobInfo - 2 total (0 active), CPU time: mean = 41.393 us, total = 82.786 us
	InternalKVGcsService.grpc_server.InternalKVKeys - 2 total (0 active), CPU time: mean = 10.258 us, total = 20.515 us
	NodeInfoGcsService.grpc_server.RegisterNode - 1 total (0 active), CPU time: mean = 95.350 us, total = 95.350 us
	ActorInfoGcsService.grpc_server.GetAllActorInfo - 1 total (0 active), CPU time: mean = 28.121 us, total = 28.121 us
	InternalKVGcsService.grpc_server.InternalKVDel - 1 total (0 active), CPU time: mean = 12.308 us, total = 12.308 us
	GcsInMemoryStore.Delete - 1 total (0 active), CPU time: mean = 18.186 us, total = 18.186 us
	NodeInfoGcsService.grpc_server.GetInternalConfig - 1 total (0 active), CPU time: mean = 14.942 us, total = 14.942 us
	GcsHealthCheckManager::AddNode - 1 total (0 active), CPU time: mean = 5.940 us, total = 5.940 us
	NodeInfoGcsService.grpc_server.CheckAlive - 1 total (0 active), CPU time: mean = 66.211 us, total = 66.211 us


Thank you for your help, simon~

Let me add some more details:

  1. Although the dashboard exits unexpectedly with the logs mentioned above, I can still submit tasks and execute them while the head node is up. I can find the job info in the dashboard web UI, but I cannot get the logs.
  2. Before the main Ray process exited, I could start worker nodes and connect them to the head node, until the dashboard exited about three minutes later as mentioned above.

Hmm, this seems like the GCS server is/was healthy for some time, but the dashboard process was unable to connect to it for the health check. A few things to check:

  • How was the CPU load on your machine? The health check could be delayed there.
  • How about file descriptor limits? Is the client able to pick a port and successfully execute the network request? (A quick check is sketched below.)

I do wonder if this is machine-dependent. Can you try the wheel on a remote Ubuntu server instead of your laptop?
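
For the port/network part, here is a quick check you can run on the head machine. It is just plain Python sockets, nothing Ray-specific; the address below is a placeholder, use your head node IP and GCS port (6379 by default for ray start --head):

import socket

GCS_ADDR = ("127.0.0.1", 6379)  # placeholder: your head node IP and GCS port

def can_reach_gcs(addr, timeout=5.0):
    # Try to open a plain TCP connection to the GCS port.
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError as exc:
        print(f"cannot connect to {addr}: {exc}")
        return False

def can_bind_local_port():
    # Ask the OS for an ephemeral port to verify the client side can bind one.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("0.0.0.0", 0))
            return s.getsockname()[1]
    except OSError as exc:
        print(f"cannot bind a local port: {exc}")
        return None

print("GCS reachable:", can_reach_gcs(GCS_ADDR))
print("ephemeral port picked:", can_bind_local_port())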

Good advice! I installed the wheel on a remote Ubuntu server; however, the same problem occurred :smiling_face_with_tear:

Here is the remote Ubuntu server’s os-release:

NAME="Ubuntu"
VERSION="16.04.7 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.7 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

So I think the machine may not be the cause of the problem. I uninstalled my wheel and installed the official wheel, and it works well. After installing my wheel again, the problem happened again. :smiling_face_with_tear: So I guess it is not caused by CPU load or file descriptors.

Here is the Ubuntu server’s CPU load, which is very low:

Tasks: 579 total,   1 running, 578 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.1 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26404118+total, 16047435+free, 15945292 used, 87621536 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 24548905+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                            
 5087 nobody    20   0  721844  30804  11660 S   9.9  0.0 622:08.34 pushgateway
 5113 nobody    20   0 24.431g 519012 300180 S   2.0  0.2 131:08.61 prometheus
 1840 root      20   0 2616940  55868  28812 S   0.7  0.0  38:45.41 containerd
 5374 guozhec+  20   0 1724964 245804  38516 S   0.7  0.1  23:32.45 mongod
23416 root      20   0 3657076 226408  64364 S   0.7  0.1   0:22.46 gunicorn
31502 qiuhanw+  20   0   51660   4092   3152 R   0.7  0.0   0:00.68 top                                                                                                                
    8 root      20   0       0      0      0 S   0.3  0.0   5:51.65 rcu_sched
 1906 root      20   0 4773800 124836  50780 S   0.3  0.0  19:11.25 dockerd
 5066 guozhec+  20   0  795240  82440  54156 S   0.3  0.0   6:45.05 grafana-server
 5120 65535     20   0  726328  28820  13264 S   0.3  0.0  57:16.44 mongodb_exporte

File descriptor limits:

(ray) qiuhanwen@172-16-10-61[19:48:58]:/tmp/ray/session_latest/logs$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1031259
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1031259
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As for “Is the client able to pick a port and successfully execute the network request?”, I’m not sure exactly what this question means. Do you mean checking the machine or a Ray client? The server can make network requests; I confirmed this with curl.

Thank you, simon, thanks for your support. This looks a little tricky. Maybe you can give me some advice on how the Ray team builds the official wheels. Using the command:
docker run -e TRAVIS_COMMIT=ee0fedab01d8371097cd96dd1d223f8c4c380a99 --rm -w /ray -v "$(pwd)":/ray -ti quay.io/pypa/manylinux2014_x86_64 /ray/python/build-wheel-manylinux2014.sh
Is that all, or does something else need to be done beforehand?

I checked ci.sh in the ci folder and found no other complicated steps related to building wheels. Why can’t I use what I built? :joy: :sweat_smile:

Is there anything special about your commit? I can’t find the commit in our official GitHub repo. Maybe there’s some modification to the source code that might have led to this issue?

:smiley: I’ve thought about this before, so I checked out the official tag branch ray-2.0.0, built the wheel, and installed it, and it caused the same problem.
So I keep wondering which step of the build stage is wrong.
Yes, I forked the project into a private repository; I do not commit to the official GitHub repo directly.
I’m going to switch to another OS, such as macOS, to check whether the same problem occurs.
Thanks, simon.

I wonder whether the Ray team will add a tutorial about building wheels to the related documentation, like this page:
https://docs.ray.io/en/latest/ray-contribute/development.html
I think that might help.
But I think my situation is a little strange: a step that should be relatively simple in theory does not meet expectations. It makes me despair.

Anyway, thank you very much, simon. Thanks for your support. Have a nice day.

Hi simon. Good news: the problem has been solved now. I switched to another remote Ubuntu server to build the wheels, using the same command:
docker run -e TRAVIS_COMMIT=<my_commit_number> --rm -w /ray -v "$(pwd)":/ray -ti quay.io/pypa/manylinux2014_x86_64 /ray/python/build-wheel-manylinux2014.sh
and the problem disappeared. The wheel I built works well. :sweat_smile: I’m full of question marks :rofl:

The two servers are very similar, and since the wheels are built inside a Docker container, I assumed they should not depend on the host environment. But they did. I tried several times to confirm this. And, related or not, I checked the three main ports (8265, 6379, 10001) on the remote server before building the wheels. The base CentOS image is the same: quay.io/pypa/manylinux2014_x86_64:6634873779ef. :rofl:

Here are the two servers’ os-release files:

NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
NAME="Ubuntu"
VERSION="16.04.7 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.7 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

The wheels built on the second server work well, while the wheels built on the first one are unusable.

Strange, huh?
One more detail: the broken wheels cannot be installed with the command
pip3 install .whl/ray-x.y.z-cp3x-cp3x-manylinux2014_x86_64.whl[default], while the working ones can.
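
In case it helps anyone comparing a broken wheel with a working one, here is a rough way (just my own sketch using the standard zipfile module, not an official tool) to print the metadata pip reads from a wheel; the path is a placeholder:

import zipfile

WHEEL = ".whl/ray-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl"  # placeholder path

with zipfile.ZipFile(WHEEL) as whl:
    for name in whl.namelist():
        if name.endswith(".dist-info/METADATA"):
            # Print only the header block (Name, Version, Requires-Dist, ...).
            print(whl.read(name).decode("utf-8").split("\n\n")[0])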

I don’t know whether my experience with these wheels can help the community or other people hitting the same problem, but there are so few reports of this issue that there is little prior experience to draw on. Maybe the community will find the root cause in the future.

Thanks simon~ Have a nice day.

I have to overturn my statement above. The problem did not actually come from the build machine.

I think it is caused by the modification of the version.
In my business scenario, I need to modify some source code to generate a new distribution release of the project. For example, I change line 107 in python/ray/__init__.py to __version__ = "3.0.0.1" and push it to the private repository.

I grepped for the version in the repo and found line 55 in src/ray/common/constants.h:
constexpr char kRayVersion[] = "3.0.0.dev0";
I changed this version to match python/ray/__init__.py and built the wheel again. The problem is solved.

Firstly, to be honest, I’m not sure that this change is enough, i.e. whether the issue is really just that these two versions need to be consistent. Is my modification correct? Is there anything missing?

Secondly, if the issue is really just caused by these two versions differing, how about making a proposal?
Proposal:

  • Is it necessary and correct to modify the build script so that the version variable in the header file src/ray/common/constants.h is kept consistent with the version in python/ray/__init__.py while building (see the sketch below)?
  • Or is the right way to add some related error logs when this issue happens? As it is, the problem could not be located clearly because there was no relevant log.
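
For example, a pre-build consistency check along these lines would have caught my mistake. This is just my own sketch based on the two files quoted above, not something Ray ships:

import re
import sys

# Both paths are relative to the repo root; the patterns match the lines
# quoted above from python/ray/__init__.py and src/ray/common/constants.h.
def python_version(path="python/ray/__init__.py"):
    text = open(path).read()
    return re.search(r'__version__\s*=\s*"([^"]+)"', text).group(1)

def cpp_version(path="src/ray/common/constants.h"):
    text = open(path).read()
    return re.search(r'constexpr char kRayVersion\[\]\s*=\s*"([^"]+)"', text).group(1)

py_ver, cpp_ver = python_version(), cpp_version()
if py_ver != cpp_ver:
    sys.exit(f"version mismatch: __init__.py has {py_ver}, constants.h has {cpp_ver}")
print(f"versions are consistent: {py_ver}")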

I have reopened the issue on GitHub: [Dashboard] Head node exited unexceptly because of dashboard process exited · Issue #31261 · ray-project/ray · GitHub. @simon-mo, I hope to get your opinion.