Hello guys, have a nice day!
I have a problem building Ray Python wheels. I have tried every method I could think of before asking the community for help. I hope the community can offer some suggestions or inspiration.
- I built my own Ray wheel on my Ubuntu laptop after modifying some source code in the dashboard module (the job-related functions) to add a feature: persisting job info in JobInfoStorageClient.
I followed the docs and used `python/build-wheel-manylinux2014.sh` to build the wheels. The wheels I build are much smaller than the official release. My branch is based on `releases/2.0.0`. Here are the wheels I built on my laptop:
INFO .whl/ray-2.0.0.10-cp310-cp310-manylinux2014_x86_64.whl (29.1 MB)
INFO .whl/ray-2.0.0.10-cp36-cp36m-manylinux2014_x86_64.whl (29.0 MB)
INFO .whl/ray-2.0.0.10-cp37-cp37m-manylinux2014_x86_64.whl (29.1 MB)
INFO .whl/ray-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl (29.1 MB)
INFO .whl/ray-2.0.0.10-cp39-cp39-manylinux2014_x86_64.whl (29.1 MB)
INFO .whl/ray_cpp-2.0.0.10-cp310-cp310-manylinux2014_x86_64.whl (21.3 MB)
INFO .whl/ray_cpp-2.0.0.10-cp36-cp36m-manylinux2014_x86_64.whl (21.7 MB)
INFO .whl/ray_cpp-2.0.0.10-cp37-cp37m-manylinux2014_x86_64.whl (21.3 MB)
INFO .whl/ray_cpp-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl (21.3 MB)
INFO .whl/ray_cpp-2.0.0.10-cp39-cp39-manylinux2014_x86_64.whl (21.3 MB)
However, the official release is much bigger than mine. For example, for Python 3.8 + Linux, ray 2.0.0 is `Downloading ray-2.0.0-cp38-cp38-manylinux2014_x86_64.whl (59.2 MB)`.
Is this normal?
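To see which files account for the ~30 MB gap, the two wheels can be diffed as zip archives. Here is a rough sketch (the two tiny stand-in archives at the bottom are just a demo; point the paths at the real `.whl` files to compare them):

```python
import zipfile

def wheel_sizes(path):
    """Map each file inside a wheel (a zip archive) to its uncompressed size."""
    with zipfile.ZipFile(path) as whl:
        return {info.filename: info.file_size for info in whl.infolist()}

def diff_wheels(mine_path, official_path):
    """Return files that are missing or smaller in `mine` vs the official wheel."""
    mine, official = wheel_sizes(mine_path), wheel_sizes(official_path)
    return {
        name: (mine.get(name, 0), size)
        for name, size in official.items()
        if mine.get(name, 0) < size
    }

# Demo with two tiny stand-in archives; replace these paths with the
# real wheels to see which binaries were stripped or left out.
with zipfile.ZipFile("mine.whl", "w") as z:
    z.writestr("ray/_raylet.so", b"x" * 10)
with zipfile.ZipFile("official.whl", "w") as z:
    z.writestr("ray/_raylet.so", b"x" * 100)
    z.writestr("ray/core/gcs_server", b"y" * 50)

for name, (mine_sz, off_sz) in diff_wheels("mine.whl", "official.whl").items():
    print(f"{name}: mine {mine_sz} B vs official {off_sz} B")
```

If large native binaries such as the GCS server executable turn out much smaller (or missing) in the self-built wheel, that would also explain the dashboard/GCS failures below.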
- When I install the wheel I built and run `ray start --head --dashboard-host 0.0.0.0 --dashboard-port 8265 --block`, strange things happen. After about 3 minutes, the process exits. Here is the stdout:
...
Some Ray subprocesses exited unexpectedly:
dashboard [exit code=255]
Remaining processes will be killed.
- I found these logs in `session_latest/logs/dashboard.log`:
...
2022-12-20 21:59:21,240 INFO http_server_head.py:142 -- Registered 51 routes.
2022-12-20 21:59:21,242 INFO datacenter.py:70 -- Purge data.
2022-12-20 21:59:21,242 INFO event_utils.py:123 -- Monitor events logs modified after 1671542961.0622056 on /tmp/ray/session_2022-12-20_21-59-19_363980_201878/logs/events, the source types are ['GCS'].
2022-12-20 21:59:21,244 INFO usage_stats_head.py:102 -- Usage reporting is enabled.
2022-12-20 21:59:21,244 INFO actor_head.py:111 -- Getting all actor info from GCS.
2022-12-20 21:59:21,246 INFO actor_head.py:137 -- Received 0 actor info from GCS.
2022-12-20 21:59:32,244 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 21:59:48,245 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:04,248 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:20,252 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:36,255 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:52,257 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:08,260 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:24,263 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:40,267 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:56,270 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:02:12,273 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:02:12,273 ERROR head.py:138 -- Dashboard exiting because it received too many GCS RPC errors count: 11, threshold is 10.
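As a quick sanity check on these timeouts, something like the sketch below tells whether the GCS port accepts TCP connections at all (assumption: 6379 is the default `--port` for `ray start --head` in 2.0; substitute your own value if you changed it):

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 6379 is assumed to be the head node's GCS port (the ray start --port default).
print("GCS port reachable:", port_reachable("127.0.0.1", 6379))
```

If this prints `False` while the head node is supposedly up, the GCS server process itself died or never started, which would point back at the wheel's bundled binaries rather than at the dashboard.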
And these in `session_latest/logs/dashboard_agent.log`:
...
2022-12-20 21:59:23,084 INFO event_agent.py:46 -- Report events to 10.9.2.41:34684
2022-12-20 21:59:23,084 INFO event_utils.py:123 -- Monitor events logs modified after 1671542961.9415762 on /tmp/ray/session_2022-12-20_21-59-19_363980_201878/logs/events, the source types are ['COMMON', 'CORE_WORKER', 'RAYLET'].
2022-12-20 22:02:13,087 ERROR reporter_agent.py:809 -- Error publishing node physical stats.
Traceback (most recent call last):
File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 806, in _perform_iteration
await publisher.publish_resource_usage(self._key, jsonify_asdict(stats))
File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 452, in publish_resource_usage
await self._stub.GcsPublish(req)
File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1671544933.087241918","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1671544933.087241207","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
2022-12-20 22:02:13,602 ERROR agent.py:217 -- Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[state-dump] NodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms
[state-dump] NodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms
[state-dump] NodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us
[state-dump] NodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms
[state-dump] NodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms
[state-dump] NodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms
[state-dump] NodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms
[state-dump] PeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us
[state-dump] NodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms
[state-dump] AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us
[state-dump] JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us
[state-dump] NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 0
[state-dump]
[state-dump]
2022-12-20 22:03:13,680 ERROR utils.py:224 -- Failed to publish error job_id: "\377\377\377\377"
type: "raylet_died"
error_message: "Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:\n [state-dump] \tNodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms\n [state-dump] \tNodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms\n [state-dump] \tRayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms\n [state-dump] \tNodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us\n [state-dump] \tNodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms\n [state-dump] \tNodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms\n [state-dump] \tNodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms\n [state-dump] \tNodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms\n [state-dump] \tPeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms\n [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us\n [state-dump] \tNodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us\n [state-dump] \tNodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms\n [state-dump] 
\tAgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us\n [state-dump] \tNodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us\n [state-dump] \tJobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us\n [state-dump] \tNodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us\n [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s\n [state-dump] DebugString() time ms: 0\n [state-dump] \n [state-dump] \n"
timestamp: 1671544933.6033757
Traceback (most recent call last):
File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/utils.py", line 222, in publish_error_to_driver
gcs_publisher.publish_error(job_id.hex().encode(), error_data)
File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 169, in publish_error
self._gcs_publish(req)
File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 191, in _gcs_publish
raise TimeoutError(f"Failed to publish after retries: {req}")
TimeoutError: Failed to publish after retries: pub_messages {
channel_type: RAY_ERROR_INFO_CHANNEL
key_id: "ffffffff"
error_info_message {
job_id: "\377\377\377\377"
type: "raylet_died"
error_message: "Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:\n [state-dump] \tNodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms\n [state-dump] \tNodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms\n [state-dump] \tRayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms\n [state-dump] \tNodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us\n [state-dump] \tNodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms\n [state-dump] \tNodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms\n [state-dump] \tNodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms\n [state-dump] \tNodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms\n [state-dump] \tPeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms\n [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us\n [state-dump] \tNodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us\n [state-dump] \tNodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms\n [state-dump] 
\tAgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us\n [state-dump] \tNodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us\n [state-dump] \tJobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us\n [state-dump] \tNodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us\n [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s\n [state-dump] DebugString() time ms: 0\n [state-dump] \n [state-dump] \n"
timestamp: 1671544933.6033757
}
}
- It looks like the wheel I built causes something to go wrong in the GCS? To make sure this is not related to the source code I changed, I followed the same wheel-build workflow to rebuild a wheel directly from the unmodified `releases/2.0.0` branch. The same error occurred as above.
Am I doing something wrong? In the official documentation I can only find how to build and install after modifying the source code; I can't find the build-and-release process used for the official wheels. Maybe something is wrong in my wheel-build workflow. If you have any ideas, please let me know! Thank you!
Have a nice day!