In an attempt to address Can Tune be configured to not keep paused trials in object store? - #4 by matthew.cox , I’ve tried running on 2.0.0.dev nightly, as it appears that object spilling may address my memory issues. However, when doing so I get the following error the first time an object is spilled:
2021-02-12 17:48:30,412 WARNING worker.py:1109 -- Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 637, in ray._raylet.spill_objects_handler
File "python/ray/_raylet.pyx", line 640, in ray._raylet.spill_objects_handler
File "/opt/conda/lib/python3.8/site-packages/ray/external_storage.py", line 432, in spill_objects
return _external_storage.spill_objects(object_refs, owner_addresses)
File "/opt/conda/lib/python3.8/site-packages/ray/external_storage.py", line 244, in spill_objects
return self._write_multiple_objects(f, object_refs,
File "/opt/conda/lib/python3.8/site-packages/ray/external_storage.py", line 124, in _write_multiple_objects
buf_len = len(buf)
TypeError: object of type 'NoneType' has no len()
An unexpected internal error occurred while the IO worker was spilling objects.
(raylet) [2021-02-12 17:48:30,402 E 371 371] logging.cc:435: *** Aborted at 1613152110 (unix time) try "date -d @1613152110" if you are using GNU date ***
(raylet) [2021-02-12 17:48:30,403 E 371 371] logging.cc:435: PC: @ 0x0 (unknown)
(raylet) [2021-02-12 17:48:30,403 E 371 371] logging.cc:435: *** SIGSEGV (@0x8) received by PID 371 (TID 0x7fccaa19d800) from PID 8; stack trace: ***
(raylet) [2021-02-12 17:48:30,403 E 371 371] logging.cc:435: @ 0x55e8c6e8643f google::(anonymous namespace)::FailureSignalHandler()
(raylet) [2021-02-12 17:48:30,403 E 371 371] logging.cc:435: @ 0x7fcca9d74980 (unknown)
(raylet) [2021-02-12 17:48:30,404 E 371 371] logging.cc:435: @ 0x55e8c71e9e27 google::protobuf::RepeatedPtrField<>::Get()
(raylet) [2021-02-12 17:48:30,405 E 371 371] logging.cc:435: @ 0x55e8c6ab38f5 ray::raylet::LocalObjectManager::AddSpilledUrls()
(raylet) [2021-02-12 17:48:30,405 E 371 371] logging.cc:435: @ 0x55e8c6ab50ee _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc17SpillObjectsReplyEEZZNS0_6raylet18LocalObjectManager20SpillObjectsInternalERKSt6vectorINS0_8ObjectIDESaISC_EESt8functionIFvS3_EEENKUlSt10shared_ptrINS9_15WorkerInterfaceEEE_clESM_EUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
(raylet) [2021-02-12 17:48:30,405 E 371 371] logging.cc:435: @ 0x55e8c6abd1f2 ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet) [2021-02-12 17:48:30,406 E 371 371] logging.cc:435: @ 0x55e8c6acf072 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(raylet) [2021-02-12 17:48:30,408 E 371 371] logging.cc:435: @ 0x55e8c71fc381 boost::asio::detail::scheduler::do_run_one()
(raylet) [2021-02-12 17:48:30,409 E 371 371] logging.cc:435: @ 0x55e8c71fda29 boost::asio::detail::scheduler::run()
(raylet) [2021-02-12 17:48:30,410 E 371 371] logging.cc:435: @ 0x55e8c71fff17 boost::asio::io_context::run()
(raylet) [2021-02-12 17:48:30,410 E 371 371] logging.cc:435: @ 0x55e8c6a2a2a2 main
(raylet) [2021-02-12 17:48:30,411 E 371 371] logging.cc:435: @ 0x7fcca8e4bbf7 __libc_start_main
(raylet) [2021-02-12 17:48:30,412 E 371 371] logging.cc:435: @ 0x55e8c6a3f395 (unknown)
Tail from logs/raylet.out
:
[2021-02-12 17:48:14,667 I 371 381] store.cc:230: Object store current usage 18.8078 / 19.2431 GB.
[2021-02-12 17:48:25,900 I 371 371] node_manager.cc:503: Sending Python GC request to 2 local workers to clean up Python cyclic references.
[2021-02-12 17:48:27,381 I 371 371] local_object_manager.cc:170: Spilled 1379 MiB, 1 objects, write throughput 891 MiB/s
[2021-02-12 17:48:28,934 I 371 371] local_object_manager.cc:170: Spilled 2759 MiB, 2 objects, write throughput 890 MiB/s
[2021-02-12 17:48:30,401 I 371 371] local_object_manager.cc:170: Spilled 4139 MiB, 3 objects, write throughput 906 MiB/s
[2021-02-12 17:48:30,402 E 371 371] logging.cc:435: *** Aborted at 1613152110 (unix time) try "date -d @1613152110" if you are using GNU date ***
[2021-02-12 17:48:30,403 E 371 371] logging.cc:435: PC: @ 0x0 (unknown)
[2021-02-12 17:48:30,403 E 371 371] logging.cc:435: *** SIGSEGV (@0x8) received by PID 371 (TID 0x7fccaa19d800) from PID 8; stack trace: ***
[2021-02-12 17:48:30,403 E 371 371] logging.cc:435: @ 0x55e8c6e8643f google::(anonymous namespace)::FailureSignalHandler()
[2021-02-12 17:48:30,403 E 371 371] logging.cc:435: @ 0x7fcca9d74980 (unknown)
[2021-02-12 17:48:30,404 E 371 371] logging.cc:435: @ 0x55e8c71e9e27 google::protobuf::RepeatedPtrField<>::Get()
[2021-02-12 17:48:30,405 E 371 371] logging.cc:435: @ 0x55e8c6ab38f5 ray::raylet::LocalObjectManager::AddSpilledUrls()
[2021-02-12 17:48:30,405 E 371 371] logging.cc:435: @ 0x55e8c6ab50ee _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc17SpillObjectsReplyEEZZNS0_6raylet18LocalObjectManager20SpillObjectsInternalERKSt6vectorINS0_8ObjectIDES
aISC_EESt8functionIFvS3_EEENKUlSt10shared_ptrINS9_15WorkerInterfaceEEE_clESM_EUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
[2021-02-12 17:48:30,405 E 371 371] logging.cc:435: @ 0x55e8c6abd1f2 ray::rpc::ClientCallImpl<>::OnReplyReceived()
[2021-02-12 17:48:30,406 E 371 371] logging.cc:435: @ 0x55e8c6acf072 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERK
NS_6system10error_codeEm
[2021-02-12 17:48:30,408 E 371 371] logging.cc:435: @ 0x55e8c71fc381 boost::asio::detail::scheduler::do_run_one()
[2021-02-12 17:48:30,409 E 371 371] logging.cc:435: @ 0x55e8c71fda29 boost::asio::detail::scheduler::run()
[2021-02-12 17:48:30,410 E 371 371] logging.cc:435: @ 0x55e8c71fff17 boost::asio::io_context::run()
[2021-02-12 17:48:30,410 E 371 371] logging.cc:435: @ 0x55e8c6a2a2a2 main
[2021-02-12 17:48:30,411 E 371 371] logging.cc:435: @ 0x7fcca8e4bbf7 __libc_start_main
[2021-02-12 17:48:30,412 E 371 371] logging.cc:435: @ 0x55e8c6a3f395 (unknown)
And looking at the ray spilled objects directory:
root@ip-172-31-79-202:/tmp/ray/session_latest# ls -l ray_spilled_objects/
total 4238568
-rw-r--r-- 1 root root 0 Feb 12 17:48 0fb939667ee64409419d38867099fbd841ba16210100000001000000-multi-8
-rw-r--r-- 1 root root 1446757125 Feb 12 17:48 11ac9524461283ce753565f917242c119e81ccf70100000001000000-multi-1
-rw-r--r-- 1 root root 1446757125 Feb 12 17:48 59f3350af995668a764f8475f05bf001226e22c30100000001000000-multi-1
-rw-r--r-- 1 root root 0 Feb 12 17:48 59f76feddb5a7dbcd3ec2e91753e434062462c210100000001000000-multi-5
-rw-r--r-- 1 root root 1446757189 Feb 12 17:48 609d7f556b6757ad480a853c2c4c6f276de42d010100000001000000-multi-1
I looked through current GitHub issues and didn’t see anything relevant. Is this a known issue? If not, is there any further information that I can provide to help diagnose what is going on?