How severely does this issue affect your experience of using Ray?
- Low - Medium: It annoys or frustrates me for a moment.
Hello all, I am running a Ray program on a SLURM cluster and am running into the following error:
(_execute_function_on_list pid=149272) *** SIGBUS received at time=1691850113 on cpu 53 ***
(_execute_function_on_list pid=149272) PC: @ 0x2b701c13fe04 (unknown) boost::CV::simple_exception_policy<>::on_error()
(_execute_function_on_list pid=149272) @ 0x2b701459a630 3456 (unknown)
(_execute_function_on_list pid=149272) @ 0x2b701ca31e89 64 grpc_core::ClientAuthFilter::GetCallCredsMetadata()
(_execute_function_on_list pid=149272) @ 0x2b701c9fe8ff 176 grpc_core::Subchannel::HealthWatcherMap::HealthWatcher::Orphan()
(_execute_function_on_list pid=149272) @ 0x2b701c9c6a3f 240 _vpaes_encrypt_core_2x
(_execute_function_on_list pid=149272) @ 0x2b701c6d2961 64 grpc::ClientReader<>::Read()
(_execute_function_on_list pid=149272) @ 0x2b701c2fa55c 224 absl::lts_20220623::container_internal::raw_hash_set<>::destroy_slots()
(_execute_function_on_list pid=149272) @ 0x2b701cb89830 (unknown) google::protobuf::io::Tokenizer::ParseStringAppend()
(_execute_function_on_list pid=149272) @ ... and at least 3 more frames
It hasn’t caused my program to fail, so it isn’t a blocker, but I’m worried that it may be the start of a failure later on. I also noticed in the error logs that there is a lot of object spilling, yet none of the spilled objects (800 GB worth) have been restored. I assume the spilled objects are the return values of my remote function, which I process one at a time using ray.wait.
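For context, here is a minimal sketch of the ray.wait pattern described above, written so that only a bounded number of tasks are in flight at once. It is not the original program: `bounded_in_flight`, `run_with_ray`, `work`, and the `max_in_flight` value are all hypothetical names chosen for illustration. The assumption behind it is that if every task is submitted up front, finished return values wait in the object store until they are consumed, which may contribute to spilling; whether that is related to the SIGBUS above is not known.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def bounded_in_flight(
    inputs: Iterable[Any],
    submit: Callable[[Any], Any],
    wait_one: Callable[[List[Any]], Tuple[List[Any], List[Any]]],
    max_in_flight: int,
) -> Iterator[Any]:
    """Yield finished task handles while keeping at most max_in_flight
    submitted-but-unfinished tasks outstanding."""
    pending: List[Any] = []
    for item in inputs:
        if len(pending) >= max_in_flight:
            # Block until something finishes before submitting more work.
            ready, pending = wait_one(pending)
            yield from ready
        pending.append(submit(item))
    # Drain whatever is still outstanding.
    while pending:
        ready, pending = wait_one(pending)
        yield from ready


def run_with_ray(num_tasks: int = 100, max_in_flight: int = 8) -> int:
    """Hypothetical driver; requires Ray to be installed."""
    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def work(i: int) -> int:
        # Stand-in for the real remote function.
        return i * i

    total = 0
    for ref in bounded_in_flight(
        range(num_tasks),
        submit=work.remote,
        # ray.wait returns (ready_refs, remaining_refs).
        wait_one=lambda refs: ray.wait(refs, num_returns=1),
        max_in_flight=max_in_flight,
    ):
        # Fetch and consume the result; the ref is not kept afterwards.
        total += ray.get(ref)
    return total
```

Calling `run_with_ray()` on a machine with Ray installed runs the loop end to end. The core loop takes `submit` and `wait_one` as parameters so it can be exercised without a cluster; a suitable `max_in_flight` depends on the size of each return value and the object store memory available.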
Any help is appreciated! Thanks in advance!