SIGBUS Error on SLURM

How severely does this issue affect your experience of using Ray?

  • Low - Medium: It annoys or frustrates me for a moment.

Hello all, I am running a Ray program on a SLURM cluster and am running into the following error:

(_execute_function_on_list pid=149272) *** SIGBUS received at time=1691850113 on cpu 53 ***
(_execute_function_on_list pid=149272) PC: @     0x2b701c13fe04  (unknown)  boost::CV::simple_exception_policy<>::on_error()
(_execute_function_on_list pid=149272)     @     0x2b701459a630       3456  (unknown)
(_execute_function_on_list pid=149272)     @     0x2b701ca31e89         64  grpc_core::ClientAuthFilter::GetCallCredsMetadata()
(_execute_function_on_list pid=149272)     @     0x2b701c9fe8ff        176  grpc_core::Subchannel::HealthWatcherMap::HealthWatcher::Orphan()
(_execute_function_on_list pid=149272)     @     0x2b701c9c6a3f        240  _vpaes_encrypt_core_2x
(_execute_function_on_list pid=149272)     @     0x2b701c6d2961         64  grpc::ClientReader<>::Read()
(_execute_function_on_list pid=149272)     @     0x2b701c2fa55c        224  absl::lts_20220623::container_internal::raw_hash_set<>::destroy_slots()
(_execute_function_on_list pid=149272)     @     0x2b701cb89830  (unknown)  google::protobuf::io::Tokenizer::ParseStringAppend()
(_execute_function_on_list pid=149272)     @ ... and at least 3 more frames

It hasn’t caused my program to fail, so it isn’t a blocker, but I’m worried that it may be an early sign of a failure later on. One other thing I noticed in the error logs is that a lot of object spilling is happening, but none of the spilled objects (800 GB worth) have been restored. I assume the spilled objects are the return values of my remote function, which I process one at a time using ray.wait.

Any help is appreciated! Thanks in advance!