SIGSEGV segmentation fault when using HadoopFileSystem in RunConfig

Hi there, I’m hitting a SIGSEGV (segmentation fault) when instantiating a RunConfig object with an fsspec Hadoop filesystem. How do I debug this error? Any help is appreciated; please see my details below:

Env:

[root@/app #] ray --version
ray, version 2.8.0

[root@/app #] python --version
Python 3.9.2

[root@/app #] java -version
openjdk version "1.8.0_292"

[root@/app #] hadoop version
Hadoop 2.8.2
Compiled with protoc 2.5.0

Code:

import logging
import os

import fsspec.core
from pyarrow.fs import FSSpecHandler, PyFileSystem
from ray.train import CheckpointConfig, RunConfig

# `log` and `ENV_RESULT_DIR_URL` are defined elsewhere in the module.


def create_run_config(checkpoint_config: CheckpointConfig) -> RunConfig:
    result_dir_url = os.environ[ENV_RESULT_DIR_URL]
    log.info("result_dir_url: %r", result_dir_url)

    # Resolve the hdfs:// URL into an fsspec filesystem plus a plain path.
    storage_filesystem, storage_path = fsspec.core.url_to_fs(result_dir_url)
    log.info("storage_path: %r", storage_path)
    log.info("storage_filesystem [0]: %r", storage_filesystem)

    # Wrap the fsspec filesystem so Ray Train receives a pyarrow FileSystem.
    storage_filesystem = PyFileSystem(FSSpecHandler(storage_filesystem))
    log.info("storage_filesystem [1]: %r", storage_filesystem)

    run_config_options = dict(
        checkpoint_config=checkpoint_config,
        storage_path=storage_path,
        storage_filesystem=storage_filesystem,
    )
    log.info("run_config_options: %r", run_config_options)
    log.info("create run_config ...")

    run_config = RunConfig(**run_config_options)

    log.info("run_config: %r", run_config)

    return run_config
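
For reference, the same filesystem/path pair could in principle be built with pyarrow's native HDFS client instead of going through the fsspec wrapper. A minimal, untested sketch of that alternative, assuming the usual HADOOP_HOME / CLASSPATH environment needed by libhdfs is already in place:

import pyarrow.fs

# Untested alternative: let pyarrow resolve the hdfs:// URL itself rather than
# wrapping the fsspec HadoopFileSystem in PyFileSystem(FSSpecHandler(...)).
storage_filesystem, storage_path = pyarrow.fs.FileSystem.from_uri(result_dir_url)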

Logs:

app.bert_cola.train |     INFO | result_dir_url: 'hdfs:///app/xxx/workspaces/dc632718-6f85-4bed-8b96-109fa5eb70c1/app.bert_cola.train.train.cec122ada5cc44398cae54754a1281df'
app.bert_cola.train |     INFO | storage_path: '/app/xxx/workspaces/dc632718-6f85-4bed-8b96-109fa5eb70c1/app.bert_cola.train.train.cec122ada5cc44398cae54754a1281df'
app.bert_cola.train |     INFO | storage_filesystem [0]: <fsspec.implementations.arrow.HadoopFileSystem object at 0x7f28a3d1dca0>
app.bert_cola.train |     INFO | storage_filesystem [1]: <pyarrow._fs.PyFileSystem object at 0x7f28385f16f0>
app.bert_cola.train |     INFO | run_config_options: {'checkpoint_config': CheckpointConfig(num_to_keep=2, checkpoint_score_attribute='matthews_correlation'), 'storage_path': '/app/xxx/workspaces/dc632718-6f85-4bed-8b96-109fa5eb70c1/app.bert_cola.train.train.cec122ada5cc44398cae54754a1281df', 'storage_filesystem': <pyarrow._fs.PyFileSystem object at 0x7f28385f16f0>}
app.bert_cola.train |     INFO | create run_config ...
*** SIGSEGV received at time=1699909076 on cpu 50 ***
PC: @     0x7f2a88093871  (unknown)  __pyx_f_7pyarrow_3_fs__cb_equals()
    @     0x7f2a914eb140  (unknown)  (unknown)
    @     0x7f2a8fe64f53         96  arrow::py::SafeCallIntoPython<>()
    @     0x7f2a8fe67c85        112  arrow::fs::FileSystem::Equals()
    @     0x7f2a8809cc2a         96  __pyx_pw_7pyarrow_3_fs_10FileSystem_5equals()
    @     0x7f2a8808759a         64  __Pyx_PyObject_CallOneArg()
    @     0x7f2a8808c2b3        144  __pyx_pf_7pyarrow_3_fs_10FileSystem_6__eq__()
    @     0x7f2a8808c61d         48  __pyx_tp_richcompare_7pyarrow_3_fs_FileSystem()
    @           0x5337fb  (unknown)  PyObject_RichCompare
    @           0x906f40  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/usr/local/lib/python3.9/dist-packages/ray/air/config.py", line 78 in _repr_dataclass
  File "/usr/local/lib/python3.9/dist-packages/ray/air/config.py", line 659 in __repr__
  File "/usr/lib/python3.9/logging/__init__.py", line 363 in getMessage
  File "/usr/lib/python3.9/logging/__init__.py", line 659 in format
  File "/usr/lib/python3.9/logging/__init__.py", line 923 in format
  File "/usr/lib/python3.9/logging/__init__.py", line 1079 in emit
  File "/usr/lib/python3.9/logging/__init__.py", line 948 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1657 in callHandlers
  File "/usr/lib/python3.9/logging/__init__.py", line 1595 in handle
  File "/usr/lib/python3.9/logging/__init__.py", line 1585 in _log
  File "/usr/lib/python3.9/logging/__init__.py", line 1442 in info
  File "/app/app/bert_cola/train.py", line 127 in create_run_config
  File "/app/app/bert_cola/train.py", line 36 in train
  File "/app/sdk/decorators.py", line 35 in wrapper
  File "/app/sdk/run_task.py", line 66 in run
  File "/app/sdk/run_task.py", line 79 in main
  File "/app/sdk/run_task.py", line 105 in <module>
  File "/usr/lib/python3.9/runpy.py", line 87 in _run_code
  File "/usr/lib/python3.9/runpy.py", line 197 in _run_module_as_main

Hey @andrii, this should be fixed in a nightly version of Ray and will be included in Ray 2.9. The root cause comes from the repr of the filesystem.

In the meantime, you can manually override the repr with the change in the PR, or do something naive like this:

from ray.train import RunConfig

class MyRunConfig(RunConfig):
    def __repr__(self):
        return object.__repr__(self) 

run_config = MyRunConfig()
print(run_config)
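
For context on why this helps: the Python stack above shows the crash being reached from RunConfig.__repr__ via _repr_dataclass, and the native frames show FileSystem.__eq__ / FileSystem::Equals calling back into the wrapped fsspec handler (__pyx_f_7pyarrow_3_fs__cb_equals). In other words, the dataclass repr triggers an equality check on the stored filesystem, and that check is what segfaults; bypassing the dataclass repr avoids the comparison entirely.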

Hi @matthewdeng, thank you for the quick reply. I tried the MyRunConfig approach and it fixed the log/repr problem; however, I still see a SIGSEGV, only this time it happens at training time.

Code:

from ray.train import CheckpointConfig
from ray.train.torch import TorchTrainer

run_config = create_run_config(checkpoint_config=CheckpointConfig(...))

log.info("create TorchTrainer ...")

trainer = TorchTrainer(
    train_loop_per_worker=train_loop,
    train_loop_config=train_loop_config,
    scaling_config=scaling_config,
    run_config=run_config,
    datasets={
        "train": train_data,
        "validation": validation_data,
    },
)
log.info("TorchTrainer: %r", trainer)
log.info("trainer.fit ...")
result = trainer.fit()
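
One untested sanity check that might narrow this down is to exercise the wrapped filesystem directly before it is handed to TorchTrainer, so that any libhdfs/JVM issue surfaces outside of Ray. A minimal sketch, reusing the storage_filesystem and storage_path values built in create_run_config above:

from pyarrow.fs import FileSelector

# Untested: list the result directory through the wrapped filesystem outside
# of Ray, so a crash here would point at the HDFS/JVM layer rather than Train.
infos = storage_filesystem.get_file_info(FileSelector(storage_path, recursive=False))
for info in infos:
    print(info.path, info.type)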

Log:

app.bert_cola.train |     INFO | result_dir_url: 'hdfs:///app/xxx/workspaces/dc632718-6f85-4bed-8b96-109fa5eb70c1/app.bert_cola.train.train.cec122ada5cc44398cae54754a1281df'
app.bert_cola.train |     INFO | storage_path: '/app/xxx/workspaces/dc632718-6f85-4bed-8b96-109fa5eb70c1/app.bert_cola.train.train.cec122ada5cc44398cae54754a1281df'
app.bert_cola.train |     INFO | storage_filesystem [0]: <fsspec.implementations.arrow.HadoopFileSystem object at 0x7f2b7f8f8e50>
app.bert_cola.train |     INFO | storage_filesystem [1]: <pyarrow._fs.PyFileSystem object at 0x7f2b7f8f2af0>
app.bert_cola.train |     INFO | run_config_options: {'checkpoint_config': CheckpointConfig(num_to_keep=2, checkpoint_score_attribute='matthews_correlation'), 'storage_path': '/app/xxx/workspaces/dc632718-6f85-4bed-8b96-109fa5eb70c1/app.bert_cola.train.train.cec122ada5cc44398cae54754a1281df', 'storage_filesystem': <pyarrow._fs.PyFileSystem object at 0x7f2b7f8f2af0>}
app.bert_cola.train |     INFO | create run_config ...
app.bert_cola.train |     INFO | run_config: <app.bert_cola.train.MyRunConfig object at 0x7f2af86f7b80>
app.bert_cola.train |     INFO | create TorchTrainer ...
2023-11-13 22:11:52,427	INFO plan.py:757 -- Using autodetected parallelism=136 for stage ReadParquet to satisfy parallelism at least twice the available number of CPUs (68).
2023-11-13 22:11:52,428	INFO plan.py:762 -- To satisfy the requested parallelism of 136, each read task output is split into 136 smaller blocks.

(pid=417) Parquet Files Sample 0: 100%|██████████| 1/1 [00:01<00:00,  1.95s/it]

2023-11-13 22:11:52,464	INFO plan.py:757 -- Using autodetected parallelism=136 for stage ReadParquet to satisfy parallelism at least twice the available number of CPUs (68).
2023-11-13 22:11:52,464	INFO plan.py:762 -- To satisfy the requested parallelism of 136, each read task output is split into 136 smaller blocks.
app.bert_cola.train |     INFO | TorchTrainer: <TorchTrainer scaling_config=ScalingConfig(num_workers=2, use_gpu=True) run_config=<app.bert_cola.train.MyRunConfig object at 0x7f2af86f7b80> datasets={'train': Dataset(
   num_blocks=136,
   num_rows=8551,
   schema={
      input_ids: numpy.ndarray(shape=(128,), dtype=int64),
      token_type_ids: numpy.ndarray(shape=(128,), dtype=int64),
      attention_mask: numpy.ndarray(shape=(128,), dtype=int64),
      label: int64
   }
), 'validation': Dataset(
   num_blocks=136,
   num_rows=1043,
   schema={
      input_ids: numpy.ndarray(shape=(128,), dtype=int64),
      token_type_ids: numpy.ndarray(shape=(128,), dtype=int64),
      attention_mask: numpy.ndarray(shape=(128,), dtype=int64),
      label: int64
   }
)}>
app.bert_cola.train |     INFO | trainer.fit ...
2023-11-13 22:11:52,482	INFO tune.py:595 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
*** SIGSEGV received at time=1699913512 on cpu 45 ***
PC: @     0x7f2b7f0f7c13  (unknown)  Monitor::wait()
    @     0x7f2d6d04e140       3408  (unknown)
    @     0x7f2b7ec68882         48  CompileQueue::get()
    @     0x7f2b7ec71f2b        256  CompileBroker::compiler_thread_loop()
    @     0x7f2b7f2b54c7        160  JavaThread::thread_main_inner()
    @     0x7f2b7f2b681a        160  JavaThread::run()
    @     0x7f2b7f144132         64  java_start()
    @     0x7f2d6d042ea7  (unknown)  start_thread
Fatal Python error: Segmentation fault