[Ray Data] Error with read_parquet from HDFS

Hi team,

I am periodically getting an error while doing ray.data.read_parquet from HDFS.
I am working with Ray Train, and this completely fails the job (max_failures in run_config doesn't help).

import ray
from ray.air.config import DatasetConfig, FailureConfig, RunConfig
from ray.train.torch import TorchTrainer

dataset = ray.data.read_parquet("hdfs://...")

trainer = TorchTrainer(
    # train_loop_per_worker, scaling_config, etc. omitted here
    run_config=RunConfig(failure_config=FailureConfig(max_failures=1000)),
    dataset_config={
        "train": DatasetConfig(
            required=True,
            fit=False,
            transform=True,
            split=True,
            max_object_store_memory_fraction=max_object_store_memory_fraction,
            randomize_block_order=True,
        )
    },
    datasets={"train": dataset},
)
     

Any hints?

[2023-04-04 07:21:15,161 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361: PC: @     0x7f15a8181574  (unknown)  ObjectMonitor::enter()
[2023-04-04 07:21:15,161 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f16171a6420       3536  (unknown)
[2023-04-04 07:21:15,161 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7edee0d         80  InterpreterRuntime::monitorenter()
[2023-04-04 07:21:15,164 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce8740d        112  (unknown)
[2023-04-04 07:21:15,170 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce671c1        216  (unknown)
[2023-04-04 07:21:15,175 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce671c1         88  (unknown)
[2023-04-04 07:21:15,180 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce671c1        120  (unknown)
[2023-04-04 07:21:15,186 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce66f20        120  (unknown)
[2023-04-04 07:21:15,191 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce66f20         88  (unknown)
[2023-04-04 07:21:15,229 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce67206        176  (unknown)
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce5f50b        120  (unknown)
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7eeb194        384  JavaCalls::call_helper()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7eec6d7        224  JavaCalls::call_virtual()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7eecc10        160  JavaCalls::call_virtual()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7f87da1        128  thread_entry()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a82fb425        176  JavaThread::thread_main_inner()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a81a3002        864  java_start()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f161719a609  (unknown)  start_thread
Fatal Python error: Segmentation fault

Can you check the logs? There might be a trace of what caused the SEGV.

Can you just try loading the data first (without training) for now?

The stack trace suggests a problem invoking the JVM, so I suspect the read from HDFS is failing.
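
For example, just reading and iterating over the dataset, without any trainer, should exercise only the read path (a rough sketch; the hdfs:// path is a placeholder for your real one):

import ray

ray.init()

# Read the same dataset, but don't attach any trainer to it.
ds = ray.data.read_parquet("hdfs://...")

# Pull batches through so the HDFS read tasks actually execute.
num_rows = 0
for batch in ds.iter_batches(batch_size=4096):
    num_rows += len(batch)
print("read", num_rows, "rows")

If this alone eventually segfaults, we know the trainer is not involved.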

Hi,
Thanks for the quick response.

@Jules_Damji I only see:

[igolant@krylov-devour-user-deploy-94-845f5fc5f8-z8t9r c5939efe-210a-4f1f-b8c1-93557fb81377]$ less repro/stdout-index-0-attempt-1.log

#  SIGSEGV (0xb) at pc=0x00007f2c185572ab, pid=2017, tid=0x00007f2bc0268700
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.54.0.22-SA-linux64) (8.0_292-b10) (build 1.8.0_292-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libpthread.so.0+0x142ab]  raise+0xcb
#
# Core dump written. Default location: ....
#
# An error report file with more information is saved as:
# 
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/

@xwjiang2010
My dataset is too big to fit in the object store, so I use streaming (window) mode.

Got you. How is streaming enabled? I cannot see that from the code you pasted. What Ray version are you using?
Could you try streaming ingest plus training with a dummy trainer (one that basically does nothing other than pass the data through), something like the sketch below? I want to isolate training problems from data ingestion problems.
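
To be concrete, here is roughly what I mean (a sketch against the Ray 2.x AIR API; dummy_train_loop, the worker count, and the 0.2 fraction are made-up placeholders):

import ray
from ray.air import session
from ray.air.config import DatasetConfig, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def dummy_train_loop(config):
    # Pass-through "training": just pull batches from the dataset shard
    # so that only the streaming ingest path is exercised.
    shard = session.get_dataset_shard("train")
    for epoch in range(config.get("num_epochs", 1)):
        for _ in shard.iter_batches(batch_size=4096):
            pass
        session.report({"epoch": epoch})

dataset = ray.data.read_parquet("hdfs://...")

trainer = TorchTrainer(
    dummy_train_loop,
    train_loop_config={"num_epochs": 1},
    scaling_config=ScalingConfig(num_workers=2),  # placeholder worker count
    run_config=RunConfig(failure_config=FailureConfig(max_failures=1000)),
    dataset_config={
        "train": DatasetConfig(
            required=True,
            fit=False,
            transform=True,
            split=True,
            max_object_store_memory_fraction=0.2,  # placeholder; use your current value
            randomize_block_order=True,
        )
    },
    datasets={"train": dataset},
)
trainer.fit()

If this loop alone reproduces the segfault, the problem is in the ingest path rather than in your training code.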

@xwjiang2010

I specify max_object_store_memory_fraction in DatasetConfig; internally it creates a DatasetPipeline.

It fails after several hours (not very predictably). I don't think it is really related to the TorchTrainer I am using.

Basically, my main problem is not the failure itself but that the retry doesn't work.

@chengsu can you help here?

Actually, I am not entirely sure how a SEGV is handled by Ray. Where does this SEGV happen? Does it happen on the head node or on a worker node? Which process causes it?

Hi @xwjiang2010,
It happens on a worker node when it executes the read task.
PyArrow HDFS is based on the libhdfs library, so it tries to create a JVM and fails.
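
To rule out the Ray side, something like this (a sketch using pyarrow's HadoopFileSystem; the parquet path is a placeholder) can be run on a worker node, outside Ray, to check whether JVM creation via libhdfs is flaky on its own:

import pyarrow.parquet as pq
from pyarrow import fs

# HadoopFileSystem goes through libhdfs, which starts an in-process JVM.
# JAVA_HOME, CLASSPATH and ARROW_LIBHDFS_DIR have to be set the same way
# they are inside the Ray worker environment.
hdfs = fs.HadoopFileSystem(host="default")

# Placeholder path: point this at one real parquet file from the dataset.
table = pq.read_table("/path/to/one/file.parquet", filesystem=hdfs)
print(table.num_rows)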

cc @amogkam @sangcho to help out