[Ray Data] error with read_parquet from hdfs

igolant · April 4, 2023, 9:25am

Hi team,

I am periodically getting an error when calling ray.data.read_parquet on HDFS.
I am using Ray Train, and the error fails the job completely (max_failures in run_config doesn't help).

import ray
from ray.air.config import DatasetConfig, FailureConfig, RunConfig
from ray.train.torch import TorchTrainer

dataset = ray.data.read_parquet("hdfs://...")

trainer = TorchTrainer(
    # Retry the run up to 1000 times on failure.
    run_config=RunConfig(failure_config=FailureConfig(max_failures=1000)),
    dataset_config={
        "train": DatasetConfig(
            required=True,
            fit=False,
            transform=True,
            split=True,
            max_object_store_memory_fraction=max_object_store_memory_fraction,
            randomize_block_order=True,
        )
    },
    datasets={"train": dataset},
)

Any hints?

[2023-04-04 07:21:15,161 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361: PC: @     0x7f15a8181574  (unknown)  ObjectMonitor::enter()
[2023-04-04 07:21:15,161 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f16171a6420       3536  (unknown)
[2023-04-04 07:21:15,161 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7edee0d         80  InterpreterRuntime::monitorenter()
[2023-04-04 07:21:15,164 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce8740d        112  (unknown)
[2023-04-04 07:21:15,170 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce671c1        216  (unknown)
[2023-04-04 07:21:15,175 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce671c1         88  (unknown)
[2023-04-04 07:21:15,180 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce671c1        120  (unknown)
[2023-04-04 07:21:15,186 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce66f20        120  (unknown)
[2023-04-04 07:21:15,191 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce66f20         88  (unknown)
[2023-04-04 07:21:15,229 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce67206        176  (unknown)
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f14dce5f50b        120  (unknown)
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7eeb194        384  JavaCalls::call_helper()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7eec6d7        224  JavaCalls::call_virtual()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7eecc10        160  JavaCalls::call_virtual()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a7f87da1        128  thread_entry()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a82fb425        176  JavaThread::thread_main_inner()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f15a81a3002        864  java_start()
[2023-04-04 07:21:15,235 E 1858 2075] (python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff) logging.cc:361:     @     0x7f161719a609  (unknown)  start_thread
Fatal Python error: Segmentation fault

Jules_Damji · April 4, 2023, 4:19pm

Can you check the logs? There might be a trace of what caused the SEGV.

xwjiang2010 · April 4, 2023, 9:55pm

Can you just try loading the data first (without training) for now?

The stack trace suggests a problem invoking the JVM, so I suspect the read from HDFS is failing.

igolant · April 6, 2023, 11:24pm

Hi,
thanks for the quick response.

@Jules_Damji I see only

[igolant@krylov-devour-user-deploy-94-845f5fc5f8-z8t9r c5939efe-210a-4f1f-b8c1-93557fb81377]$ less repro/stdout-index-0-attempt-1.log

#  SIGSEGV (0xb) at pc=0x00007f2c185572ab, pid=2017, tid=0x00007f2bc0268700
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.54.0.22-SA-linux64) (8.0_292-b10) (build 1.8.0_292-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libpthread.so.0+0x142ab]  raise+0xcb
#
# Core dump written. Default location: ....
#
# An error report file with more information is saved as:
# 
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/

@xwjiang2010
My dataset is too big to fit in the object store, so I use streaming (window) mode.

xwjiang2010 · April 7, 2023, 12:13am

Got it. How is streaming enabled? I cannot see that in the code you pasted. Which Ray version are you using?
Could you try streaming ingest + training with a dummy trainer (one that does nothing other than pass the data through)? I want to isolate training problems from data ingestion problems.
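
A minimal sketch of such a pass-through loop, assuming the Ray 2.x AIR API (`ray.air.session`), which may not match the Ray version in use here. `consume_shard` is a hypothetical helper name; it relies only on the shard exposing `iter_batches`.

```python
def consume_shard(shard, batch_size=1024):
    """Iterate over every batch of the shard, discard it, and return the batch count."""
    num_batches = 0
    for _ in shard.iter_batches(batch_size=batch_size):
        num_batches += 1
    return num_batches


def train_loop_per_worker():
    # Imported lazily so this module can be loaded without Ray installed.
    # Assumption: the Ray 2.x AIR session API is available.
    from ray.air import session

    # "train" matches the key used in datasets={"train": dataset}.
    shard = session.get_dataset_shard("train")
    num_batches = consume_shard(shard)
    session.report({"num_batches": num_batches})
```

Passing `train_loop_per_worker` to the trainer with the same `dataset_config` keeps the ingest path unchanged while removing all model code, so a crash with this loop would point at data ingestion rather than training.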

igolant · April 12, 2023, 9:48pm

@xwjiang2010

I specify max_object_store_memory_fraction in DatasetConfig; internally it creates a dataset pipeline.

It fails after several hours (not very predictably), so I don't think it is really related to the TorchTrainer that I am using.

Basically, my main problem is not the failure itself but that retry doesn't work.
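
If the process that dies is the one that owns the retry logic, `max_failures` never gets a chance to run. One possible workaround, not something proposed in this thread, is an outer supervisor that relaunches the whole training script when it is killed by a signal. This is only a sketch; `run_with_restarts` is a hypothetical helper name.

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=3, backoff_s=0.0):
    """Run cmd, relaunching it when it is killed by a signal.

    Returns (returncode, attempts). On POSIX, subprocess reports death by
    signal N as returncode -N (for example -11 for SIGSEGV).
    """
    attempts = 0
    while True:
        attempts += 1
        returncode = subprocess.run(cmd).returncode
        killed_by_signal = returncode < 0
        # Stop on a normal exit (success or ordinary error) or when out of restarts.
        if not killed_by_signal or attempts > max_restarts:
            return returncode, attempts
        time.sleep(backoff_s)
```

For example, `run_with_restarts([sys.executable, "train.py"], max_restarts=5, backoff_s=30)` would rerun a hypothetical `train.py` after a segfault. Ordinary non-zero exits are deliberately not retried, so Python exceptions still surface immediately. Resuming from a checkpoint rather than starting over is up to the script itself.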

xwjiang2010 · April 12, 2023, 10:10pm

@chengsu can you help here?

xwjiang2010 · April 13, 2023, 7:52pm

Actually, I am not entirely sure how a SEGV is handled by Ray. Where does this SEGV happen: on the head node or on worker nodes? Which process causes it?
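
The log lines in the first post may already answer part of this. A rough sketch for pulling the process details out of them, assuming the prefix layout is `[timestamp severity pid tid] (component) file:line: message` (inferred from the pasted lines, not from documentation); `parse_ray_log_line` is a hypothetical helper name.

```python
import re

# Assumed layout: [timestamp severity pid tid] (component) file:line: message
RAY_LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+?) (?P<severity>[A-Z]) (?P<pid>\d+) (?P<tid>\d+)\] "
    r"\((?P<component>[^)]+)\) (?P<source>\S+): (?P<message>.*)$"
)


def parse_ray_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = RAY_LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["tid"] = int(fields["tid"])
    # The component name carries the process role, e.g. "python-core-driver-<id>".
    fields["is_driver"] = fields["component"].startswith("python-core-driver")
    return fields
```

Applied to the pasted trace, this gives pid 1858, thread 2075, and a component name starting with `python-core-driver`, which can then be matched against the processes running on each node.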

igolant · April 13, 2023, 8:40pm

Hi @xwjiang2010,
it happens on a worker node when it executes a read_task.
pyarrow's HDFS support is based on the libhdfs library, so it tries to create a JVM and fails.
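
Since libhdfs loads a JVM inside the Python process, it depends on the Java and Hadoop environment of each worker node. A small sketch for checking that environment, assuming the variables PyArrow documents for its HDFS filesystem (`JAVA_HOME`, `HADOOP_HOME`, `CLASSPATH`, and optionally `ARROW_LIBHDFS_DIR`); the helper names are hypothetical.

```python
import os

# Variables PyArrow's HDFS support expects; ARROW_LIBHDFS_DIR is optional.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "CLASSPATH")


def missing_libhdfs_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def find_libhdfs(env=None):
    """Return the path of libhdfs.so if found in the usual locations, else None."""
    env = os.environ if env is None else env
    candidates = []
    if env.get("ARROW_LIBHDFS_DIR"):
        candidates.append(env["ARROW_LIBHDFS_DIR"])
    if env.get("HADOOP_HOME"):
        candidates.append(os.path.join(env["HADOOP_HOME"], "lib", "native"))
    for directory in candidates:
        path = os.path.join(directory, "libhdfs.so")
        if os.path.isfile(path):
            return path
    return None
```

A misconfigured environment would not by itself explain an intermittent crash after several hours, but running these checks on every worker node can rule out a node whose setup differs from the others.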

xwjiang2010 · April 13, 2023, 9:16pm

cc @amogkam @sangcho to help out
