### What happened + What you expected to happen
Using the PyArrow filesystem API with HDFS works fine outside a Ray session:
```
import pyarrow.fs

# hdfs_folder is an HDFS URI such as "hdfs://<host>:<port>/<path>"
file_sys, file_path = pyarrow.fs.FileSystem.from_uri(hdfs_folder)
file_infos = file_sys.get_file_info(pyarrow.fs.FileSelector(file_path, recursive=False))
```
However, after `ray.init()`, the same code results in a segmentation fault:
```
2023-06-14 01:27:37,622 INFO worker.py:1614 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
*** SIGSEGV received at time=1686731258 on cpu 0 ***
PC: @ 0x7f99d20c5822 (unknown) (unknown)
@ 0x7f996fa6ec85 208 absl::lts_20220623::WriteFailureInfo()
@ 0x7f996fa6e9c8 64 absl::lts_20220623::AbslFailureSignalHandler()
@ 0x7f99e81c6420 3408 (unknown)
@ 0x7f99d1c2782e 48 (unknown)
@ 0x7f99d1c2cc0f 240 (unknown)
@ 0x7f99d2267a5f 144 (unknown)
@ 0x7f99d2267d53 128 (unknown)
@ 0x7f99d21092a0 64 (unknown)
@ 0x7f99e81ba609 (unknown) start_thread
[2023-06-14 01:27:38,591 E 9716 9731] logging.cc:361: *** SIGSEGV received at time=1686731258 on cpu 0 ***
[2023-06-14 01:27:38,591 E 9716 9731] logging.cc:361: PC: @ 0x7f99d20c5822 (unknown) (unknown)
[2023-06-14 01:27:38,591 E 9716 9731] logging.cc:361: @ 0x7f996fa6ec85 208 absl::lts_20220623::WriteFailureInfo()
[2023-06-14 01:27:38,592 E 9716 9731] logging.cc:361: @ 0x7f996fa6e9e1 64 absl::lts_20220623::AbslFailureSignalHandler()
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361: @ 0x7f99e81c6420 3408 (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361: @ 0x7f99d1c2782e 48 (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361: @ 0x7f99d1c2cc0f 240 (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361: @ 0x7f99d2267a5f 144 (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361: @ 0x7f99d2267d53 128 (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361: @ 0x7f99d21092a0 64 (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361: @ 0x7f99e81ba609 (unknown) start_thread
Fatal Python error: Segmentation fault
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f99e81c62ab, pid=9716, tid=0x00007f99baa56700
#
# JRE version: OpenJDK Runtime Environment (8.0_362-b09) (build 1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.362-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libpthread.so.0+0x142ab] raise+0xcb
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /ray/hs_err_pid9716.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
[failure_signal_handler.cc : 332] RAW: Signal 6 raised at PC=0x7f99e800300b while already in AbslFailureSignalHandler()
*** SIGABRT received at time=1686731258 on cpu 0 ***
PC: @ 0x7f99e800300b (unknown) raise
@ 0x7f996fa6ec85 208 absl::lts_20220623::WriteFailureInfo()
@ 0x7f996fa6e9c8 64 absl::lts_20220623::AbslFailureSignalHandler()
@ 0x7f99e81c6420 3952 (unknown)
@ 0x7f99d22c3843 240 (unknown)
@ 0x7f99d211410e 352 JVM_handle_linux_signal
@ 0x7f99d210731c 64 (unknown)
@ 0x7f99e81c6420 10576 (unknown)
@ 0x7f99d1c2782e 48 (unknown)
@ 0x7f99d1c2cc0f 240 (unknown)
@ 0x7f99d2267a5f 144 (unknown)
@ 0x7f99d2267d53 128 (unknown)
@ 0x7f99d21092a0 64 (unknown)
@ 0x7f99e81ba609 (unknown) start_thread
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: *** SIGABRT received at time=1686731258 on cpu 0 ***
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: PC: @ 0x7f99e800300b (unknown) raise
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: @ 0x7f996fa6ec85 208 absl::lts_20220623::WriteFailureInfo()
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: @ 0x7f996fa6e9e1 64 absl::lts_20220623::AbslFailureSignalHandler()
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: @ 0x7f99e81c6420 3952 (unknown)
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: @ 0x7f99d22c3843 240 (unknown)
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: @ 0x7f99d211410e 352 JVM_handle_linux_signal
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: @ 0x7f99d210731c 64 (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361: @ 0x7f99e81c6420 10576 (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361: @ 0x7f99d1c2782e 48 (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361: @ 0x7f99d1c2cc0f 240 (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361: @ 0x7f99d2267a5f 144 (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361: @ 0x7f99d2267d53 128 (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361: @ 0x7f99d21092a0 64 (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361: @ 0x7f99e81ba609 (unknown) start_thread
Fatal Python error: Aborted
```
Here is the fatal error log from the JVM:
[hs_err_pid9716.log](https://github.com/ray-project/ray/files/11743507/h5PTZ.log)
The segfault occurs almost every time, but not always.
It never occurs when Ray is not initialized, so there is likely some interference between the Ray session/global state and the Java/PyArrow/HDFS connection. Notably, the backtraces pass through both `AbslFailureSignalHandler()` and `JVM_handle_linux_signal`, which points at competing signal handlers.
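If that is the cause (the JVM installs its own SIGSEGV handler, which it uses internally for implicit null checks, and a handler installed afterwards may intercept those signals), the JVM's documented signal-chaining mechanism (`libjsig`) might mitigate it. The sketch below is an untested assumption, not a confirmed fix; the `libjsig.so` path assumes the Linux amd64 OpenJDK 8 layout, and `LIBJSIG_PRELOADED` is a hypothetical sentinel variable:
```
import os
import sys

# Untested mitigation sketch: preload the JVM's signal-chaining library so
# that SIGSEGV handlers installed later (e.g. by Ray/absl) chain to the JVM's
# handler instead of replacing it. LD_PRELOAD only takes effect at process
# startup, so re-exec the interpreter once.
if "LIBJSIG_PRELOADED" not in os.environ:  # hypothetical sentinel env var
    java_home = os.environ["JAVA_HOME"]  # assumes JAVA_HOME is set
    # This path matches the Linux amd64 OpenJDK 8 layout.
    libjsig = os.path.join(java_home, "jre", "lib", "amd64", "libjsig.so")
    os.environ["LD_PRELOAD"] = libjsig
    os.environ["LIBJSIG_PRELOADED"] = "1"
    os.execv(sys.executable, [sys.executable] + sys.argv)
```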
### Versions / Dependencies
Ray: latest master; Hadoop: 3.2.4; Java: OpenJDK 1.8.0_362
### Reproduction script
- Install HDFS with `./ci/env/install-hdfs.sh`
- Upload some content to HDFS, e.g. with `/opt/hadoop-3.2.4/bin/hdfs dfs -put /tmp/somewhere hdfs://[host]:8020/somewhere`
- Run this script:
```
import os
import sys

import pyarrow
import pyarrow.fs


def setup_hdfs():
    """Set env vars required by pyarrow to talk to HDFS correctly.

    Returns the hostname and port needed for the HDFS URI.
    """
    # The following file is written by `install-hdfs.sh`.
    with open("/tmp/hdfs_env", "r") as f:
        for line in f.readlines():
            line = line.rstrip("\n")
            tokens = line.split("=", maxsplit=1)
            os.environ[tokens[0]] = tokens[1]
    sys.path.insert(0, os.path.join(os.environ["HADOOP_HOME"], "bin"))
    hostname = os.getenv("CONTAINER_ID")
    port = os.getenv("HDFS_PORT")
    return hostname, port


def get_list_of_files_under_hdfs_folder(hdfs_folder):
    file_sys, file_path = pyarrow.fs.FileSystem.from_uri(hdfs_folder)
    file_infos = file_sys.get_file_info(pyarrow.fs.FileSelector(file_path, recursive=False))
    return file_infos


hostname, port = setup_hdfs()
workspace_dir = f"hdfs://{hostname}:{port}/somewhere"

# from ray.air._internal.remote_storage import upload_to_uri
# upload_to_uri("/tmp/content", workspace_dir)

# Listing works reliably before ray.init(); call it twice to demonstrate.
print(f"Success!, number of files in {workspace_dir}: {len(get_list_of_files_under_hdfs_folder(workspace_dir))}")
print(f"Success!, number of files in {workspace_dir}: {len(get_list_of_files_under_hdfs_folder(workspace_dir))}")

print("Initializing Ray, then getting the number of files again.")
import ray  # imported here on purpose, to match the failing scenario

ray.init()
print("After ray init", len(get_list_of_files_under_hdfs_folder(workspace_dir)))
```
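As an additional isolation check (a hedged suggestion, not part of the original repro): if the crash stems from process-global signal-handler state, running the listing in a freshly spawned child process after `ray.init()` should avoid it, since a spawned child starts with clean handlers. The `workspace_dir` placeholder below stands for the same HDFS URI as in the script above:
```
import multiprocessing as mp


def list_hdfs_files(uri, queue):
    # Runs in a fresh interpreter whose signal handlers Ray never modified.
    import pyarrow.fs
    fs, path = pyarrow.fs.FileSystem.from_uri(uri)
    queue.put(len(fs.get_file_info(pyarrow.fs.FileSelector(path, recursive=False))))


if __name__ == "__main__":
    workspace_dir = "hdfs://<host>:<port>/somewhere"  # same URI as above
    ctx = mp.get_context("spawn")  # "spawn" starts a clean child process
    queue = ctx.Queue()
    proc = ctx.Process(target=list_hdfs_files, args=(workspace_dir, queue))
    proc.start()
    proc.join()
    print("Files listed from a clean child process:", queue.get())
```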
### Issue Severity
High: It blocks me from completing my task.