Raylet worker doesn't respect RAY_TMPDIR

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: v2.44.1
  • Python version: 3.12
  • OS: CentOS 8.4
  • Cloud/Infrastructure: On-premises
  • Other libs/tools (if relevant): No

3. What happened vs. what you expected:

  • Expected:
    • Running env RAY_TMPDIR=/nfs/dev/${hostname} ray start *** on a worker node should make that node use its own RAY_TMPDIR.
  • Actual:
    • With the command above, the worker node uses the head’s RAY_TMPDIR instead of its own. RAY_TMPDIR points to an NFS path that is different for each compute node. If all worker nodes share the same temp dir, /nfs/DEV/PLT/zpeng/ray/tmp/twdev2/ray/session_latest/node_ip_address.json will be overwritten.

I checked some of the code: for a non-head worker, it queries the GCS for the temp dir. What’s the reason for this?

def _init_temp(self):
    # Create a dictionary to store temp file index.
    self._incremental_dict = collections.defaultdict(lambda: 0)

    if self.head:
        self._ray_params.update_if_absent(
            temp_dir=ray._private.utils.get_ray_temp_dir()
        )
        self._temp_dir = self._ray_params.temp_dir
    else:
        if self._ray_params.temp_dir is None:
            assert not self._default_worker
            temp_dir = ray._private.utils.internal_kv_get_with_retry(
                self.get_gcs_client(),
                "temp_dir",
                ray_constants.KV_NAMESPACE_SESSION,
                num_retries=ray_constants.NUM_REDIS_GET_RETRIES,
            )
            self._temp_dir = ray._private.utils.decode(temp_dir)
        else:
            self._temp_dir = self._ray_params.temp_dir
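
To make the behavior I'm describing concrete, here is a minimal standalone sketch of the branch structure above. It is a toy model, not Ray's code: resolve_temp_dir, the plain dict standing in for the GCS key-value store, and the host names head and worker1 are all hypothetical. It also assumes that the head's default temp dir is RAY_TMPDIR with a trailing ray component (which matches the path in my report) and that the head publishes its choice under the temp_dir key.

```python
import posixpath


def resolve_temp_dir(is_head, explicit_temp_dir, env, gcs_kv):
    """Toy model of the branches in _init_temp (hypothetical helper, not Ray's API)."""
    if is_head:
        if explicit_temp_dir is None:
            # Assumption: the head's default is derived from RAY_TMPDIR.
            explicit_temp_dir = posixpath.join(env.get("RAY_TMPDIR", "/tmp"), "ray")
        # Assumption: the head stores its temp dir where workers can look it up.
        gcs_kv["temp_dir"] = explicit_temp_dir
        return explicit_temp_dir
    if explicit_temp_dir is None:
        # Non-head branch: the local environment is never consulted,
        # the value comes from the shared store instead.
        return gcs_kv["temp_dir"]
    return explicit_temp_dir


gcs_kv = {}
head_dir = resolve_temp_dir(True, None, {"RAY_TMPDIR": "/nfs/dev/head"}, gcs_kv)
worker_dir = resolve_temp_dir(False, None, {"RAY_TMPDIR": "/nfs/dev/worker1"}, gcs_kv)

print(head_dir)
print(worker_dir)
```

In this model the worker ends up with the same directory as the head even though its own RAY_TMPDIR is different, because the non-head branch only looks at an explicitly passed temp_dir and otherwise falls back to the stored value.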