ray.init() without file locks

My ray.init() hangs forever on my cluster, probably because my cluster home directory doesn’t allow file locks.

File ~/.conda/envs/py39/lib/python3.9/site-packages/ray/_private/node.py:242, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    235     self._plasma_store_socket_name = self._prepare_socket_file(
    236         self._ray_params.plasma_store_socket_name, default_prefix="plasma_store"
    237     )
    238     self._raylet_socket_name = self._prepare_socket_file(
    239         self._ray_params.raylet_socket_name, default_prefix="raylet"
    240     )
--> 242     self.metrics_agent_port = self._get_cached_port(
    243         "metrics_agent_port", default_port=ray_params.metrics_agent_port
    244     )
    245     self._metrics_export_port = self._get_cached_port(
    246         "metrics_export_port", default_port=ray_params.metrics_export_port
    247     )
    249     ray_params.update_if_absent(
    250         metrics_agent_port=self.metrics_agent_port,
    251         metrics_export_port=self._metrics_export_port,
    252     )

File ~/.conda/envs/py39/lib/python3.9/site-packages/ray/_private/node.py:801, in Node._get_cached_port(self, port_name, default_port)
    798 # Maps a Node.unique_id to a dict that maps port names to port numbers.
    799 ports_by_node: Dict[str, Dict[str, int]] = defaultdict(dict)
--> 801 with FileLock(file_path + ".lock"):
    802     if not os.path.exists(file_path):
    803         with open(file_path, "w") as f:

File ~/.conda/envs/py39/lib/python3.9/site-packages/filelock/_api.py:220, in BaseFileLock.__enter__(self)
    214 def __enter__(self) -> BaseFileLock:
    215     """
    216     Acquire the lock.
    217 
    218     :return: the lock object
    219     """
--> 220     self.acquire()
    221     return self

File ~/.conda/envs/py39/lib/python3.9/site-packages/filelock/_api.py:187, in BaseFileLock.acquire(self, timeout, poll_interval, poll_intervall, blocking)
    185             msg = "Lock %s not acquired on %s, waiting %s seconds ..."
    186             _LOGGER.debug(msg, lock_id, lock_filename, poll_interval)
--> 187             time.sleep(poll_interval)

Is there a way to call ray.init() without file locks? The metrics port setup seems to be the first thing that tries to take a lock. Maybe there’s a way to start Ray without metrics?

Alternatively, there’s a non-home directory on the cluster that allows file locks. Would the ray.init() tmp_dir flag solve my issue?

Hi @alexlenail, yes, changing the temp directory should solve the problem for you.
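
For anyone landing here later, a rough sketch of what that could look like. The traceback shows the hang inside filelock's acquire loop, so it can help to first check that the new directory really accepts locks before pointing Ray at it. The helper names (supports_file_locks, init_ray_in) and the scratch path are made up for illustration, and the probe assumes a Unix system where the lock is taken with flock(). In ray.init() the keyword is spelled _temp_dir (with a leading underscore), not tmp_dir.

```python
import fcntl
import os
import tempfile


def supports_file_locks(directory: str) -> bool:
    """Return True if a non-blocking flock() succeeds on a scratch file in `directory`.

    Hypothetical helper, Unix only. A missing or unwritable directory, or a
    filesystem that rejects locking, is reported as False.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            fcntl.flock(probe.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.flock(probe.fileno(), fcntl.LOCK_UN)
        return True
    except OSError:
        return False


def init_ray_in(directory: str):
    """Start Ray with its temp/session files under `directory` (hypothetical helper)."""
    import ray  # imported lazily so the probe above can be used without Ray installed

    os.makedirs(directory, exist_ok=True)
    if not supports_file_locks(directory):
        raise RuntimeError(f"{directory} does not appear to support file locks")
    # _temp_dir moves Ray's session directory (sockets, logs, cached port files).
    return ray.init(_temp_dir=directory)


if __name__ == "__main__":
    # Assumed path; replace with the lock-friendly directory on your cluster.
    init_ray_in("/scratch/your_user/ray")
```

If you start the cluster from the command line instead, `ray start --head --temp-dir=<path>` should have the same effect.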