My ray.init() call hangs forever on my cluster, probably because my cluster home directory doesn’t allow file locks.
File ~/.conda/envs/py39/lib/python3.9/site-packages/ray/_private/node.py:242, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
235 self._plasma_store_socket_name = self._prepare_socket_file(
236 self._ray_params.plasma_store_socket_name, default_prefix="plasma_store"
237 )
238 self._raylet_socket_name = self._prepare_socket_file(
239 self._ray_params.raylet_socket_name, default_prefix="raylet"
240 )
--> 242 self.metrics_agent_port = self._get_cached_port(
243 "metrics_agent_port", default_port=ray_params.metrics_agent_port
244 )
245 self._metrics_export_port = self._get_cached_port(
246 "metrics_export_port", default_port=ray_params.metrics_export_port
247 )
249 ray_params.update_if_absent(
250 metrics_agent_port=self.metrics_agent_port,
251 metrics_export_port=self._metrics_export_port,
252 )
File ~/.conda/envs/py39/lib/python3.9/site-packages/ray/_private/node.py:801, in Node._get_cached_port(self, port_name, default_port)
798 # Maps a Node.unique_id to a dict that maps port names to port numbers.
799 ports_by_node: Dict[str, Dict[str, int]] = defaultdict(dict)
--> 801 with FileLock(file_path + ".lock"):
802 if not os.path.exists(file_path):
803 with open(file_path, "w") as f:
File ~/.conda/envs/py39/lib/python3.9/site-packages/filelock/_api.py:220, in BaseFileLock.__enter__(self)
214 def __enter__(self) -> BaseFileLock:
215 """
216 Acquire the lock.
217
218 :return: the lock object
219 """
--> 220 self.acquire()
221 return self
File ~/.conda/envs/py39/lib/python3.9/site-packages/filelock/_api.py:187, in BaseFileLock.acquire(self, timeout, poll_interval, poll_intervall, blocking)
185 msg = "Lock %s not acquired on %s, waiting %s seconds ..."
186 _LOGGER.debug(msg, lock_id, lock_filename, poll_interval)
--> 187 time.sleep(poll_interval)
Is there a way to call ray.init() without file locks? Judging by the traceback, the metrics port setup is the first thing that tries to take a lock. Maybe there’s a way to start Ray without metrics?
Alternatively, there’s a non-home directory on the cluster that allows file locks. Would the ray.init() _temp_dir argument solve my issue?
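For reference, here is a minimal sketch of what I have in mind, assuming the lock file in the traceback is created under Ray's temporary session directory, so that pointing `_temp_dir` at a lock-friendly location moves it there. The helper names `supports_file_locks` and `init_ray_in` are my own, not part of Ray, and the probe uses a non-blocking `fcntl.flock`, which I believe is what filelock relies on under Unix. It may not catch filesystems that accept the call but misbehave later.

```python
import fcntl
import os
import tempfile


def supports_file_locks(directory: str) -> bool:
    """Return True if a non-blocking flock() succeeds on a scratch file in `directory`.

    Hypothetical helper (not part of Ray): a quick probe, not a guarantee.
    """
    try:
        fd, path = tempfile.mkstemp(dir=directory, suffix=".lock")
    except OSError:
        return False  # directory is missing or not writable
    try:
        # LOCK_NB makes the call fail immediately instead of blocking.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        return False  # filesystem rejected the lock request
    finally:
        os.close(fd)
        os.unlink(path)


def init_ray_in(directory: str):
    """Start Ray with its temp/session files under `directory` (hypothetical helper).

    Assumption: the ports cache and its .lock file live under the session
    directory, which Ray creates inside `_temp_dir`.
    """
    import ray  # imported here so the probe above works without Ray installed

    if not supports_file_locks(directory):
        raise RuntimeError(f"{directory} does not appear to support file locks")
    return ray.init(_temp_dir=directory)
```

I would call it as `init_ray_in("/path/to/lockable/dir")`, where the path is a placeholder for the non-home directory mentioned above.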