Job working dir empty

jleben · January 28, 2025, 7:16pm

Hello!

I have an issue where the working dir specified in ray job submit --working-dir is not transmitted to the cluster. Interestingly, if using a subdirectory of the same directory instead, it is transmitted successfully.

So, this works fine (note that the output contains the working dir printed by pwd and the filename test.py listed by ls):

ray job submit --working-dir /code/ray-test -- bash -c 'pwd; ls'

...
Tailing logs until the job exits (disable with --no-wait):
2025-01-28 19:10:16,287	INFO job_manager.py:530 -- Runtime env is setting up.
/tmp/ray/session_2025-01-28_18-55-45_577283_70/runtime_resources/working_dir_files/_ray_pkg_879177c81ec48c5a
test.py

------------------------------------------
Job 'raysubmit_uHi6hwrMLtNn19RB' succeeded
------------------------------------------

But here ls lists no files:

ray job submit --working-dir /code -- bash -c 'pwd; ls'

...
Tailing logs until the job exits (disable with --no-wait):
2025-01-28 19:12:17,840	INFO job_manager.py:530 -- Runtime env is setting up.
/tmp/ray/session_2025-01-28_18-55-45_577283_70/runtime_resources/working_dir_files/_ray_pkg_3030303030303030

------------------------------------------
Job 'raysubmit_zAduUSWTJVs35DHh' succeeded
------------------------------------------

I should add that I am running ray job submit from a Docker container on a local machine and /code is a directory mounted into that container from the local machine.

Does anyone know why /code won’t sync to the cluster and how to work around this?

christina · January 28, 2025, 8:40pm

Hey there, welcome back!
My first question is… how big is the /code directory? According to the documentation,

working_dir (str): Specifies the working directory for the Ray workers. This must either be (1) a local existing directory with total size at most 100 MiB, (2) a local existing zipped file with total unzipped size at most 100 MiB (Note: excludes has no effect), or (3) a URI to a remotely-stored zip file containing the working directory for your job (no file size limit is enforced by Ray). See Remote URIs for details. The specified directory will be downloaded to each node on the cluster, and Ray workers will be started in their node’s copy of this directory.

If /code exceeds 100 MiB that might be why it’s not syncing?
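One quick way to check this is to total up the file sizes under the directory before submitting. Below is a minimal sketch using only the Python standard library. The helper names dir_size_bytes and exceeds_limit are made up for illustration and are not part of Ray, and the 100 MiB figure is simply the limit quoted above. Note that this counts every file, including ones a .gitignore or excludes entry would skip, so it gives an upper bound on what would be uploaded.

```python
import os

# 100 MiB, taken from the documentation excerpt quoted above (assumption:
# the limit that applies to your Ray version may differ).
LIMIT_BYTES = 100 * 1024 * 1024


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def exceeds_limit(root: str, limit: int = LIMIT_BYTES) -> bool:
    """Return True if the directory's total file size is above the limit."""
    return dir_size_bytes(root) > limit
```

For example, exceeds_limit("/code") would tell you whether the whole tree is over the quoted limit.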


jleben · January 31, 2025, 10:53pm

I’ve got to the bottom of this, and I think there is a bug in the way Ray is interpreting .gitignore rules, specifically the .* rule.

Consider this set of files:

root@ac4b19df7aa1:/code/ray_test# ls -lhA
total 111M
-rw-r--r-- 1 root root   17 Jan 31 22:46 .gitignore
-rw-r--r-- 1 root root 110M Jan 31 21:48 big.ignored
-rw-r--r-- 1 root root    3 Jan 31 21:50 small
-rw-r--r-- 1 root root    2 Jan 31 21:52 small.ignored

With nothing in .gitignore, I get a proper error from Ray that the directory is too large (as expected):

root@7e3eba0d0c82:/code/infrastructure/ray# ray job submit --working-dir /code/ray_test/ -- bash -c 'pwd; ls'
Job submission server address: http://10.212.131.202:8265
2025-01-31 22:48:43,524	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_3358a1403f9393f2.zip.
2025-01-31 22:48:43,525	INFO packaging.py:574 -- Creating a file package for local module '/code/ray_test/'.
2025-01-31 22:48:43,525	WARNING packaging.py:416 -- File /code/ray_test/big.ignored is very large (110.00MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/code/ray_test/big.ignored']})`
Traceback (most recent call last):
...
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/dashboard/modules/job/job_head.py", line 313, in upload_package
    data = await req.read()
  File "/usr/local/lib/python3.10/site-packages/aiohttp/web_request.py", line 669, in read
    raise HTTPRequestEntityTooLarge(
aiohttp.web_exceptions.HTTPRequestEntityTooLarge: Request Entity Too Large

Putting the following into .gitignore:

*.ignored

With that rule in place, only the non-ignored file small is transmitted, as expected:

root@7e3eba0d0c82:/code/infrastructure/ray# ray job submit --working-dir /code/ray_test/ -- bash -c 'pwd; ls'
...
/tmp/ray/session_2025-01-31_20-54-39_960259_69/runtime_resources/working_dir_files/_ray_pkg_de5d3921a5cdf9d8
small

Now, putting .* into the .gitignore, which as a glob should match only dotfiles (here just .gitignore itself) and none of the other files in my working dir:

.*
*.ignored

Ray ignores all files!

root@7e3eba0d0c82:/code/infrastructure/ray# ray job submit --working-dir /code/ray_test/ -- bash -c 'pwd; ls'
...
/tmp/ray/session_2025-01-31_20-54-39_960259_69/runtime_resources/working_dir_files/_ray_pkg_3030303030303030

Is Ray interpreting .* as a regular expression? That’s not how it is defined in Git docs!
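To illustrate the difference I mean, here is a small sketch comparing glob-style and regex-style matching of .* against the file names from the listing above. It uses Python's standard fnmatch module as a rough stand-in for gitignore basename matching (an assumption: fnmatch is not a full gitignore implementation, and this is not how Ray itself matches paths). It only shows that the two interpretations give very different results.

```python
import fnmatch
import re

# File names from the directory listing above.
files = [".gitignore", "big.ignored", "small", "small.ignored"]
pattern = ".*"

# Glob semantics: a literal dot followed by anything, so only dotfiles match.
glob_hits = [f for f in files if fnmatch.fnmatchcase(f, pattern)]

# Regex semantics: any character repeated zero or more times, so everything matches.
regex_hits = [f for f in files if re.fullmatch(pattern, f)]

print("glob: ", glob_hits)
print("regex:", regex_hits)
```

Under glob semantics only .gitignore is matched, whereas under regex semantics every file is matched, which is consistent with the empty working dir I am seeing.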

christina · February 4, 2025, 12:26am

Thank you for sharing your solution!! I’ll bring this up with the team and see if it’s intended behavior.
