Ray cluster connection dies after passing large array for calculation

Hello all! Ray novice here.

I have encountered a curious bug. I have a large array of strings, which I pass
to a function that calculates some numbers based on each string. The calculations are quite expensive,
so my trial array needs 45 min of calculation time on a 4x64-core AWS EC2 instance cluster.
Ray performs wonderfully.
However, as soon as I increase the array size (in this case, from 190 MB to 1.9 GB), the cluster crashes
shortly after I start the processing script, and I cannot connect to it anymore.
I run the script in a tmux session on the head node. This is the output I get from this session:

ENGAGE! # output of the script, telling me its started to run
Shared connection to 3.125.45.70 closed.
Error: Command failed:                        "ip-172-31-30-166" 13:46 17-Mar-21

  ssh -tt -i /home/msl/VLX.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8cf205e11d/afdaa39097/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.125.45.70 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && ($SHELL)'"'"'"'"'"'"'"'"''"'"' )'

When I try to connect to the dashboard, this is what I get:

ray dashboard config2.yaml
Attempting to establish dashboard locally at localhost:8265 connected to remote port 8265
2021-03-17 21:47:25,895	VWARN commands.py:255 -- Loaded cached provider configuration from /tmp/ray-config-07e1cc0e8b2c9c9bf8c2776e266a60f754847d4c
2021-03-17 21:47:25,895	WARN commands.py:260 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2021-03-17 21:47:27,065	INFO command_runner.py:356 -- Fetched IP: 3.125.45.70
2021-03-17 21:47:27,067	INFO log_timer.py:25 -- NodeUpdater: i-099977f1d0c388556: Got IP  [LogTimer=2ms]
2021-03-17 21:47:27,068	INFO command_runner.py:484 -- Forwarding ports
2021-03-17 21:47:27,069	VINFO command_runner.py:488 -- Forwarding port 8265 to port 8265 on localhost.
2021-03-17 21:47:27,071	VINFO command_runner.py:508 -- Running `None`
2021-03-17 21:47:27,072	VVINFO command_runner.py:510 -- Full command is `ssh -tt -L 8265:localhost:8265 -i /home/msl/VLX.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8cf205e11d/afdaa39097/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.125.45.70 while true; do sleep 86400; done`
mux_client_request_session: read from master failed: Broken pipe
Connection timed out during banner exchange
Error: Failed to forward dashboard from remote port 8265 to local port 8265. There are a couple possibilities: 
 1. The remote port is incorrectly specified 
 2. The local port 8265 is already in use.
 The exception is: Command failed:

  ssh -tt -L 8265:localhost:8265 -i /home/msl/VLX.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8cf205e11d/afdaa39097/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@3.125.45.70 while true; do sleep 86400; done

The code I am running:

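import ray

ray.init(address='auto')

@ray.remote
def calculateStuff(array, index):
    # Ray resolves the ObjectRef argument to the full array before the task runs
    sequence = array[index]
    throwlist = somecalc(sequence)
    return throwlist

# Store the array once in the object store and pass the reference to every task
permutlist_id = ray.put(permutlist)
result_ids = [calculateStuff.remote(permutlist_id, i) for i in range(len(permutlist))]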

The large array is permutlist.
Has anyone else experienced Ray crashing at a certain data size? I mean, I would understand it at some point, but 2 GB should be doable, right?

EDIT: Forgot to add specs. I am starting and controlling the cluster out of Ubuntu 20.04.
Cluster runs Ray 1.2.0 on an AWS EC2 instance running Ubuntu 20.04 LTS.

What version of Ray are you on? If you’re on master, maybe you could run ray cluster-dump YAML and post the logs here?

Thanks for the quick reply! I posted too quickly, my apologies; the version info is now in the original question. I am on Ray 1.2.0. Unfortunately, the cluster-dump command did not work. I worked through the Troubleshooting guide just now and checked ulimit -Hn on the head node, which gives me 1,048,576. Since my trial array (which runs) has 362,880 elements and the array that crashes has 3,628,800, could it simply be that I am running out of file descriptors as described?

I also checked the /tmp/ray/session_/logs/monitor folder, this is what I see in there:

dashboard.log                                                                         worker-2c7ff6864737aa3f47bc020bc8c0e09288dc4dbc4e87baa0e8aae52d-02000000-1817.err
dashboard_agent.log                                                                   worker-2c7ff6864737aa3f47bc020bc8c0e09288dc4dbc4e87baa0e8aae52d-02000000-1817.out
gcs_server.err                                                                        worker-2f122984e54aa577c3b417ad847c5bfbf21e635f7f74462b82acc8fb-02000000-1984.err
gcs_server.out                                                                        worker-2f122984e54aa577c3b417ad847c5bfbf21e635f7f74462b82acc8fb-02000000-1984.out
log_monitor.log                                                                       worker-317830d9972d0a9e9d06cca69a546dc225e26ef00e2e0488396e596d-01000000-376.err
monitor.err                                                                           worker-317830d9972d0a9e9d06cca69a546dc225e26ef00e2e0488396e596d-01000000-376.out
monitor.log                                                                           worker-34b70bb980101c00634885698607b866ddbad8d55c0c69ed5aac781b-02000000-1947.err
monitor.out                                                                           worker-34b70bb980101c00634885698607b866ddbad8d55c0c69ed5aac781b-02000000-1947.out
old                                                                                   worker-391842ed92e97d384841f625783ba2c9b6b44fd9b7d7fafc1ba9aa54-02000000-1785.err
plasma_store.err                                                                      worker-391842ed92e97d384841f625783ba2c9b6b44fd9b7d7fafc1ba9aa54-02000000-1785.out
plasma_store.out                                                                      worker-3c950ef6f2eb059b895750868bcf774cbf78d49842d1238cb15feafa-02000000-1751.err
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_307.log   worker-3c950ef6f2eb059b895750868bcf774cbf78d49842d1238cb15feafa-02000000-1751.out
python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_1725.log  worker-3cac36a8fec62b40f92199b7b3060dca1f712eebacacb627a694da55-02000000-1865.err
python-core-worker-016af9bb6b987df5d2af83546a29f10302ef9b1e581ba5bf6e7918f8_2030.log  worker-3cac36a8fec62b40f92199b7b3060dca1f712eebacacb627a694da55-02000000-1865.out
python-core-worker-019f08fa09fc7b27883eacbe43ceb698857766b6576ce046d419fac5_438.log   worker-3ee8487b556035cfbf1a75f6eab6ca1b25589770998cffb635a808b0-02000000-1907.err
python-core-worker-01ff276c15e896251c0a0a82aea89b27a6125ce7ae3ac2f284c6c13e_820.log   worker-3ee8487b556035cfbf1a75f6eab6ca1b25589770998cffb635a808b0-02000000-1907.out
python-core-worker-0382e6632ca85f2b0dfdb0ac159545e9f1ff77695f1bc866e8e2c942_402.log   worker-3fba9801ef714d5357e39eeac49736ae782e39cb8f05572c156cafa2-02000000-1808.err
python-core-worker-0590f388e333439098316330a971bfe40a41f83836e2a037e51fd9f0_1788.log  worker-3fba9801ef714d5357e39eeac49736ae782e39cb8f05572c156cafa2-02000000-1808.out
python-core-worker-077e2d5696ad09f4171a5858875ad5c7c22cadaf02e65e8829e20758_2044.log  worker-45fde6cc998bda299741d3e2acd3fe3bd33438835e563729510a8ad0-02000000-1821.err
python-core-worker-09155de0d8ca4eff21d8f9213296d7370aa065f1555bbb87fd79d912_1816.log  worker-45fde6cc998bda299741d3e2acd3fe3bd33438835e563729510a8ad0-02000000-1821.out
python-core-worker-0b8186865a9bec383d2d76d7862ff984069973bdb055f1000f5bda1b_1893.log  worker-460a4ad190ce9328b35a3582e913e997b7bc144692264ef574983759-02000000-1788.err
python-core-worker-0dc3060501bbac9918329ed193c794299b882f20cbef059f27462a5f_408.log   worker-460a4ad190ce9328b35a3582e913e997b7bc144692264ef574983759-02000000-1788.out
python-core-worker-103cbbca97f2dad54b0119d0ee2888bc0e83e689845b2faa8a41bf8a_1904.log  worker-4682747f0a932d36bd9eeca439e2fd3aaff129f133a4ddc6f0628f3b-02000000-1877.err
python-core-worker-122d791e032608427e9b585f7f6bdb7c097d8ab434d0dcb0e355f1ee_1786.log  worker-4682747f0a932d36bd9eeca439e2fd3aaff129f133a4ddc6f0628f3b-02000000-1877.out
python-core-worker-12b1036c39d7b28fc908ba317e07fa02f29ab18c8b7fb8d2bdad6bcc_1798.log  worker-479fc51da64641df43677f7df34d10673123b4aec77a4d7b347892fa-01000000-356.err
python-core-worker-15e86448f49f5f2341a8673e8b55b4434d22d3c17dd7c593ab5c0541_1883.log  worker-479fc51da64641df43677f7df34d10673123b4aec77a4d7b347892fa-01000000-356.out
python-core-worker-1726cebbfe06d89d9de3fcac5af62f8aaa89009868b689592695c4e0_2127.log  worker-4af7559197115de122a7f731ee410ebdce068d89ef5b87e70bbec2e1-01000000-447.err
python-core-worker-17960ff10c68d9331673a211a184dd9cbccb6421851b7a18b751a347_1823.log  worker-4af7559197115de122a7f731ee410ebdce068d89ef5b87e70bbec2e1-01000000-447.out
python-core-worker-1861ea6c71be426781fc22516e5f51045f7876ceb2bcafeda19f1bdd_1894.log  worker-4b468c7518369dd5cbee6cdd7f434533ab12fbd1848cd3f5972fa45f-02000000-1807.err
python-core-worker-1b1467eb044c75e426563bad61cb5067f7cb017fa9f6aeb4bf497fa8_1861.log  worker-4b468c7518369dd5cbee6cdd7f434533ab12fbd1848cd3f5972fa45f-02000000-1807.out
python-core-worker-1def552a4cee75a9f3ade06d17f4c6f8153306673e1a55bd579531db_368.log   worker-50bf9d177ae551058f042195219a38a1790ce12048d98db98e8f6ec5-01000000-410.err
python-core-worker-1edafcfc46c4f79deb78bad8197d6af0ec5a04a5430973e3d9640e4d_465.log   worker-50bf9d177ae551058f042195219a38a1790ce12048d98db98e8f6ec5-01000000-410.out
python-core-worker-2015ae7f24ab305bdf2c24411c5b8facc88e4eae8490a2a0f08a6638_1814.log  worker-53a8f9adfb024d3222af7e1b32f5d9e28563ae208f56b2c62c17abef-01000000-366.err
python-core-worker-234a1905923d8399ac5c0bef30066e5915619241a40a67fa46077b21_379.log   worker-53a8f9adfb024d3222af7e1b32f5d9e28563ae208f56b2c62c17abef-01000000-366.out
python-core-worker-23851e93a3ada728b5121209974d0eae1e61dd9548bc872dc54b0f16_1907.log  worker-541011d3d93e9714acf5b503ef1104af8e4e214c9c9afff948bc3832-02000000-1870.err
python-core-worker-2a5391bf78c8397d55c8e354c5eafdf33b698059f17b09b932de90cd_1877.log  worker-541011d3d93e9714acf5b503ef1104af8e4e214c9c9afff948bc3832-02000000-1870.out
python-core-worker-2bf8b06cae848127e8a90dff6fcb42f867c4534eeadd40cec3489d0d_442.log   worker-5448153a74e39cd3e5d8f9f84c983a037befafb21ca9bca08a08e557-01000000-409.err
python-core-worker-2cf617a2d34c04dd3b908693be25f54ca382f4b506b3a80ffda8f13b_405.log   worker-5448153a74e39cd3e5d8f9f84c983a037befafb21ca9bca08a08e557-01000000-409.out
python-core-worker-2d38a449cfaad53bd50b71d1ffced3d22157bdabc523215a17571318_1828.log  worker-57a11eb2e3a69aeef74e7a570a1a0936d81b64c5bbad471b5ca8e42a-01000000-404.err
python-core-worker-2d69ec8e7b15f8d653b3cc4be1dc09d5b4b9af7fa035c3b8d0ae962c_1820.log  worker-57a11eb2e3a69aeef74e7a570a1a0936d81b64c5bbad471b5ca8e42a-01000000-404.out
python-core-worker-3075442a943dc9576eac21dd9f7afbe2c4ce9b11ee22fa03ba60644b_1864.log  worker-59a72d33c7d034b6f3e4cee80cb6593adabcb88f964366e36e9f8c20-01000000-357.err
python-core-worker-312033d42753c57119ebcafa286dc24824d9074522d91da2ce65a478_433.log   worker-59a72d33c7d034b6f3e4cee80cb6593adabcb88f964366e36e9f8c20-01000000-357.out
python-core-worker-331c830366986af348a7dc89cbe3ba7c034904762b8aa3736375b41e_2758.log  worker-5ab3b5938298a4ac2de5842dc4b4aeab652ed43f2f05dc42b954f7ce-01000000-445.err
python-core-worker-333402b0e776afa618513e507f57dd7278c27303c962b6e228de5153_1805.log  worker-5ab3b5938298a4ac2de5842dc4b4aeab652ed43f2f05dc42b954f7ce-01000000-445.out
python-core-worker-338c174e2785370833c7f682ab45b4d5ffd0497dd7fd9fa7dd222860_1947.log  worker-5c19b4c6a2e66e11a0549dade03094ccbf43a49e7516a34b47353191-01000000-361.err
python-core-worker-34a013618b6aff3a2abbbf263fe001104d1c6d3ba06b89624e9e036e_401.log   worker-5c19b4c6a2e66e11a0549dade03094ccbf43a49e7516a34b47353191-01000000-361.out
python-core-worker-369aeabdeccdd90c0c7daa3524ae8135f84c3125ee9cca45b0b1174f_384.log   worker-5e3f3f40cd071aa105902e8e4f2bfe0850522313d45b3a4d297d192b-02000000-1783.err
python-core-worker-36ab1598bb4ddcb884b620654ebc6141f143fc845aacc063a307410c_412.log   worker-5e3f3f40cd071aa105902e8e4f2bfe0850522313d45b3a4d297d192b-02000000-1783.out
python-core-worker-3861963b08376c9441e92e46c658930ff72eedf77f1ad683f765b66f_832.log   worker-60a5eb8d0c42ab3fb77ebfbaf76a8861a36e1d2bb465df00e36dd807-02000000-1898.err
python-core-worker-3867ac1f20926a498115a1262e5cce040fa399db9300f574a35d31f2_366.log   worker-60a5eb8d0c42ab3fb77ebfbaf76a8861a36e1d2bb465df00e36dd807-02000000-1898.out
python-core-worker-38ec4a2679c0c831f7952b8fbf1fdfee797daf70594f1e1a46f6287f_1811.log  worker-61c83937c6c4225348af6e8166f90be067b22e142a837c9ad2642cfe-02000000-1750.err
python-core-worker-3b2b022003f41d27390aa10a23c43ba53a3d4e218b995fdcf3f1135a_398.log   worker-61c83937c6c4225348af6e8166f90be067b22e142a837c9ad2642cfe-02000000-1750.out
python-core-worker-3da98bfec0192d98f134436786721fcb6c2dfa971fa087acb5ac359e_1817.log  worker-61fa802ca2e872ea6e55da49f228c362641cb84b34de18451e44906d-02000000-1819.err
python-core-worker-3dacf1e87812fce3b9d0433ce5e06d707d4e267ec821577a5cfdae22_1762.log  worker-61fa802ca2e872ea6e55da49f228c362641cb84b34de18451e44906d-02000000-1819.out
python-core-worker-42ef997a5003db7a8691e5372c028570b12fe29f6ee464f5bb45cd7a_387.log   worker-6423ef1463b6459610bc7c397f9380b7c59540b1307b3f088c2e31c9-01000000-353.err
python-core-worker-436edf9a46f09074abbc19d4d98b9500cf7d1f4d1a72cc8e1ee499f5_461.log   worker-6423ef1463b6459610bc7c397f9380b7c59540b1307b3f088c2e31c9-01000000-353.out
python-core-worker-45e0f88d0ff3658d4434e70067807fa3d1630a4e95520fd15258f646_354.log   worker-6549c832e049bcd27ed7deb56016c081d2f66f52bbd2d57fd48dc2a8-02000000-1864.err
python-core-worker-4c421305cbd871dc8f3ce5d0735b2f66f614729a9deee10883fb320d_410.log   worker-6549c832e049bcd27ed7deb56016c081d2f66f52bbd2d57fd48dc2a8-02000000-1864.out
python-core-worker-4d281c89e278ab7056dc3cdb41dc211f6bfe86aac3e2be78e8e86668_1837.log  worker-664771316586eea0e8c5bad25dd2a3ebc13165b4e873f8a65a4ae378-02000000-1762.err
python-core-worker-4e116127a02db111dea011b46123095cc43b39925e99cdc38964fe9a_396.log   worker-664771316586eea0e8c5bad25dd2a3ebc13165b4e873f8a65a4ae378-02000000-1762.out
python-core-worker-4f426164ef7cda9c10073dde35ea5243157004ea478badad78fb6b56_1813.log  worker-670b29f4850813143bb421c3ec02442cf08ba946adc68264280082b5-01000000-368.err
python-core-worker-5217e5633eb07138bb503afbbf92c22810dc9f11413e110e7726cf81_403.log   worker-670b29f4850813143bb421c3ec02442cf08ba946adc68264280082b5-01000000-368.out
python-core-worker-532c88b561a37eac6c99449f15240591f5396ad08f6f6a0662e26e02_429.log   worker-680d4825ca5f2f1cbc6ea38d363142fc8b1c458948265758952f48f0-01000000-400.err
python-core-worker-55635eea13640fe6403a886d91d8015a4d2f769060500797a025e558_358.log   worker-680d4825ca5f2f1cbc6ea38d363142fc8b1c458948265758952f48f0-01000000-400.out
python-core-worker-5619f6803ba96eb42f2bd2eb06328e7f20a69cc32d71fcb81a8a30f0_372.log   worker-68174b0d7b3693fc604ec52a5f1a0e539bd98a6c3953a7cdd977be1c-02000000-1814.err
python-core-worker-5621654c9088ccb71321035d72f7e47f2bc67828b1c40299ff7cd965_1752.log  worker-68174b0d7b3693fc604ec52a5f1a0e539bd98a6c3953a7cdd977be1c-02000000-1814.out
python-core-worker-57eae7b8f7a459b389a94583e70a9500e0285bbe3a751a4e643d7cbb_699.log   worker-6835750267700fe8a14245cd21da76a5a5d1e86b44921df30599d54f-01000000-396.err
python-core-worker-5ecbc6913efe04fdd0ce23f9fa0c6a7f6f89c27c971944cd3bb4f846_357.log   worker-6835750267700fe8a14245cd21da76a5a5d1e86b44921df30599d54f-01000000-396.out
python-core-worker-625a16c29ee1e4bc8b0ca6eddae515ba4a35727889dce18a24a86f41_406.log   worker-68f40c3085a61a5ce8c3e1eae63a47d671dbacdcdbccbbdfa482b6c2-02000000-1850.err
python-core-worker-6356a2a4a745724240d002ff98fafbfb7c6011e41d253d95241d8086_1815.log  worker-68f40c3085a61a5ce8c3e1eae63a47d671dbacdcdbccbbdfa482b6c2-02000000-1850.out
python-core-worker-63d6e4bc3d048e12a3a2b8820a20c12851110dbfd08e87f453403431_447.log   worker-6bada52998c9deb09670527d54e0d52118f3cd4f098aeab793241b95-01000000-834.err
python-core-worker-65804ff3fb39e40819aad5731b8a893612200513b3dce735a219a3df_1810.log  worker-6bada52998c9deb09670527d54e0d52118f3cd4f098aeab793241b95-01000000-834.out
python-core-worker-6a7b99b01df1cbbe6cb439a6160b7778ae5c9a5c1ac41bd084d97d57_1853.log  worker-6c8fdc66d62eb99fc9d9e6a13047f3bebc228c47b83bcb48c57782a0-01000000-352.err
python-core-worker-6c045d0f9dc98cf6e95fc2b0b9140d220064f70146baec25914f607f_1964.log  worker-6c8fdc66d62eb99fc9d9e6a13047f3bebc228c47b83bcb48c57782a0-01000000-352.out
python-core-worker-6e1977a01f46b54434c5af4fa3607874b7282c616b2507e46516b569_1984.log  worker-71b7dc7a36bcad12c80b8aaa45f58a9e6e928c97fc0878c98fac9f0b-01000000-929.err
python-core-worker-704395b466c12815bbf2a87ea7421f24b5fd0c938d8f0862190a6d7d_1804.log  worker-71b7dc7a36bcad12c80b8aaa45f58a9e6e928c97fc0878c98fac9f0b-01000000-929.out
python-core-worker-71110891d5d98c85fea07797a6eadb21e4e465a5b27c9f5c02349c04_1145.log  worker-73e1c254e39deb5187cf598dd19e69cfe301c10d92e1f2661339b651-01000000-395.err
python-core-worker-7152f3315f7bc6a8bf8f67d30fe8c96a277693bb2d18637ca4b92cd0_362.log   worker-73e1c254e39deb5187cf598dd19e69cfe301c10d92e1f2661339b651-01000000-395.out
python-core-worker-7187e9fc5a5cf6321767835156489c64afb0f73d64b12741b8113cdf_1827.log  worker-75e2ca9a188f4cff23225df160d03f3b2a68d5cce221abfc7dd82905-01000000-461.err
python-core-worker-7230c2636462f7fd50c95f89593b5ad32940ae74e146f5911d990010_1834.log  worker-75e2ca9a188f4cff23225df160d03f3b2a68d5cce221abfc7dd82905-01000000-461.out
python-core-worker-73407c4ed7aacc4d9f3c10ed9786ca32f329b0b9082d3bc0b9f15727_451.log   worker-76d3fb8d8d14846b63196c086021358c32e02ab44f05cfa256637268-02000000-1823.err
python-core-worker-7aebe1a2f48a810fcd0f34da830dff81cb766859c7f5797dfb96fe93_394.log   worker-76d3fb8d8d14846b63196c086021358c32e02ab44f05cfa256637268-02000000-1823.out
python-core-worker-7afbb75ae95001c468a6d1c0e55772cdc3ec5185848dd1dffb9a79fb_364.log   worker-77d7d94da803c0231bc58223b3217b7e59b3b4855c422d21a9c2eda9-01000000-408.err
python-core-worker-7ba4030c68ae664a2837211bd67ebf61d2a37fe46196378e7817694d_2733.log  worker-77d7d94da803c0231bc58223b3217b7e59b3b4855c422d21a9c2eda9-01000000-408.out
python-core-worker-7edc6c19dff96fe802397ea2a82c71910f8f5e28966b53ec4451f462_382.log   worker-7ceee78ebb4908a4004b513a2be7d4bd83162fddfe1011753a2ffac2-01000000-365.err
python-core-worker-7fd85e37b87dd5b1fe52d35f318f469cc59e25ecafacfe81ac8ebf9b_445.log   worker-7ceee78ebb4908a4004b513a2be7d4bd83162fddfe1011753a2ffac2-01000000-365.out
python-core-worker-81a9df68db397fc0eb0d04182014ff4922cd21c8ab7816100535c4ca_444.log   worker-7d73066e77b62c3ac8255eb19554000de5744c8076769154c70809b9-02000000-1881.err
python-core-worker-83344ed104f11e65a218e14afd9d01358bd6b208611e3133ea9ba2d3_1843.log  worker-7d73066e77b62c3ac8255eb19554000de5744c8076769154c70809b9-02000000-1881.out
python-core-worker-8589bc06938b077ce1c8e41346a850d553d05668e2034776dd67728c_416.log   worker-84a1f1ea262d368d2fcf54004291133e208121b8d8a44aba8127b4d3-01000000-397.err
python-core-worker-886004c26e8b90ccebf89b12b27a95fdb25bb3b92277b20e3c238812_1751.log  worker-84a1f1ea262d368d2fcf54004291133e208121b8d8a44aba8127b4d3-01000000-397.out
python-core-worker-8b0c9a3dfbc8d49d175ec19557fed98b1ce432d6cf8d0beb7c56ce4a_1787.log  worker-86aab6336547d270d307a7b1cbe0d6f62120779098ac18a6f4ca035a-01000000-1251.err
python-core-worker-8ce9dc7282e89ca249daea90ec8621a370f583fbf50fc6834430696f_1809.log  worker-86aab6336547d270d307a7b1cbe0d6f62120779098ac18a6f4ca035a-01000000-1251.out
python-core-worker-8d09fd8ec8470a67e4736a0f96fa01737e73dbc7a154deb6c89d4614_404.log   worker-8853376022c19e0abf1beb46f64b5069ba0a092ac686707eb21a4535-02000000-1843.err
python-core-worker-8dec70bd4a670b731fd2f084a0107facf9c2b3f82dc93175cdab1473_332.log   worker-8853376022c19e0abf1beb46f64b5069ba0a092ac686707eb21a4535-02000000-1843.out
python-core-worker-8efcaa2d6e4bcd8272a4cbcb8ade3cefa0e2e5edf038f503496f2c19_417.log   worker-8f21f60f2564076d3c26fd4469fcb854f9dc51af8fb60901ce9877f5-01000000-460.err
python-core-worker-8fc0a46e582e9dd03c47f6ccd356e6808ad48e5b6c7c97534c784088_915.log   worker-8f21f60f2564076d3c26fd4469fcb854f9dc51af8fb60901ce9877f5-01000000-460.out
python-core-worker-91380b46628514fd9b39add88bd302513260eaa51c8d85b00131fa18_450.log   worker-9018740c72c22360eec9e9e01f62c466f3e45435c41a32e855092adf-02000000-1863.err
python-core-worker-9315a9e465d8e5c5915c07308055f6607e814ed1086a866418ef81b3_335.log   worker-9018740c72c22360eec9e9e01f62c466f3e45435c41a32e855092adf-02000000-1863.out
python-core-worker-97bf035bc964972cd20eb0e8d36df4758a88d41986033e0fe1c8f933_353.log   worker-9194fc556b3ac37b43672c72d2d366f448b1049b42e3d654274a15a8-02000000-1787.err
python-core-worker-97df82347031322335b2563db2404721510b988b264009b87c4368b0_1824.log  worker-9194fc556b3ac37b43672c72d2d366f448b1049b42e3d654274a15a8-02000000-1787.out
python-core-worker-98d834a1641c4ec6b09b7720df3d3fb4977550dad8028342c9f1412b_1806.log  worker-92532168b65c33370136a2bd860207f9f926d7fe90013e27941a2ed0-01000000-434.err
python-core-worker-9a5b6adf53cb791fb64892360649c7c46d41a391473effabd598ca9c_2043.log  worker-92532168b65c33370136a2bd860207f9f926d7fe90013e27941a2ed0-01000000-434.out
python-core-worker-9c06c6b1e1e71bab6b4e919f8c35316c1df092a6457154fb5125b8c9_376.log   worker-93145fe2552000e4febda428a33b44fa484d17cf46c4ceb9ab1a67d6-01000000-820.err
python-core-worker-9e4db79aee34a7672b3a5d39026bb948db3e03bff85d7fa4906c7a4a_424.log   worker-93145fe2552000e4febda428a33b44fa484d17cf46c4ceb9ab1a67d6-01000000-820.out
python-core-worker-9efe015283a4ec0ec36a22508660552604c60f7bb19ef70f2831a987_1881.log  worker-969d78b3c3a95572e996ff89144d9515e999fac7f041683deb653f47-01000000-699.err
python-core-worker-a148058c01bab3fd642d41e85e406f8d7df8706ddd7c7ef0cee33aea_457.log   worker-969d78b3c3a95572e996ff89144d9515e999fac7f041683deb653f47-01000000-699.out
python-core-worker-a21ef5f681eb043616122212665dae1fce117a2247acb3cde06974be_434.log   worker-9a72d6ed14ccb4339ec19f98628ee608ecaf18e2fc0f35eab06d7ed2-01000000-406.err
python-core-worker-a270813e8b420de809a0c391b8266c0fe33bd47ba29f5b7a7125fc91_1890.log  worker-9a72d6ed14ccb4339ec19f98628ee608ecaf18e2fc0f35eab06d7ed2-01000000-406.out
python-core-worker-a59498fdbb58d34ffc5d6056c425014377507b4d4da60de501f70107_399.log   worker-9ab53beba7ea5d8dfd5fb840d5137dcda9df3d06926eb1a1cc898c52-02000000-1784.err
python-core-worker-a6179f3547f2ee2b23119aabff7602a84287563ab96925b1d2223ff0_1859.log  worker-9ab53beba7ea5d8dfd5fb840d5137dcda9df3d06926eb1a1cc898c52-02000000-1784.out
python-core-worker-aace976c7894d5acfac5ce2686b0d0a2b0e9be9681297e6330a969d3_834.log   worker-9b8efbc2a1c8fe4a38df24509253911866c46f4bd043678e3408ce93-02000000-1786.err
python-core-worker-ab3d3fa327fc8d047ec3001082cc64348338758b9787f9db9a770cc8_1808.log  worker-9b8efbc2a1c8fe4a38df24509253911866c46f4bd043678e3408ce93-02000000-1786.out
python-core-worker-ab85b75d0141a47802a862e0ec1fb675e831de883ec3c995b64b1bcc_2283.log  worker-9e7ec68e8ded81004e7e29628360019051c089f97758b522a1f23eea-02000000-1893.err
python-core-worker-ac5a213b37303cc862f81ace0c933fe5e093701aa8e5f13d2b72f6bc_1841.log  worker-9e7ec68e8ded81004e7e29628360019051c089f97758b522a1f23eea-02000000-1893.out
python-core-worker-b08a9b2c00c6e0095eaa13b91428768668d273e936ba3ed5fb809a35_929.log   worker-9f11b888dd836a0108e22d216bedd808a10a191c4a0293b1cbf78db4-02000000-1811.err
python-core-worker-b47abc0e3ab7486bd1f9331e35c0b6866bd917f43ec7be0c159e1095_393.log   worker-9f11b888dd836a0108e22d216bedd808a10a191c4a0293b1cbf78db4-02000000-1811.out
python-core-worker-b7a3bab0711f352935c7ef423954c4d2dac3c3fefc075245d72a8120_1831.log  worker-a3db11d4cf31db3ed34a5343414c646436f571919020cd25c0a8f933-01000000-416.err
python-core-worker-b849ec0f42752ca69dc749501ce3cdccbac56260dc6c5086cdc61aa9_1818.log  worker-a3db11d4cf31db3ed34a5343414c646436f571919020cd25c0a8f933-01000000-416.out
python-core-worker-b9db014d39db5961068aa2c14dbe5dda8c63db360c539b6c8c86c080_1821.log  worker-a499b52597111525b312a67b5fbbe5ec901c99e6093fca3a620e2e89-02000000-1853.err
python-core-worker-ba68554e1a6ce298d542a30a39d3f186be89b09a89b262730c660680_1850.log  worker-a499b52597111525b312a67b5fbbe5ec901c99e6093fca3a620e2e89-02000000-1853.out
python-core-worker-bcdfb815d9ab5d7aab6f349ce393d16c71fe4f33ea8f9bd6f7ff0ddf_1807.log  worker-a616fe90591b691702b71727fa2272ca77f7fec88476497b02910ded-02000000-1822.err
python-core-worker-be57587a22c3f0f967d4666514669a27a56f9b0eb215853c90d1a7b2_508.log   worker-a616fe90591b691702b71727fa2272ca77f7fec88476497b02910ded-02000000-1822.out
python-core-worker-c1031310b67076d791231aae0d78d1092b73f5fb835ee5786248b97b_407.log   worker-a69f0cefaa95d00ebc17c8f3aff614e96705c775e800d909b562d591-01000000-382.err
python-core-worker-c16e18d604392d928f17a5f184826f500149e9013744c87a22e6ef6f_1763.log  worker-a69f0cefaa95d00ebc17c8f3aff614e96705c775e800d909b562d591-01000000-382.out
python-core-worker-c28ade4c347e2cdbe4412c2a7f9a9312cb16ce26288e7f195d570c20_1863.log  worker-b11880e0fe1fea7b57d12fee93b90fb13c99a246278096813395dc0d-01000000-433.err
python-core-worker-c4c0d655d7ff49b552d1b6952306653bb9882da4cc4424b076ce9eca_356.log   worker-b11880e0fe1fea7b57d12fee93b90fb13c99a246278096813395dc0d-01000000-433.out
python-core-worker-c646d037e1a52d220033f5396224651c6888f41231f1383e79ddaa4c_1803.log  worker-b14814b32a3f2556dbaf3c18becb4020459033de6e35cf10492bf772-02000000-1764.err
python-core-worker-c73fc2b0b0d9bd2dcb9c3c1f483bbc96cdf6fea12bc9568130787c3a_1886.log  worker-b14814b32a3f2556dbaf3c18becb4020459033de6e35cf10492bf772-02000000-1764.out
python-core-worker-c8dad3bcc3adec107f36164e684e9558ffeca5c3c2e998e88cc2fb7f_1783.log  worker-b61d99a8102c2f5a058976d8071256b123029b7ecf6a1938cf4de5ad-02000000-1809.err
python-core-worker-ca76f64bf02caa1cb1acda13aabb940ee0bfa9cbb975d531f03c52b0_338.log   worker-b61d99a8102c2f5a058976d8071256b123029b7ecf6a1938cf4de5ad-02000000-1809.out
python-core-worker-cb537c845fc8c0ac158f89692b984a1bc2c86f1df9abff4bb0fce43f_397.log   worker-b8d551ed76beee51cda4698b4c35b6051aae2ad668e1676a4d644f83-01000000-442.err
python-core-worker-ccbbe9406a8cdc70edb456ac919d60bb958746c7fd9ce7b86346b6b5_453.log   worker-b8d551ed76beee51cda4698b4c35b6051aae2ad668e1676a4d644f83-01000000-442.out
python-core-worker-ce23adcccc9c192db7af1ab187c37484f008032131172347da24906c_432.log   worker-bab3a1edc5572ab2cbaaf3beb122160ad75fc8005b1135ce968aab74-01000000-379.err
python-core-worker-d10ce24510abbbb456270efd3963da9048ce0e49330f1f5e84985455_1822.log  worker-bab3a1edc5572ab2cbaaf3beb122160ad75fc8005b1135ce968aab74-01000000-379.out
python-core-worker-d2ad5b50a0de1ee95497654f638210d6223270c5a0251bcd71cfc017_361.log   worker-be786de92fe8e37d1c67af064563aedc6950e7bc7361e2b450d7b06b-01000000-394.err
python-core-worker-d35ff64ffc1c1d41f4fe0d0faf2c425fe5d07a045c1c73b3826fb45c_460.log   worker-be786de92fe8e37d1c67af064563aedc6950e7bc7361e2b450d7b06b-01000000-394.out
python-core-worker-d3bdff6d374fd5941442d5a8ab2892cbbd64d660c4d06176ef3b2f0f_390.log   worker-bf40bd5b2cc413642ec909a5d17494020e2927ccd27e635a41234144-02000000-1845.err
python-core-worker-d4220e4018591834db29ae11326eed59a3cd7059298dccb2841fe67a_1845.log  worker-bf40bd5b2cc413642ec909a5d17494020e2927ccd27e635a41234144-02000000-1845.out
python-core-worker-d4681c5df9e44772c16822ad764a8d7661499a170fc6b4784d491cae_1753.log  worker-bfec8313d7ecca35d1f0d81aedf8d8f7c704c50a5c1fe201ba71bd05-02000000-1752.err
python-core-worker-d47a894e8b5dc4e5de01d6cd66e9fa485eda0789854908314b1e1787_351.log   worker-bfec8313d7ecca35d1f0d81aedf8d8f7c704c50a5c1fe201ba71bd05-02000000-1752.out
python-core-worker-d557a5e1dbd5c6732c09a5fdb6de3b98773db661fd604caaf8c24966_392.log   worker-c0830fd8555af8b9bdfae0be5caa31c4df22dce1b142be3cb7090ccf-02000000-1867.err
python-core-worker-d5bd7087d337438de45ab6ca7236efc98b0fd3a38d81a9b713703fe9_1812.log  worker-c0830fd8555af8b9bdfae0be5caa31c4df22dce1b142be3cb7090ccf-02000000-1867.out
python-core-worker-d72178e9b8d705414eebaa4c74c5f3eda65aa532993b34161ef803af_363.log   worker-c1f004cc3cd2b5bad9100c2d5ab33ffa4ab403228c53ad87e9c165c2-02000000-1815.err
python-core-worker-d8ff933d829dc580b192c627ba88a1a3694cae32b2eecb197f43ffe8_395.log   worker-c1f004cc3cd2b5bad9100c2d5ab33ffa4ab403228c53ad87e9c165c2-02000000-1815.out
python-core-worker-d947a4e1f782291d2824865b2da63eab76d118d44ca5b8b954a6b1a4_1819.log  worker-c2ce52929d81051d9f053483e13b265e999404abf4da4e7d9bc3a579-01000000-390.err
python-core-worker-db41bd30da5a576705933ccc75fcf813caa7471561f8578510d8fafd_1764.log  worker-c2ce52929d81051d9f053483e13b265e999404abf4da4e7d9bc3a579-01000000-390.out
python-core-worker-db9c71ddb0f4c7b1e63db24086144a4b6ffb1fa42b9e55daa14f0de8_1784.log  worker-c3654d58d2871ea134a2eb7560e6d430f319fc1e8b5a97769288ebc3-02000000-1874.err
python-core-worker-dd9c449d63bfea4900e01e855d52b72a69e7b80327fbc69454f19c52_2173.log  worker-c3654d58d2871ea134a2eb7560e6d430f319fc1e8b5a97769288ebc3-02000000-1874.out
python-core-worker-dfaa599643736d8dcc00859687f090f026630932c6dd599281179156_1750.log  worker-c4185a5de42b80c28256b92f96a0f4635e295e39597e4c00184b8cfb-02000000-1861.err
python-core-worker-e07f48532b0c6026a348648f6696503c3dacf6e16c774bfe32fb2ab0_365.log   worker-c4185a5de42b80c28256b92f96a0f4635e295e39597e4c00184b8cfb-02000000-1861.out
python-core-worker-e40eef15e527d3d7bb514427005ce692912c405b6b3dd021ffbd1fd7_355.log   worker-c443d182aebd47bce1088dfdde8c6638fa6feb4ba8f459f560db5ef0-02000000-1818.err
python-core-worker-e494da638b9638c6109791ea6f1a31368d2ffd00ede4e52c511c8d63_1865.log  worker-c443d182aebd47bce1088dfdde8c6638fa6feb4ba8f459f560db5ef0-02000000-1818.out
python-core-worker-e540182f42393c0f9ab8e973b02d4450375ba5b119658538929c17a8_793.log   worker-c4ece9f2135da74c2cd51f35029141562634351519e2e97160954f5e-01000000-457.err
python-core-worker-e79d1db431f6fc35df3f22c8d779929bd4c18760d8ef6e58c2ccff76_400.log   worker-c4ece9f2135da74c2cd51f35029141562634351519e2e97160954f5e-01000000-457.out
python-core-worker-e8235e92943b5aecff5d7e80087e1c7a6def80766676dffb7dc8dd87_374.log   worker-c546e7dc1c78d4affd653552088230bc7921a9cef4bd3e90fb2b8933-01000000-332.err
python-core-worker-e8f417724fa5ec2982ab0faf08c3cfc5eb778984b6738ae2d2fe736e_2258.log  worker-c546e7dc1c78d4affd653552088230bc7921a9cef4bd3e90fb2b8933-01000000-332.out
python-core-worker-ea290d476ed0c3cd1ab4849e18ca2cc0bad4551421335880bf6beee8_1874.log  worker-c6c8a5f8913c150f380bb3a0a8661270a963546420fc8b50b7e2afef-01000000-351.err
python-core-worker-eeff8e6ec6f86317eb456e75989972aa2b201ce0dcacca4ef7fc2480_409.log   worker-c6c8a5f8913c150f380bb3a0a8661270a963546420fc8b50b7e2afef-01000000-351.out
python-core-worker-f013226d250b1a2cfb8323edd43066fd66aea38898b09228197b25a8_1251.log  worker-c857e7c0637b195e2e5b822ca50916c0e4638db29aba7932f9d96d4b-01000000-915.err
python-core-worker-f0a894750442e68bab0fdcdf843c253f40308e35d030dead3853a231_1785.log  worker-c857e7c0637b195e2e5b822ca50916c0e4638db29aba7932f9d96d4b-01000000-915.out
python-core-worker-f6a266f59d100f246443ee3d5bed2a1f30c4323f1ae9bf14e0eea2ad_1252.log  worker-c931f4bb57792f17fbe531f01c489c85f7bc6b89e5f2296f04c8fa1e-02000000-1805.err
python-core-worker-f9bbb4b15d2672fd32653fd7803201472dc6f2e0b181e7bd1c5f48bc_1870.log  worker-c931f4bb57792f17fbe531f01c489c85f7bc6b89e5f2296f04c8fa1e-02000000-1805.out
python-core-worker-fb41f3cc5adb229cba1e6b1dc3fde81826c3647919dfebc2d85619d0_1793.log  worker-ca5130cd70e085403bbcc4935e1ab55cebfaa9a70cc764cdb26c8c3a-01000000-401.err
python-core-worker-fc7b16831fea10e7701ba24620eb24f3d8df132f30b8a1c8bae2027a_1867.log  worker-ca5130cd70e085403bbcc4935e1ab55cebfaa9a70cc764cdb26c8c3a-01000000-401.out
python-core-worker-fd001cdc6384f4466aaf0206592135505369b3d21f648cb906a9e42c_352.log   worker-cafb8b66995da9ab1caddd5972800f45e917d3c104ea67cde5c3433b-01000000-832.err
python-core-worker-fddc8ee914980cc0de76211dfee31759ad3abc53002ae331a16c9330_1898.log  worker-cafb8b66995da9ab1caddd5972800f45e917d3c104ea67cde5c3433b-01000000-832.out
raylet.err                                                                            worker-ccc1ce052e63f05a46f9018d407b5da80077070be330b8edd88b82cb-02000000-1820.err
raylet.out                                                                            worker-ccc1ce052e63f05a46f9018d407b5da80077070be330b8edd88b82cb-02000000-1820.out
redis-shard_0.err                                                                     worker-cdebaf34e39a2037bb2a34a39b690f55107ec4c9f6f5dc845c291d8f-02000000-1798.err
redis-shard_0.out                                                                          

I am assuming this means a lot of worker tasks crashed?

Hmm, no it just looks like you have many workers :slight_smile: Is it possible to tail the logs to get a sense of what is failing?

@Alex This might be an object broadcasting failure.

@mslvlx If you run 10% of the tasks at a time, does it still crash? That is, instead of submitting all tasks at the same time, submit 10% of them, call ray.get, and repeat 10 times. I am not telling you this is a solution; I would just like to see if my hypothesis is correct.

Thanks for the quick suggestions both!

@Alex Unfortunately I inadvertently killed the head node container before I could pull the logs.
@sangcho : I ran your suggestion, code below:

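allmfe = []  # accumulates the results of all chunks
chunks1 = 10
# Assumes len(permutlist) is divisible by chunks1; otherwise trailing elements are skipped
blocksize = len(permutlist) // chunks1

for y in range(chunks1):
    # Put only one chunk into the object store at a time
    temppermutlist = permutlist[blocksize*y:blocksize*(y+1)]
    temppermutlist_id = ray.put(temppermutlist)
    tempresult_ids = [calculateStuff.remote(temppermutlist_id, i) for i in range(len(temppermutlist))]
    # Wait for this chunk to finish before submitting the next one
    tempresult = ray.get(tempresult_ids)
    allmfe = allmfe + tempresult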

Runs beautifully, autoscaler responds by scaling up, no crashing issues anymore.
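
For anyone adapting this, here is a minimal sketch of just the splitting step as a standalone helper. It spreads any remainder over the first chunks, so no elements are lost when the list length is not a multiple of the chunk count. The name split_into_chunks is made up for illustration and is not part of Ray; each yielded chunk would then go through ray.put and the task submission as in the loop above.

```python
def split_into_chunks(items, n_chunks):
    """Yield n_chunks contiguous slices that together cover every element of items."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(items), n_chunks)
    start = 0
    for k in range(n_chunks):
        # The first `extra` chunks get one additional element each.
        size = base + (1 if k < extra else 0)
        yield items[start:start + size]
        start += size


# Small stand-in for permutlist: 23 items split into 10 chunks.
demo_chunks = list(split_into_chunks(list(range(23)), 10))
print([len(c) for c in demo_chunks])
```

With 23 items and 10 chunks, the first three chunks hold three elements and the remaining seven hold two.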

I think the issue is related to object broadcasting. When a large object (let’s say 1+ GB) is used by tasks on other nodes, it has to be pulled to those nodes, so object transfer happens between nodes.

The problem is that Ray currently has scalability limitations in object broadcasting. As you can see from the benchmarks (ray/benchmarks at master · ray-project/ray · GitHub), if we try to broadcast 1 GB objects to 50 nodes at the same time, it crashes. I think this is what you are facing, and you resolved the issue by reducing the broadcast load on the system by scheduling fewer tasks at a time.

We are planning to improve this workload (maybe until it works well up to 100 nodes), but in the meantime this could be a temporary solution for you. If you see any particular performance issue with this approach, please let me know! :slight_smile:

Thanks for the explanation!
I did some further testing: as long as the data chunks stay below roughly 1 GB, the system is fine.

Thanks for the quick help! Much appreciated.
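
In case it helps others, the "roughly 1 GB" observation above can be turned into a quick chunk-count estimate. This is only a sketch under stated assumptions: the 1 GB figure is what was observed in this thread, not a documented Ray limit, and the pickle-based size is an approximation of what actually ends up in the object store. Both function names are made up for illustration.

```python
import pickle


def estimate_serialized_bytes(obj):
    """Rough size estimate: byte length of the standard-library pickle of obj.
    Assumption: this only approximates the size Ray stores for the object."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def chunks_needed(total_bytes, max_chunk_bytes):
    """Smallest number of equal-sized chunks that keeps each chunk at or under the limit."""
    if max_chunk_bytes <= 0:
        raise ValueError("max_chunk_bytes must be positive")
    # Integer ceiling division, with at least one chunk.
    return max(1, (total_bytes + max_chunk_bytes - 1) // max_chunk_bytes)


GB = 10**9  # decimal gigabytes, assumed here for simplicity

print(chunks_needed(1_900_000_000, 1 * GB))       # 2
print(chunks_needed(1_900_000_000, 190_000_000))  # 10
print(estimate_serialized_bytes(["ACGU" * 10] * 1000))
```

By this estimate, two chunks would be the bare minimum for the 1.9 GB array, which sits right at the observed threshold. The 10 chunks of about 190 MB each used in this thread leave much more headroom.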
