Ray on Yarn (MapR - failing to get RAY_HEAD_ADDRESS)

Hi all,

I am trying to deploy Ray on YARN on a MapR cluster. First I patched out the "logged in" verification from skein, as described by @jcrist in MapR non-standard Hadoop security not supported · Issue #70 · dask/dask-yarn · GitHub, and now I am trying to get the example to run.

I have tinkered a bit with skein, with some initial success: I was able to launch simple "hello world" applications and to run/scale some toy examples with dask_yarn. However, all of my attempts to run the Ray-on-YARN example have failed so far (I have reduced the number of workers to 2 in the examples below).

The log on the head looks good so far:

2021-07-06 13:51:09,594 INFO services.py:1272 -- View the Ray dashboard at
2021-07-06 13:51:07,042 INFO scripts.py:560 -- Local node IP:
2021-07-06 13:51:10,620 SUCC scripts.py:592 -- --------------------
2021-07-06 13:51:10,620 SUCC scripts.py:593 -- Ray runtime started.
2021-07-06 13:51:10,620 SUCC scripts.py:594 -- --------------------
2021-07-06 13:51:10,620 INFO scripts.py:596 -- Next steps
2021-07-06 13:51:10,620 INFO scripts.py:597 -- To connect to this Ray runtime from another node, run
2021-07-06 13:51:10,620 INFO scripts.py:601 --   ray start --address='' --redis-password='5241590000000000'
2021-07-06 13:51:10,620 INFO scripts.py:606 -- Alternatively, use the following Python code:
2021-07-06 13:51:10,621 INFO scripts.py:609 -- import ray
2021-07-06 13:51:10,621 INFO scripts.py:610 -- ray.init(address='auto', _redis_password='5241590000000000')
2021-07-06 13:51:10,621 INFO scripts.py:618 -- If connection fails, check your firewall settings and network configuration.
2021-07-06 13:51:10,621 INFO scripts.py:623 -- To terminate the Ray runtime, run
2021-07-06 13:51:10,621 INFO scripts.py:624 --   ray stop   # <-- I just echoed the RAY_HEAD_ADDRESS here to see whether it got properly stored in the skein KV store
2021-07-06 13:51:12,233 INFO worker.py:735 -- Connecting to existing Ray cluster at address:
Iteration 0
Counter({('[redacted]', '[redacted]'): 100})
Iteration 1
Counter({('[redacted]', '[redacted]'): 100})
Iteration 2
Counter({('[redacted]', '[redacted]'): 100})
Iteration 3
Counter({('[redacted]', '[redacted]'): 100})
Iteration 4
Counter({('[redacted]', '[redacted]'): 100})
Iteration 5
Counter({('[redacted]', '[redacted]'): 100})
Iteration 6
Counter({('[redacted]', '[redacted]'): 100})
Iteration 7
Counter({('[redacted]', '[redacted]'): 100})
Iteration 8
Counter({('[redacted]', '[redacted]'): 100})
Iteration 9
Counter({('[redacted]', '[redacted]'): 100})

because one of the workers was started on the same node as the head. However, on both worker nodes I see that RAY_HEAD_ADDRESS appears not to be set in the current skein KV store (note: skein does recognise that it runs in a container, since otherwise a different error would be raised):

Error: Key 'RAY_HEAD_ADDRESS' is not set  # <- appears on both workers; the worker spawned on the same node as the head still works fine there, but the second worker on another node then fails with:
redis.exceptions.ConnectionError: Error 111 connecting to Connection refused.

Since I also had to adapt the example.py in the repository (removing the driver_object_store_memory keyword argument from ray.init), I am wondering whether anyone here might have more up-to-date instructions on how to get Ray running on YARN. Is there something that has to be done/modified to be able to access the skein KV store from the worker nodes, or should this work out of the box? Unfortunately I do not know much about YARN itself (or skein, for that matter), so any hints would be much appreciated!

The fix turned out to be rather simple:

skein kv get is non-blocking, so the worker errors out if it queries the key before the head has published it. Adding --wait (i.e. skein kv get --wait) makes the call block until the key is set, and the example runs fine in a multi-node setting.
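For reference, here is a sketch of what the worker service in the skein YAML spec might look like with the fix applied. The service name, resources, and the ray start flags are illustrative and should be adjusted to your own spec; the only actual change is the --wait flag on skein kv get:

```yaml
services:
  ray-worker:              # illustrative service name
    resources:
      vcores: 1
      memory: 4 GiB
    script: |
      # --wait blocks until the head service has published its address in
      # the skein key-value store, instead of failing immediately when the
      # worker container happens to start before the head.
      export RAY_HEAD_ADDRESS=$(skein kv get current --key=RAY_HEAD_ADDRESS --wait)
      ray start --address=$RAY_HEAD_ADDRESS --block
```

Inside a container, "current" refers to the running application, so the worker does not need to know its own application id.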