Let me first explain my use case. I have a Java application that submits tasks to a Python Ray cluster running on K8s (set up following this guide: Deploying on Kubernetes — Ray v2.0.0.dev0). In this setup the Java app is really just a “client”, and I have solved all serialization issues. It works against a local cluster, but I get confusing error messages when using the K8s one.
First, if I run my Java app from my laptop, it throws the exception below when calling Ray.init(). It looks to me like my Java app is being treated as a worker node, when it is really just a client. I can't find any ‘client’ API in Java comparable to ‘ray.client().connect()’ in Python.
05:27:44.043 [DefaultDispatcher-worker-1] INFO c.c.r.v.RemoteSingleAssetCmsLocalVolPA w/interface - sending request to grid for trade100
05:27:56.774 [main] ERROR i.r.runtime.DefaultRayRuntimeFactory - Failed to initialize ray runtime, with config {"ray":{"address":"10.23.113.84:6379","head-args":[],"job":{"code-search-path":"/home/ray/analytics-py-bct/bct/distributed/ray","id":"","jvm-options":[],"num-java-workers-per-process":1,"worker-env":{}},"logging":{"dir":"","level":"INFO","max-backup-files":10,"max-file-size":"500MB","pattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p %c{1} [%t]: %m%n"},"object-store":{"socket-name":null},"raylet":{"node-manager-port":0,"socket-name":null},"redis":{"password":"5241590000000000"},"run-mode":"CLUSTER","session-dir":"/tmp/ray/session_2021-08-24_03-22-32_402612_114"}}
java.lang.RuntimeException: Failed to get address info. Output: null
at io.ray.runtime.runner.RunManager.getAddressInfoAndFillConfig(RunManager.java:88)
at io.ray.runtime.RayNativeRuntime.start(RayNativeRuntime.java:79)
at io.ray.runtime.DefaultRayRuntimeFactory.createRayRuntime(DefaultRayRuntimeFactory.java:39)
at io.ray.api.Ray.init(Ray.java:39)
at io.ray.api.Ray.init(Ray.java:26)
...
Caused by: java.lang.RuntimeException: The exit value of the process is 1. Command: python -c import ray; print(ray._private.services.get_address_info_from_redis('10.23.113.84:6379', '10.42.9.133', redis_password='5241590000000000', log_warning=False))
output:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 310, in get_address_info_from_redis
redis_address, node_ip_address, redis_password=redis_password)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 284, in get_address_info_from_redis_helper
f"This node has an IP address of {node_ip_address}, and Ray "
RuntimeError: This node has an IP address of xxxx, and Ray expects this IP address to be either the Redis address or one of the Raylet addresses. Connected to Redis at 10.23.113.84:6379 and found raylets at ... but none of these match this node's IP 10.42.9.133. Are any of these actually a different IP address for the same node?You might need to provide --node-ip-address to specify the IP address that the head should use when sending to this node.
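As far as I can tell from the message, the check that fails boils down to something like the sketch below. This is my own simplified reading written in Java, not Ray's actual code: the class name, method name, and the raylet IPs are hypothetical placeholders (the real raylet list is elided in the log above). Only the Redis address and my node IP come from the log.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration of the check described by the error message.
// NOT Ray's implementation; all names here are made up for the example.
class NodeIpCheck {

    // Returns true if nodeIp is either the host part of the Redis address
    // or one of the raylet IPs registered with the cluster.
    static boolean isKnownNode(String nodeIp, String redisAddress, List<String> rayletIps) {
        String redisHost = redisAddress.split(":")[0];
        return nodeIp.equals(redisHost) || rayletIps.contains(nodeIp);
    }

    public static void main(String[] args) {
        // Placeholder raylet IPs; the real ones are not shown in the log.
        List<String> rayletIps = Arrays.asList("10.0.0.1", "10.0.0.2");

        // My laptop's IP is neither the Redis host nor a raylet, so this prints "false".
        System.out.println(isKnownNode("10.42.9.133", "10.23.113.84:6379", rayletIps));
    }
}
```

If that reading is right, a machine that is not already part of the cluster would always fail this check, which matches what I see from my laptop.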
Then I also tried running my Java app in one of the worker Pods in my K8s cluster. In this case, it is indeed able to connect to the cluster via Ray.init() and send tasks. But from what I observed in the dashboard, all tasks (hundreds, in my case) are scheduled onto the node my Java app is running on, and before long that node crashes because it runs out of memory.
I would say my use case is probably the most common one for distributed computing, so it should be easy to achieve. Could anyone shed some light on how I should do this with Ray? I can provide more detail about my use case or the error if needed.
Thanks,
-BS