Hi, I have a Ray cluster running on Kubernetes (following this doc: Deploying on Kubernetes — Ray v2.0.0.dev0). I also have a standalone Java application (e.g. running on my laptop) that tries to submit tasks to this cluster. The cluster exposes all the necessary ports (e.g. 6379, 10001), but the connection fails with the exception below:
10:14:42.560 [main] ERROR i.r.runtime.DefaultRayRuntimeFactory - Failed to initialize ray runtime, with config {"ray":{"address":"some_public_ip:6379","head-args":[],"job":{"code-search-path":"/Users/baolins/Documents/tongyu/cmsc_valuation_demo/analytics-py-bct/bct/distributed/ray","id":"","jvm-options":[],"num-java-workers-per-process":1,"worker-env":{}},"logging":{"dir":"","level":"INFO","max-backup-files":10,"max-file-size":"500MB","pattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p %c{1} [%t]: %m%n"},"object-store":{"socket-name":null},"raylet":{"node-manager-port":0,"socket-name":null},"redis":{"password":"5241590000000000"},"run-mode":"CLUSTER","session-dir":"/tmp/ray/session_2021-08-23_05-45-47_185807_115"}}
java.lang.RuntimeException: Failed to get address info. Output: null
at io.ray.runtime.runner.RunManager.getAddressInfoAndFillConfig(RunManager.java:88)
at io.ray.runtime.RayNativeRuntime.start(RayNativeRuntime.java:79)
at io.ray.runtime.DefaultRayRuntimeFactory.createRayRuntime(DefaultRayRuntimeFactory.java:39)
at io.ray.api.Ray.init(Ray.java:39)
at io.ray.api.Ray.init(Ray.java:26)
at tech.tongyu.riskcalc.valuation.remote.RemotePricing.rpcCollect(RemotePricing.kt:92)
at tech.tongyu.TestKt.main(test.kt:15)
at tech.tongyu.TestKt.main(test.kt)
Caused by: java.lang.RuntimeException: The exit value of the process is 134. Command: python -c import ray; print(ray._private.services.get_address_info_from_redis('some_public_ip:6379', '192.168.31.241', redis_password='5241590000000000', log_warning=False))
output:
[2021-08-24 10:14:42,553 C 24750 5439505] service_based_gcs_client.cc:228: Couldn't reconnect to GCS server. The last attempted GCS server address was 10.42.1.241:40884
*** StackTrace Information ***
@ 0x10e376360 ray::SpdLogMessage::Flush()
@ 0x10e34b0c9 ray::RayLog::~RayLog()
@ 0x10e01dfdc ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
@ 0x10e01d73c ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
@ 0x10e31747e ray::PeriodicalRunner::DoRunFnPeriodically()
@ 0x10e31737b ray::PeriodicalRunner::RunFnPeriodically()
@ 0x10e01cbab ray::gcs::ServiceBasedGcsClient::Connect()
@ 0x10de4453f ray::gcs::GlobalStateAccessor::Connect()
@ 0x10de2356b __pyx_pw_3ray_7_raylet_19GlobalStateAccessor_3connect()
@ 0x10cb2a83f method_vectorcall_NOARGS
@ 0x10cc51bc3 _PyEval_EvalFrameDefault
@ 0x10cc44ebd _PyEval_EvalCodeWithName
@ 0x10cb1bcba _PyFunction_Vectorcall
@ 0x10cc51bc3 _PyEval_EvalFrameDefault
@ 0x10cb1bc08 _PyFunction_Vectorcall
@ 0x10cc51bc3 _PyEval_EvalFrameDefault
@ 0x10cb1bc08 _PyFunction_Vectorcall
@ 0x10cc51bc3 _PyEval_EvalFrameDefault
@ 0x10cc44ebd _PyEval_EvalCodeWithName
@ 0x10cb1bcba _PyFunction_Vectorcall
@ 0x10cc522f5 _PyEval_EvalFrameDefault
@ 0x10cc44ebd _PyEval_EvalCodeWithName
@ 0x10cb1bcba _PyFunction_Vectorcall
@ 0x10cc522f5 _PyEval_EvalFrameDefault
@ 0x10cc44ebd _PyEval_EvalCodeWithName
@ 0x10ccc0323 PyRun_StringFlags
@ 0x10ccc0172 PyRun_SimpleStringFlags
@ 0x10cce5856 pymain_run_command
@ 0x10cce471d pymain_run_python
@ 0x10cce44e5 Py_RunMain
@ 0x10cce5bf1 pymain_main
@ 0x10caee328 main
at io.ray.runtime.runner.RunManager.runCommand(RunManager.java:115)
at io.ray.runtime.runner.RunManager.getAddressInfoAndFillConfig(RunManager.java:78)
… 7 common frames omitted
It seems the client is trying to connect to Ray via the cluster-internal IP (e.g. 10.xx.xx.xx), which my laptop presumably cannot reach. Connecting from outside the cluster seems like a pretty common use case to me, though.
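To check that suspicion independently of Ray, a plain TCP reachability test can be run from the laptop against both the public address and the internal GCS address from the log. This is only a minimal sketch using standard JDK sockets; the class name PortCheck and the 3-second timeout are assumptions for illustration and are not part of the Ray API.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of Ray): checks whether a TCP port accepts connections.
class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host could not be resolved.
            return false;
        }
    }

    // Each argument is expected in the form host:port.
    public static void main(String[] args) {
        for (String target : args) {
            int sep = target.lastIndexOf(':');
            String host = target.substring(0, sep);
            int port = Integer.parseInt(target.substring(sep + 1));
            System.out.println(target + " reachable: " + isReachable(host, port, 3000));
        }
    }
}
```

Passing some_public_ip:6379 and 10.42.1.241:40884 as arguments would show whether only the public Redis port is reachable while the internal GCS address is not, which would be consistent with the error above.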
Any suggestions? Thanks a lot.
-BS