Ray Cluster on-premise

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hi.

Since Ray 2.4.0 I’ve been running an on-premise cluster on 5 to 6 machines. The dashboard typically looks something like this:

[screenshot of the Ray dashboard]

I launch the cluster by running ray start --head on the head node (machine) and then ray start --address=head-node-address:port on the other machines. All machines have the same conda environment installed.

I’m trying to optimize multiple hyperparameters with Hyperopt, based on data contained in 16 rather large pandas dataframes. This used to work fine, but recently I started getting errors similar to those in issue 7697. Admittedly, the dataframes have more than doubled in size over time, as a number of (for now unused) columns have been added. Runtime used to be between 12 and 48 hours, but now the run consistently fails after completing somewhere between 150 and 750 of the 1000 trials. Right now I’m checking whether the solution given in the issue above helps, but I can’t tell yet.

In addition to having grown in size, the master dataframes now live on a third-party server, which I access over ssh to download them to each machine.

My question is: how could I do all of this with ray up from a YAML file? The key point is to fetch the master dataframes automatically and, ideally, to use the third-party server as a file share for the cluster. A rough sketch of what I have in mind is below.
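Something along these lines is what I imagine, using the on-premise (local) provider; all host names, the user name, and the paths below are placeholders rather than my actual setup, and it assumes password-less ssh and Ray installed on every node:

```yaml
cluster_name: onprem

provider:
    type: local
    head_ip: head-node-address
    worker_ips: [worker-1, worker-2, worker-3, worker-4, worker-5]

auth:
    ssh_user: jorgen
    ssh_private_key: ~/.ssh/id_rsa

# Runs on every node before Ray starts: pull the master dataframes
# from the third-party server (placeholder host and paths).
setup_commands:
    - rsync -az dataserver:/path/to/master_frames/ ~/master_frames/

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
```

Alternatively, mounting the data server on every node (e.g. with sshfs or NFS) from setup_commands would avoid copying the dataframes at all, but I haven’t tried that.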

BR

Jorgen

I can now confirm that increasing the ulimit value, as suggested in issue 7697, solves the problem and prevents the cluster from breaking down.

I did this by simply running ulimit -n 8192 in the same shell session (with the conda env active) before running the ray start commands on each machine.
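If I eventually switch to launching with ray up, I expect the same fix can be prefixed to the start commands in the YAML, so that the raised open-file limit applies to the shell that launches Ray on each node. Untested on my side:

```yaml
# Raise the open-file limit in the same shell that starts Ray.
head_start_ray_commands:
    - ray stop
    - ulimit -n 8192; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -n 8192; ray start --address=$RAY_HEAD_IP:6379
```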

BR

Jorgen