Ray Cluster on-premise

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hi.

Since Ray 2.4.0 I’ve been running an on-premise cluster on 5 to 6 machines. The dashboard typically looks something like this:

[screenshot of the Ray dashboard]

I launch the cluster by running ray start --head on the head node (machine) and then ray start --address=head-node-address:port on the other machines. All machines have the same conda environment installed.

I’m trying to optimize multiple hyperparameters with Hyperopt, based on data contained in 16 rather large pandas dataframes. This used to work fine, but recently I started getting errors similar to those in issue 7697. Admittedly, the dataframes have more than doubled in size over time, as a number of (for now unused) columns have been added. Runtime used to be between 12 and 48 hours, but now the run consistently fails after completing somewhere between 150 and 750 of the 1000 trials. Right now I’m checking whether the solution given in the issue above helps, but I can’t tell yet.

In addition to having grown in size, the master dataframes now live on a third-party server, which I access over ssh to download them to each machine.

My question is: how could I do all of this with ray up from a YAML file? The key point is to fetch the master dataframes automatically and, ideally, to use the third-party server as a file share for the cluster. A rough sketch of what I have in mind is below.
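Something along these lines is what I imagine, using the on-premise (local) provider; all host names, the user name, and the paths below are placeholders rather than my actual setup, and it assumes password-less ssh and Ray installed on every node:

```yaml
cluster_name: onprem

provider:
    type: local
    head_ip: head-node-address
    worker_ips: [worker-1, worker-2, worker-3, worker-4, worker-5]

auth:
    ssh_user: jorgen
    ssh_private_key: ~/.ssh/id_rsa

# Runs on every node before Ray starts: pull the master dataframes
# from the third-party server (placeholder host and paths).
setup_commands:
    - rsync -az dataserver:/path/to/master_frames/ ~/master_frames/

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
```

Alternatively, mounting the data server on every node (e.g. with sshfs or NFS) from setup_commands would avoid copying the dataframes at all, but I haven’t tried that.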

BR

Jorgen

I can now confirm that increasing the ulimit value, as suggested in issue 7697, solves the problem and prevents the cluster from breaking down.

I did this by simply running ulimit -n 8192 in the same shell session (with the conda env active) before running the ray start commands on each machine.
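If I eventually switch to launching with ray up, I expect the same fix can be prefixed to the start commands in the YAML, so that the raised open-file limit applies to the shell that launches Ray on each node. Untested on my side:

```yaml
# Raise the open-file limit in the same shell that starts Ray.
head_start_ray_commands:
    - ray stop
    - ulimit -n 8192; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ulimit -n 8192; ray start --address=$RAY_HEAD_IP:6379
```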

BR

Jorgen