Testing Ray Cluster via Manual Setup

Happy 2021 everyone :smiley:

Today I set out to speed-test Ray with a manual cluster setup. It didn’t go so well; once I have it working, maybe I can submit a use-case example like this one to the docs?

My setup is as follows:

  1. Head Node: VPC with 4 cores, no firewall, Ray 1.1.0, Python 3.6.9
  2. Worker Node: Google Colab with 2 cores, firewall, Ray 1.1.0, Python 3.6.9

Steps taken:

  1. on head node - ray start --head
  2. on worker node - ray start --address='8.8.8.8:6379' --redis-password='0000000000000000' # note: address has been obfuscated
  3. on head node - python custom_tf_policy.py

custom_tf_policy.py is located at https://github.com/ray-project/ray/blob/master/rllib/examples/custom_tf_policy.py#L48

I changed line 48 to ray.init(address='auto', _redis_password='0000000000000000')
I changed line 56 to "num_workers": 5,

Observations
The head node starts perfectly, and connecting the worker to the head works too. The output looks normal when executing custom_tf_policy.py, but it never makes it to training:

2021-01-01 22:08:29,908 INFO worker.py:657 -- Connecting to existing Ray cluster at address: 8.8.8.8:6379
== Status ==
Memory usage on this node: 1.6/7.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 6/14 CPUs, 0/4 GPUs, 0.0/33.84 GiB heap, 0.0/10.16 GiB objects
Result logdir: /home/ray/ray_results/MyCustomTrainer_2021-01-01_22-08-29
Number of trials: 1/1 (1 RUNNING)
+-----------------------------------------+----------+-------+
| Trial name                              | status   | loc   |
|-----------------------------------------+----------+-------|
| MyCustomTrainer_CartPole-v0_e153c_00000 | RUNNING  |       |
+-----------------------------------------+----------+-------+

There are a few INFO and WARNING messages, but nothing related to workers, the network, or anything unusual. A Ctrl-C yields the following traceback:

  File "/home/ray/.local/lib/python3.6/site-packages/ray/tune/tune.py", line 419, in run
    runner.step()
  File "/home/ray/.local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 360, in step
    self._process_events()  # blocking
  File "/home/ray/.local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 469, in _process_events
    trial = self.trial_executor.get_next_available_trial()  # blocking
  File "/home/ray/.local/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 472, in get_next_available_trial
    [result_id], _ = ray.wait(shuffled_results)
  File "/home/ray/.local/lib/python3.6/site-packages/ray/worker.py", line 1513, in wait
    worker.current_task_id,
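
Since the trace bottoms out in ray.wait, it looks like the driver never gets results back from the remote rollout workers. To narrow that down, here is a plain-task smoke test that can be run from the head node (just a sketch using the same connection settings as the training script, not part of the example):

import socket
import time

import ray
from ray.exceptions import GetTimeoutError

# Connect to the running cluster the same way the training script does.
ray.init(address="auto", _redis_password="0000000000000000")

@ray.remote
def where_am_i():
    # Sleep briefly so the scheduler spreads tasks beyond the head node,
    # then report which host this task actually ran on.
    time.sleep(1)
    return socket.gethostname()

# More tasks than the head node has cores, so some should land on the workers.
refs = [where_am_i.remote() for _ in range(20)]
try:
    hosts = ray.get(refs, timeout=60)
    print("Tasks ran on:", sorted(set(hosts)))
except GetTimeoutError:
    print("Timed out -- results from remote workers never came back.")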

Here’s the full script to save you from opening the page:

import argparse
import os

import ray
from ray import tune
from ray.rllib.agents.trainer_template import build_trainer
from ray.rllib.evaluation.postprocessing import discount_cumsum
from ray.rllib.policy.tf_policy_template import build_tf_policy
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()

parser = argparse.ArgumentParser()
parser.add_argument("--stop-iters", type=int, default=200)
parser.add_argument("--num-cpus", type=int, default=0)


def policy_gradient_loss(policy, model, dist_class, train_batch):
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    return -tf.reduce_mean(
        action_dist.logp(train_batch["actions"]) * train_batch["returns"])


def calculate_advantages(policy,
                         sample_batch,
                         other_agent_batches=None,
                         episode=None):
    sample_batch["returns"] = discount_cumsum(sample_batch["rewards"], 0.99)
    return sample_batch


MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=policy_gradient_loss,
    postprocess_fn=calculate_advantages,
)

MyTrainer = build_trainer(
    name="MyCustomTrainer",
    default_policy=MyTFPolicy,
)

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init(address='auto', _redis_password='0000000000000000')
    tune.run(
        MyTrainer,
        stop={"training_iteration": args.stop_iters},
        config={
            "env": "CartPole-v0",
            # Use GPUs iff RLLIB_NUM_GPUS env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "num_workers": 5,
            "framework": "tf",
        })

When you run this driver on the head node:

import ray
ray.init(address='auto')
print(ray.nodes())

Does it show worker nodes?

Yes, I’ve got 3 nodes connected, and they show up in the records returned by ray.nodes():

[{'NodeID': '18490838782c70', 'Alive': True, 'NodeManagerAddress': '8.8.8.8', 'NodeManagerHostname': 'head_node', 'NodeManagerPort': 52159, 'ObjectManagerPort': 40277, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/raylet', 'MetricsExportPort': 53464, 'alive': True, 'Resources': {'object_store_memory': 30.0, 'node:8.8.8.8': 1.0, 'memory': 89.0, 'CPU': 4.0}}, 
{'NodeID': '17a0e68d826f7', 'Alive': True, 'NodeManagerAddress': '4.4.4.4', 'NodeManagerHostname': 'worker1', 'NodeManagerPort': 65239, 'ObjectManagerPort': 34027, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/raylet', 'MetricsExportPort': 62493, 'alive': True, 'Resources': {'object_store_memory': 51.0, 'memory': 174.0, 'CPU': 2.0, 'node:172.28.0.2': 1.0}}, 
{'NodeID': '8718f31416623', 'Alive': True, 'NodeManagerAddress': '192.168.0.14', 'NodeManagerHostname': 'worker2', 'NodeManagerPort': 55809, 'ObjectManagerPort': 40157, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/raylet', 'MetricsExportPort': 40028, 'alive': True, 'Resources': {'memory': 219.0, 'accelerator_type:GTX': 1.0, 'node:192.168.0.14': 1.0, 'GPU': 4.0, 'CPU': 8.0, 'object_store_memory': 64.0}}]
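
For completeness, the same driver connection can also confirm that the scheduler sees all of those cores (a minimal sketch, using the same connection settings as the training script):

import ray

ray.init(address="auto", _redis_password="0000000000000000")

# Total resources registered across every node in the cluster.
print(ray.cluster_resources())

# Resources that are currently free; CPUs held by stuck workers
# would be missing from this view.
print(ray.available_resources())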

Hmm maybe there’s an issue with the script in this case? cc @sven1977

I’m very thankful for the responses :pray:
And I’d be happy to submit a PR adding what we find to the docs on manual cluster setup.

Another user in the Slack channel also said they could never get a manual cluster working.

Solved, thanks to the help of a friend: the problem was ports not being open. A note to new users testing their cluster: first test with machines that have all the necessary ports open.
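
For anyone else hitting this, here is the kind of quick reachability check that would have saved me a day (a rough sketch: fill in the real head node IP and the ports your nodes report, e.g. the Redis port 6379 plus the NodeManagerPort/ObjectManagerPort values from ray.nodes(), and run it from every machine against the others, since Ray needs connectivity in both directions):

import socket

HEAD_IP = "8.8.8.8"        # obfuscated head node address, as above
PORTS_TO_CHECK = [6379]    # add the node manager / object manager ports from ray.nodes()

for port in PORTS_TO_CHECK:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)
    try:
        sock.connect((HEAD_IP, port))
        print("port", port, "is reachable")
    except OSError:
        print("port", port, "is blocked or closed")
    finally:
        sock.close()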

I’d also recommend that the docs tell people to check the following file for clues about cluster issues: /tmp/ray/session_latest/logs/raylet.out
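
Something as simple as this, run on each node, surfaces the relevant errors (just a convenience sketch; tailing the same path in a shell works too):

from pathlib import Path

# Print the last 50 lines of this node's raylet log.
log_path = Path("/tmp/ray/session_latest/logs/raylet.out")
print("\n".join(log_path.read_text().splitlines()[-50:]))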


We have a port specification here: Configuring Ray — Ray v1.1.0, but I think its visibility was poor enough that it didn’t help you. What sort of document would have helped you resolve this issue on your own? I’d love to hear your recommendations.