Testing Ray Cluster via Manual Setup

Happy 2021 everyone :smiley:

Today I set out to speed-test Ray with a manual cluster setup. It didn’t go so well; once I have it working, maybe I can submit a use-case example like this one to the docs?

My setup is as follows:

  1. Head Node: VPC with 4 cores, no firewall, Ray 1.1.0, Python 3.6.9
  2. Worker Node: Google Colab with 2 cores, firewall, Ray 1.1.0, Python 3.6.9

Steps taken:

  1. on head node - ray start --head
  2. on worker node - ray start --address='8.8.8.8:6379' --redis-password='0000000000000000' # note: address has been obfuscated
  3. on head node - python custom_tf_policy.py

custom_tf_policy.py is located at https://github.com/ray-project/ray/blob/master/rllib/examples/custom_tf_policy.py#L48

I changed line 48 to ray.init(address='auto', _redis_password='0000000000000000')
I changed line 56 to "num_workers": 5,

Observations
The head node starts perfectly, and connecting the worker to the head works too. The output looks normal when executing custom_tf_policy.py, but it never makes it to training:

2021-01-01 22:08:29,908 INFO worker.py:657 -- Connecting to existing Ray cluster at address: 8.8.8.8:6379
== Status ==
Memory usage on this node: 1.6/7.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 6/14 CPUs, 0/4 GPUs, 0.0/33.84 GiB heap, 0.0/10.16 GiB objects
Result logdir: /home/ray/ray_results/MyCustomTrainer_2021-01-01_22-08-29
Number of trials: 1/1 (1 RUNNING)
+-----------------------------------------+----------+-------+
| Trial name                              | status   | loc   |
|-----------------------------------------+----------+-------|
| MyCustomTrainer_CartPole-v0_e153c_00000 | RUNNING  |       |
+-----------------------------------------+----------+-------+

There are a few INFO and WARNING messages, but nothing related to workers, the network, or anything unusual. A Ctrl-C yields the following traceback:

  File "/home/ray/.local/lib/python3.6/site-packages/ray/tune/tune.py", line 419, in run
    runner.step()
  File "/home/ray/.local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 360, in step
    self._process_events()  # blocking
  File "/home/ray/.local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 469, in _process_events
    trial = self.trial_executor.get_next_available_trial()  # blocking
  File "/home/ray/.local/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 472, in get_next_available_trial
    [result_id], _ = ray.wait(shuffled_results)
  File "/home/ray/.local/lib/python3.6/site-packages/ray/worker.py", line 1513, in wait
    worker.current_task_id,
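
Since the trace bottoms out in ray.wait, it looks like the driver never gets results back from the remote rollout workers. To narrow that down, here is a plain-task smoke test that can be run from the head node (just a sketch using the same connection settings as the training script, not part of the example):

import socket
import time

import ray
from ray.exceptions import GetTimeoutError

# Connect to the running cluster the same way the training script does.
ray.init(address="auto", _redis_password="0000000000000000")

@ray.remote
def where_am_i():
    # Sleep briefly so the scheduler spreads tasks beyond the head node,
    # then report which host this task actually ran on.
    time.sleep(1)
    return socket.gethostname()

# More tasks than the head node has cores, so some should land on the workers.
refs = [where_am_i.remote() for _ in range(20)]
try:
    hosts = ray.get(refs, timeout=60)
    print("Tasks ran on:", sorted(set(hosts)))
except GetTimeoutError:
    print("Timed out -- results from remote workers never came back.")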

Here’s the full script to save you from opening the page:

import argparse
import os

import ray
from ray import tune
from ray.rllib.agents.trainer_template import build_trainer
from ray.rllib.evaluation.postprocessing import discount_cumsum
from ray.rllib.policy.tf_policy_template import build_tf_policy
from ray.rllib.utils.framework import try_import_tf

tf1, tf, tfv = try_import_tf()

parser = argparse.ArgumentParser()
parser.add_argument("--stop-iters", type=int, default=200)
parser.add_argument("--num-cpus", type=int, default=0)


def policy_gradient_loss(policy, model, dist_class, train_batch):
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    return -tf.reduce_mean(
        action_dist.logp(train_batch["actions"]) * train_batch["returns"])


def calculate_advantages(policy,
                         sample_batch,
                         other_agent_batches=None,
                         episode=None):
    sample_batch["returns"] = discount_cumsum(sample_batch["rewards"], 0.99)
    return sample_batch


MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=policy_gradient_loss,
    postprocess_fn=calculate_advantages,
)

MyTrainer = build_trainer(
    name="MyCustomTrainer",
    default_policy=MyTFPolicy,
)

if __name__ == "__main__":
    args = parser.parse_args()
    ray.init(address='auto', _redis_password='0000000000000000')
    tune.run(
        MyTrainer,
        stop={"training_iteration": args.stop_iters},
        config={
            "env": "CartPole-v0",
            # Use GPUs iff RLLIB_NUM_GPUS env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "num_workers": 5,
            "framework": "tf",
        })

When you run this driver on the head node:

import ray
ray.init(address='auto')
print(ray.nodes())

Does it show worker nodes?

Yes, I’ve got 3 nodes connected, and they show up in the records returned by ray.nodes():

[{'NodeID': '18490838782c70', 'Alive': True, 'NodeManagerAddress': '8.8.8.8', 'NodeManagerHostname': 'head_node', 'NodeManagerPort': 52159, 'ObjectManagerPort': 40277, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/raylet', 'MetricsExportPort': 53464, 'alive': True, 'Resources': {'object_store_memory': 30.0, 'node:8.8.8.8': 1.0, 'memory': 89.0, 'CPU': 4.0}}, 
{'NodeID': '17a0e68d826f7', 'Alive': True, 'NodeManagerAddress': '4.4.4.4', 'NodeManagerHostname': 'worker1', 'NodeManagerPort': 65239, 'ObjectManagerPort': 34027, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/raylet', 'MetricsExportPort': 62493, 'alive': True, 'Resources': {'object_store_memory': 51.0, 'memory': 174.0, 'CPU': 2.0, 'node:172.28.0.2': 1.0}}, 
{'NodeID': '8718f31416623', 'Alive': True, 'NodeManagerAddress': '192.168.0.14', 'NodeManagerHostname': 'worker2', 'NodeManagerPort': 55809, 'ObjectManagerPort': 40157, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-04_18-05-21_391764_1834/sockets/raylet', 'MetricsExportPort': 40028, 'alive': True, 'Resources': {'memory': 219.0, 'accelerator_type:GTX': 1.0, 'node:192.168.0.14': 1.0, 'GPU': 4.0, 'CPU': 8.0, 'object_store_memory': 64.0}}]
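
For completeness, the same driver connection can also confirm that the scheduler sees all of those cores (a minimal sketch, using the same connection settings as the training script):

import ray

ray.init(address="auto", _redis_password="0000000000000000")

# Total resources registered across every node in the cluster.
print(ray.cluster_resources())

# Resources that are currently free; CPUs held by stuck workers
# would be missing from this view.
print(ray.available_resources())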

Hmm maybe there’s an issue with the script in this case? cc @sven1977

I’m very thankful for the responses :pray:
And I’d be happy to submit a PR adding what we find to the docs on manual cluster setup.

Another user in the Slack channel also said they could never get a manual cluster working.

Solved, thanks to the help of a friend: the problem was ports not being open. A note to new users testing their cluster: first test with machines that have all the necessary ports open.
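
For anyone else hitting this, here is the kind of quick reachability check that would have saved me a day (a rough sketch: fill in the real head node IP and the ports your nodes report, e.g. the Redis port 6379 plus the NodeManagerPort/ObjectManagerPort values from ray.nodes(), and run it from every machine against the others, since Ray needs connectivity in both directions):

import socket

HEAD_IP = "8.8.8.8"        # obfuscated head node address, as above
PORTS_TO_CHECK = [6379]    # add the node manager / object manager ports from ray.nodes()

for port in PORTS_TO_CHECK:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)
    try:
        sock.connect((HEAD_IP, port))
        print("port", port, "is reachable")
    except OSError:
        print("port", port, "is blocked or closed")
    finally:
        sock.close()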

I’d also recommend that the docs tell people to check the following file for clues about cluster issues: /tmp/ray/session_latest/logs/raylet.out
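
Something as simple as this, run on each node, surfaces the relevant errors (just a convenience sketch; tailing the same path in a shell works too):

from pathlib import Path

# Print the last 50 lines of this node's raylet log.
log_path = Path("/tmp/ray/session_latest/logs/raylet.out")
print("\n".join(log_path.read_text().splitlines()[-50:]))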


We have a port specification here: Configuring Ray — Ray v1.1.0, but I think its visibility was poor enough that it didn’t help you. What sort of document would have helped you resolve this issue on your own? I’d love to hear your recommendations.