Hello! I'm training a PPO model to control traffic lights to optimize traffic flow. My network is set up and everything runs, but I cannot figure out how to configure the cluster so that every node actually does some of the work.
I'm having real trouble finding resources on this.
So I have started the head node:
ray start --head --port=6379 --num-cpus=14 --num-gpus=1 --dashboard-host=10.10.20.49
started my worker nodes:
ray start --address="10.10.20.49:6379" --num-cpus=14 --num-gpus=1
and then submitted the job:
ray job submit --address="http://10.10.20.49:8265" --runtime-env-json='{"working_dir": "/traffic"}' -- python cluster2.py
I can see everything running, but it only seems to use resources from the head node where I submitted the job.
I can see:
Logical resource usage: 7.0/56 CPUs, 1.0/4 GPUs (0.0/4.0 accelerator_type:G)
but that's only using the resources of my head node.
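If it matters, the worker nodes do seem to have joined the cluster — the 56 CPUs / 4 GPUs totals in that line cover the whole cluster, not just the head. This is roughly how I've been double-checking it from a Python shell on the head node (just Ray's built-in introspection calls, nothing from my project):

import ray

ray.init(address="auto")

# Totals aggregated over every node that has joined the cluster.
print(ray.cluster_resources())   # should show the cluster totals (56 CPUs / 4 GPUs in my case)

# Per-node breakdown: one entry per machine with its address and resources.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])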
In my Python file I have defined:
ray.init(address="auto")

config = training_config()

tuner = Tuner(
    "PPO",
    param_space=config,
    run_config=RunConfig(
        storage_path=storage_path,
        stop={"training_iteration": 800},
        checkpoint_config=CheckpointConfig(
            checkpoint_frequency=3,
            checkpoint_at_end=True,
        ),
        callbacks=[
            WandbLoggerCallback(
                project="Traffic", api_key="f16cce37f90a1ffe6dc3741f0f86df10cc04baed", log_config=True
            )
        ],
    ),
)
def training_config():
    ModelCatalog.register_custom_model("my_traffic_model", TrafficLightModel)

    config = (
        PPOConfig()
        .api_stack(
            enable_rl_module_and_learner=False,
            enable_env_runner_and_connector_v2=False,
        )
        .environment(
            env=TrafficLightEnv,
            env_config={
                "num_lights": 5,
                "sumo_cfg": r"sharedDir/osm.sumocfg",
                "max_steps": 18000,
            },
        )
        .framework("torch")
        .resources(num_gpus=1)
        .env_runners(
            num_env_runners=6,
            num_envs_per_env_runner=1,
            rollout_fragment_length="auto",
            sample_timeout_s=320,
        )
        .training(
            train_batch_size=16000,
            num_epochs=16,
            gamma=0.99,
            lr=1e-4,
            lambda_=0.95,
            clip_param=0.2,
            vf_clip_param=20.0,
            entropy_coeff=0.01,
            kl_coeff=0.1,
            grad_clip=1.0,
        )
        .reporting(
            metrics_num_episodes_for_smoothing=100,
            keep_per_episode_custom_metrics=True,
            min_sample_timesteps_per_iteration=1000,
            min_time_s_per_iteration=1,
            metrics_episode_collection_timeout_s=120,
        )
        .debugging(
            logger_config={
                "type": "ray.tune.logger.TBXLogger",
                "logdir": "./logs",
                "log_metrics_tables": True,
            },
        )
    )

    config.training(
        model={
            "custom_model": "my_traffic_model",
        }
    )

    return config
From what I have gathered, .resources() and .env_runners() are "local" settings, i.e. they describe what each node is going to run.
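If I'm reading the status line right, the 7 CPUs / 1 GPU in use also match what a single PPO trial would request with my settings: one CPU for the trial's local driver/learner process (which also gets the GPU) plus one CPU per env runner. A back-of-the-envelope sketch of my understanding (the 1-CPU-per-runner and 1-CPU-driver defaults are my assumption, not something I've confirmed):

num_env_runners = 6        # from .env_runners(num_env_runners=6)
cpus_per_env_runner = 1    # assumed RLlib default
driver_cpus = 1            # the trial's local driver / learner process
learner_gpus = 1           # from .resources(num_gpus=1)

trial_cpus = driver_cpus + num_env_runners * cpus_per_env_runner  # = 7
trial_gpus = learner_gpus                                         # = 1
# ...which lines up with the "7.0/56 CPUs, 1.0/4 GPUs" I see in the output.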
This runs fine, but it only ever uses the resources of the head node where the job was submitted.
I have searched all over and found mentions of autoscaling, @ray.remote, and placement groups, but I cannot make sense of them. What should I do here? How can I train the model across the whole cluster?
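For example, is the idea simply to request more env runners (and therefore more CPUs) than any single node can provide, so Ray has no choice but to place some of them on the worker nodes? Something like the sketch below is my best guess — I'm assuming num_cpus_per_env_runner is the right knob and that one CPU gets reserved per runner, but I haven't confirmed either:

# Reuse the existing config, just push the env-runner count past one node's capacity.
config = training_config()
config.env_runners(
    num_env_runners=40,           # more runners than one 14-CPU node can hold
    num_cpus_per_env_runner=1,    # assumption: 1 CPU reserved per runner
)

Or do I actually need placement groups / @ray.remote to get the work spread across nodes?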