Scalability of Ray w.r.t. the number of remote workers

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
[O] Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.

2. Environment:

  • Ray version: 2.42
  • Python version: 3.12
  • OS: Windows

3. What happened vs. what you expected:

  • Expected: linear scaling
  • Actual: non-linear scaling

I am testing whether Ray training scales with the number of env_runners in RLlib. Ultimately, I need to load a much heavier environment, but for now I am running a few tests on a lightweight environment, CartPole-v1. Here is my setup:

import ray
import ray.tune
from ray.rllib.algorithms.ppo import PPOConfig

num_env_runners = int(input("Enter number of env runners: "))
# step_size = int(input("Enter number of step_size: "))
train_batch_size = int(input("Enter number of train_batch_size: "))
num_iterations = int(input("Enter number of num_iterations: "))

storage_path = SOME_PATH

ppo_config = (
    PPOConfig()
    .environment(
        env="CartPole-v1",
    )
    .training(
        model={
            "fcnet_hiddens": [256, 256]
        },
        train_batch_size_per_learner=train_batch_size
    )
    .env_runners(
        num_env_runners=num_env_runners,
        batch_mode="truncate_episodes",
    )
    .learners(
    )
    .reporting(
        # min_sample_timesteps_per_iteration=step_size
    )
    .multi_agent(
        count_steps_by='agent_steps'
    )
    # TODO(@chungs4): Make experiment reproducible
    .debugging(
        seed=5
    )
)

config_to_dict = ppo_config.to_dict()

tuner = ray.tune.Tuner(
    "PPO",
    param_space=config_to_dict,
    run_config=ray.tune.RunConfig(
        storage_path=storage_path,
        # checkpoint_config=ray.tune.CheckpointConfig(checkpoint_frequency=3),
        stop={
            'training_iteration': num_iterations
        },
        verbose=1
    )
)

result = tuner.fit()
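
For reference, the total_time numbers below are just wall-clock time around tuner.fit(); this is a simplified sketch of how I time it, not the exact script:

import time

start = time.perf_counter()
result = tuner.fit()
total_time = time.perf_counter() - start
print(f"num_env_runners={num_env_runners} => total_time={total_time:.0f}s")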

I experimented with various values of num_env_runners and train_batch_size, but training time never seems to scale linearly with num_env_runners. Here is an experiment result with train_batch_size=10000 (I also tried values as low as 128, which is the smallest value train_batch_size can take); a quick speedup calculation follows the numbers:

num_env_runners=1 => total_time=340s
num_env_runners=2 => total_time=290s
num_env_runners=3 => total_time=227s
num_env_runners=4 => total_time=253s
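
This is the quick arithmetic I use to judge how far the scaling is from linear (speedup relative to the single-runner run above):

baseline = 340  # seconds with num_env_runners=1
for n, t in [(1, 340), (2, 290), (3, 227), (4, 253)]:
    speedup = baseline / t          # >1 means faster than the single-runner run
    efficiency = speedup / n        # 1.0 would be perfectly linear scaling
    print(f"num_env_runners={n}: speedup={speedup:.2f}x, efficiency={efficiency:.0%}")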

Since a heavier environment will be loaded in the future, I am planning to keep 1 env per runner for now; a sketch of the settings I am considering is below. Any idea which parameters I should look at changing? Thank you in advance.
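
For concreteness, these are the env_runners settings I am currently thinking about; the specific values here are placeholders, and I am not sure these are the right knobs:

ppo_config = ppo_config.env_runners(
    num_env_runners=num_env_runners,
    num_envs_per_env_runner=1,        # one env per runner, since the real env will be heavy
    rollout_fragment_length="auto",   # let RLlib split the train batch across runners
    num_cpus_per_env_runner=1,        # placeholder; would likely be raised for the heavier env
)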