Tune gets stuck while using modin and RayDMatrix in lightgbm_ray

I am using lightgbm_ray where the data is first loaded as a Modin dataframe and then passed to the Tune trainer function via tune.with_parameters(). Inside the trainer function, I create the RayDMatrix from the Modin dataframe and call the train function of lightgbm_ray. However, the program hangs forever with the following logs:

2021-10-25 22:43:59,408 WARNING worker.py:1227 -- The actor or task with ID 932bbc81a44209166d961627d09003555cc3ad61858d3ca8 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_ae420a08deb761bb044abcc83885c605: 1.000000}
Available resources on this node: {1.000000/4.000000 CPU, 146800639.990234 GiB/146800639.990234 GiB memory, 61833359.960938 GiB/61833359.960938 GiB object_store_memory, 0.000000/1.000000 CPU_group_0_ae420a08deb761bb044abcc83885c605, 1.000000/1.000000 node:172.30.11.204, 0.000000/3.000000 CPU_group_ae420a08deb761bb044abcc83885c605, 2000.000000/2000.000000 bundle_group_ae420a08deb761bb044abcc83885c605, 2.000000/2.000000 CPU_group_1_ae420a08deb761bb044abcc83885c605, 1000.000000/1000.000000 bundle_group_1_ae420a08deb761bb044abcc83885c605, 1000.000000/1000.000000 bundle_group_0_ae420a08deb761bb044abcc83885c605}
In total there are 4 pending tasks and 0 pending actors on this node.

Moreover, the following status output gets printed repeatedly:

== Status ==
Memory usage on this node: 11.4/238.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 9.0/12 CPUs, 0/0 GPUs, 0.0/8.4 GiB heap, 0.0/3.54 GiB objects
Result logdir: /data/lgbm_distributed_ray/repo/learning_ray_lightgbm/adidas_hpo_modin/output_modin/target_8/training_function_2021-10-25_22-43-24
Number of trials: 1/1 (1 RUNNING)
+----------------------------+----------+-------+-----------------+--------------------+--------------+
| Trial name                 | status   | loc   |   learning_rate |   min_data_in_leaf |   num_leaves |
|----------------------------+----------+-------+-----------------+--------------------+--------------|
| training_function_a3033cc0 | RUNNING  |       |       0.0869468 |                 20 |          255 |
+----------------------------+----------+-------+-----------------+--------------------+--------------+

Here is a snapshot of my code:

import os
from pprint import pprint

import modin.pandas as pd
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch
from lightgbm_ray import RayDMatrix, RayParams, train

algo = HyperOptSearch(metric="validation-l2", mode="min")
algo = ConcurrencyLimiter(algo, max_concurrent=1)

# Ray LightGBM params
ray_params = RayParams(
    max_actor_restarts=1,
    gpus_per_actor=0,
    cpus_per_actor=2,
    num_actors=4)

# Load data and pass through tune.with_parameters
train_path = os.path.join(data_dir, 'data_train') # data_dir holds a bunch of partitioned .snappy.parquet files
print('-'*10, 'Loading data with MODIN', '-'*10)
train_dataset = pd.read_parquet(train_path, columns=COLUMNS) # COLUMNS is defined properly
print('-'*10, 'Modin dataframes', '-'*10)
pprint(train_dataset)


print('-'*10, 'Running Ray Tune', '-'*10)
analysis = tune.run(
    tune.with_parameters(training_function, data=(ray_params, train_dataset)),
    # Use the `get_tune_resources` helper function to set the resources.
    resources_per_trial=ray_params.get_tune_resources(),
    config={
        # LightGBM parameters
        "learning_rate": tune.uniform(0.01, 0.2)},
    search_alg=algo,
    num_samples=1,
    metric="validation-l2",
    mode="min",
)
print('-'*10, 'Done running Ray Tune', '-'*10)


# inside trainer
def training_function(config, data=None):
    ray_params, train_dataset = data
    print('-'*10, 'Creating RayDMatrix for training.', '-'*10)
    dtrain = RayDMatrix(
        train_dataset,
        label="y",  # Will select this column as the label
        distributed=True)
    ...
    
    evals_result = {}
    bst = train(
        lgb_params,
        dtrain,
        num_boost_round=100,
        evals_result=evals_result,
        ray_params=ray_params,
    )

Notes:

  1. training_function runs smoothly when I do not use it with Tune.
  2. I am using Ray==1.7.0 on a Kubernetes cluster.
  3. Increasing the Ray cluster's resources does not help.

Hey @Arindam_Jati, this is a known issue with Modin and Ray. We should be able to get it fixed in the next release of XGBoost-Ray and lightgbm-ray. In the meantime, can you try the following workaround?

Instead of using resources_per_trial=ray_params.get_tune_resources() in tune.run, can you instead do:
from ray.tune.utils.placement_groups import PlacementGroupFactory, and then set resources_per_trial=PlacementGroupFactory([{"CPU": 1}] + [{"CPU": CPUS_PER_TRIAL + 1}] * NUM_ACTORS)?
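
Concretely, with the RayParams in your snippet (num_actors=4, cpus_per_actor=2), a rough sketch would look like the following, reusing training_function, ray_params, train_dataset and algo from your code; NUM_ACTORS and CPUS_PER_TRIAL are just placeholders for those two values:

from ray.tune.utils.placement_groups import PlacementGroupFactory

NUM_ACTORS = 4        # matches RayParams(num_actors=4)
CPUS_PER_TRIAL = 2    # matches RayParams(cpus_per_actor=2)

# One bundle for the trial driver plus one bundle per training actor, each with
# one spare CPU so the Ray tasks Modin submits inside the trial can still run.
analysis = tune.run(
    tune.with_parameters(training_function, data=(ray_params, train_dataset)),
    resources_per_trial=PlacementGroupFactory(
        [{"CPU": 1}] + [{"CPU": CPUS_PER_TRIAL + 1}] * NUM_ACTORS
    ),
    config={"learning_rate": tune.uniform(0.01, 0.2)},
    search_alg=algo,
    num_samples=1,
    metric="validation-l2",
    mode="min",
)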

@Yard1 Thanks for the quick reply. This works! What is the reason for the extra CPU? Could you point me to some documentation or a GitHub issue so that I can understand what is going on?

Hey @Arindam_Jati, this documentation page should help - https://docs.ray.io/en/master/tune/tutorials/overview.html#why-is-my-training-stuck-and-ray-reporting-[…]-pending-actor-or-tasks-cannot-be-scheduled
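
To sketch what is happening under the hood, here is a toy illustration (not your actual code), assuming Ray 1.7's placement group API; Trainer and child_task are just stand-ins for the lightgbm_ray training actors and the tasks Modin submits:

import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=2)

# A placement group whose single bundle reserves every CPU, roughly what a Tune
# trial does when resources_per_trial only covers the training actors.
pg = placement_group([{"CPU": 2}])
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def child_task():
    # Stand-in for the Ray tasks Modin submits when a dataframe is materialized.
    return "ok"

@ray.remote(num_cpus=2)
class Trainer:
    # Stand-in for the training actors that hold all CPUs of the bundle.
    def run(self):
        # Tune launches trainables with placement_group_capture_child_tasks=True
        # (see the FAQ above), so this task must also fit inside the placement
        # group. Every CPU in the group is already claimed by this actor, so the
        # task stays pending and Ray prints the "cannot be scheduled right now"
        # warning you saw.
        return ray.get(child_task.remote())

trainer = Trainer.options(
    placement_group=pg,
    placement_group_capture_child_tasks=True,
).remote()
# ray.get(trainer.run.remote())  # hangs with the fully-claimed 2-CPU bundle above

Adding the spare CPU per bundle (the + 1 in the PlacementGroupFactory above) leaves room inside the trial's placement group for those Modin tasks, which is why the workaround unblocks the run.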
