I am using lightgbm_ray, where the data is first loaded as a Modin dataframe and then passed to the Tune trainer function via tune.with_parameters(). Inside the trainer function, I create the RayDMatrix from the Modin dataframes and call the train function of lightgbm_ray. But the program waits forever with the following logs:
2021-10-25 22:43:59,408 WARNING worker.py:1227 -- The actor or task with ID 932bbc81a44209166d961627d09003555cc3ad61858d3ca8 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_ae420a08deb761bb044abcc83885c605: 1.000000}
Available resources on this node: {1.000000/4.000000 CPU, 146800639.990234 GiB/146800639.990234 GiB memory, 61833359.960938 GiB/61833359.960938 GiB object_store_memory, 0.000000/1.000000 CPU_group_0_ae420a08deb761bb044abcc83885c605, 1.000000/1.000000 node:172.30.11.204, 0.000000/3.000000 CPU_group_ae420a08deb761bb044abcc83885c605, 2000.000000/2000.000000 bundle_group_ae420a08deb761bb044abcc83885c605, 2.000000/2.000000 CPU_group_1_ae420a08deb761bb044abcc83885c605, 1000.000000/1000.000000 bundle_group_1_ae420a08deb761bb044abcc83885c605, 1000.000000/1000.000000 bundle_group_0_ae420a08deb761bb044abcc83885c605}
In total there are 4 pending tasks and 0 pending actors on this node.
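For what it is worth, this is how I read the resource request behind the warning; the bundle layout in the comments is my assumption, inferred from the CPU_group_*/bundle_group_* names above, and not something I have verified against the lightgbm_ray source:
from lightgbm_ray import RayParams

# Assumed layout of the per-trial placement group:
#   bundle 0: 1 CPU for the Tune trial / lightgbm_ray driver
#   bundles 1..num_actors: cpus_per_actor CPUs each for the training actors
ray_params = RayParams(num_actors=4, cpus_per_actor=2, gpus_per_actor=0)
print(ray_params.get_tune_resources())
# I expect this to reserve 1 + 4 * 2 = 9 CPUs per trial, which matches the
# "Resources requested: 9.0/12 CPUs" line in the Tune status below.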
Moreover, the following status block gets printed repeatedly:
== Status ==
Memory usage on this node: 11.4/238.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 9.0/12 CPUs, 0/0 GPUs, 0.0/8.4 GiB heap, 0.0/3.54 GiB objects
Result logdir: /data/lgbm_distributed_ray/repo/learning_ray_lightgbm/adidas_hpo_modin/output_modin/target_8/training_function_2021-10-25_22-43-24
Number of trials: 1/1 (1 RUNNING)
+----------------------------+----------+-------+-----------------+--------------------+--------------+
| Trial name | status | loc | learning_rate | min_data_in_leaf | num_leaves |
|----------------------------+----------+-------+-----------------+--------------------+--------------|
| training_function_a3033cc0 | RUNNING | | 0.0869468 | 20 | 255 |
+----------------------------+----------+-------+-----------------+--------------------+--------------+
Here is a snapshot of my code:
# Imports (shown here for completeness)
import os
from pprint import pprint

import modin.pandas as pd
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch
from lightgbm_ray import RayDMatrix, RayParams, train

algo = HyperOptSearch(
    metric="validation-l2",
    mode="min",
)
algo = ConcurrencyLimiter(algo, max_concurrent=1)
# Ray LightGBM params
ray_params = RayParams(
    max_actor_restarts=1,
    gpus_per_actor=0,
    cpus_per_actor=2,
    num_actors=4)
# Load data and pass through tune.with_parameters
train_path = os.path.join(data_dir, 'data_train')  # data_dir has a bunch of partitioned .snappy.parquet files
print('-'*10, 'Loading data with MODIN', '-'*10)
train_dataset = pd.read_parquet(train_path, columns=COLUMNS) # COLUMNS is defined properly
print('-'*10, 'Modin dataframes', '-'*10)
pprint(train_dataset)
print('-'*10, 'Running Ray Tune', '-'*10)
analysis = tune.run(
    tune.with_parameters(training_function, data=(ray_params, train_dataset)),
    # Use the `get_tune_resources` helper function to set the resources.
    resources_per_trial=ray_params.get_tune_resources(),
    config={
        # LightGBM parameters (abbreviated; min_data_in_leaf and num_leaves are tuned as well)
        "learning_rate": tune.uniform(0.01, 0.2),
    },
    search_alg=algo,
    num_samples=1,
    metric="validation-l2",
    mode="min",
)
print('-'*10, 'Done running Ray Tune', '-'*10)
# Inside the trainer
def training_function(config, data=None):
    ray_params, train_dataset = data
    print('-'*10, 'Creating RayDMatrix for training.', '-'*10)
    dtrain = RayDMatrix(
        train_dataset,
        label="y",  # Will select this column as the label
        distributed=True)
    ...
    evals_result = {}
    bst = train(
        lgb_params,
        dtrain,
        num_boost_round=100,
        evals_result=evals_result,
        ray_params=ray_params
    )
Notes:
- training_function runs smoothly when I do not use it with Tune (see the standalone sketch below).
- I am using Ray==1.7.0 in a Kubernetes cluster.
- Increasing the Ray cluster resources does not help.
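For reference, here is roughly how I call training_function outside of Tune; the hyperparameter values below are placeholders, not the exact ones from my runs:
# Standalone invocation (no Tune involved); placeholder hyperparameter values.
sample_config = {"learning_rate": 0.1, "min_data_in_leaf": 20, "num_leaves": 255}
training_function(sample_config, data=(ray_params, train_dataset))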