I’m trying to run a hyperparameter search over a PyTorch Lightning model with Ray Tune, but none of the trials ever seem to start. The CLI reporter only ever shows the trials as PENDING; they never change to RUNNING. Its output stays the same the whole time, looking like this:
== Status ==
Memory usage on this node: 1.3/12.3 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/7.21 GiB heap, 0.0/3.61 GiB objects
Result logdir: /home/echols14/ray_results/tune_training_asha
Number of trials: 10/10 (10 PENDING)
+-------------------+----------+-------+-------------+--------------+-----------+-----------+
| Trial name | status | loc | lr | batch_size | dropout | d_model |
|-------------------+----------+-------+-------------+--------------+-----------+-----------|
| train_46f66_00000 | PENDING | | 0.000134998 | 256 | 0.155438 | 128 |
| train_46f66_00001 | PENDING | | 0.00143964 | 512 | 0.170543 | 128 |
| train_46f66_00002 | PENDING | | 9.20564e-05 | 512 | 0.325629 | 256 |
| train_46f66_00003 | PENDING | | 0.000815963 | 512 | 0.152794 | 128 |
| train_46f66_00004 | PENDING | | 3.95896e-05 | 512 | 0.0834858 | 512 |
| train_46f66_00005 | PENDING | | 0.00305493 | 64 | 0.0672461 | 128 |
| train_46f66_00006 | PENDING | | 3.34226e-05 | 512 | 0.116613 | 512 |
| train_46f66_00007 | PENDING | | 0.000262342 | 512 | 0.156197 | 512 |
| train_46f66_00008 | PENDING | | 3.96747e-05 | 256 | 0.142566 | 512 |
| train_46f66_00009 | PENDING | | 7.67698e-05 | 128 | 0.328324 | 128 |
+-------------------+----------+-------+-------------+--------------+-----------+-----------+
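One line that stands out to me is "Resources requested: 0/8 CPUs, 0/0 GPUs", which makes me wonder whether Ray sees a GPU at all. As a sanity check (a minimal sketch run in the same environment, assuming the same default ray.init() that tune.run would otherwise perform implicitly), this is what I can use to see what Ray and PyTorch each think is available:

import ray
import torch

ray.init(ignore_reinit_error=True)   # same kind of default init tune.run would do
print(ray.cluster_resources())       # total resources Ray detected, e.g. {'CPU': 8.0, 'GPU': ...}
print(ray.available_resources())     # what is currently free for new trials
print(torch.cuda.is_available(), torch.cuda.device_count())  # what PyTorch sees locally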
Here’s the function that kicks it all off:
def tune_hyperparams(config: Dict, num_samples=10, num_epochs=10, gpus_per_trial=1):
    """call ray.tune methods to find the best hyperparameters
    Parameters
    ----------
    config : Dict
        a dictionary of configuration variables to be passed into train_heatmap_predictor.
        values associated with certain keys will be written over by ray.tune for tuning
    num_samples : int, optional
        how many hyperparameter configurations to sample, by default 10
    num_epochs : int, optional
        the number of epochs to run each trial, by default 10
    gpus_per_trial : int, optional
        the number of GPUs each trial can use, by default 1
    """
    tune_config = {
        # "tune": True,  # triggers tuning things in the train method
        "batch_size": tune.choice([64, 128, 256, 512]),
        "lr": tune.loguniform(1e-5, 1e-2),
        "d_model": tune.choice([128, 256, 512]),
        "n_heads": tune.choice([4, 6, 8]),
        "dim_transf_ff": tune.choice([256, 512, 1024, 2048]),
        "n_transf_layers": tune.choice([4, 6, 8]),
        "dropout": tune.uniform(0.05, 0.35),
    }
    config.update(tune_config)  # any shared values will be overwritten by the tune value
    scheduler = ASHAScheduler(max_t=num_epochs, grace_period=1, reduction_factor=2)
    reporter = CLIReporter(
        parameter_columns=["lr", "batch_size", "dropout", "d_model"],
        metric_columns=["train_loss", "val_loss", "val_accuracy", "training_iteration"])
    analysis = tune.run(
        tune.with_parameters(train, num_epochs=num_epochs, num_gpus=gpus_per_trial, hyperparam_tuning=True),
        resources_per_trial={"cpu": 1, "gpu": gpus_per_trial},
        metric="accuracy",
        mode="max",
        # config=config,
        config=tune_config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter,
        name="tune_training_asha")
    print("Best hyperparameters found were: ", analysis.best_config)
tune.run uses the function below as its trainable, but since "IN THE TRAIN FUNCTION" never appears anywhere in the output, I don’t think it’s getting called at all.
def train(config: Dict, num_epochs=10, num_gpus=0, hyperparam_tuning=False):
    """this function can be called by a ray.tune hyperparameter search"""
    print("IN THE TRAIN FUNCTION")
    # general setup
    if not os.path.isdir(config["model_dir"]):
        os.mkdir(config["model_dir"])
    seed_everything(config["seed"], workers=True)
    # get data
    dataset_train = PaletteDataset(config["train_data"])
    dataloader_train = DataLoader(dataset_train, config["batch_size"], shuffle=True,
                                  num_workers=config["cpus"])
    dataset_val = PaletteDataset(config["val_data"])
    dataloader_val = DataLoader(dataset_val, config["batch_size"], shuffle=False,
                                num_workers=config["cpus"])
    # make the model
    model = PaletteToScalarPL(config)
    callback_list = list()
    # set up custom checkpointing
    if hyperparam_tuning:
        tune_callback = TuneReportCallback(
            metrics={
                "loss": "ptl/val_loss",
                "accuracy": "ptl/val_accuracy"
            },
            on="validation_end"
        )
        callback_list.append(tune_callback)
        tb_logger = TensorBoardLogger(save_dir=tune.get_trial_dir(), name="", version=".")
    else:
        tb_logger = TensorBoardLogger(save_dir=config["model_dir"], name="lightning_logs")
    checkpoint_callback = ModelCheckpoint(
        monitor="val_accuracy",
        dirpath=os.path.join(tb_logger.log_dir, "checkpoints"),
        filename="{epoch}-{val_accuracy:.2f}",
        save_top_k=1,
        mode="max",
    )
    callback_list.append(checkpoint_callback)
    # send outputs to where sagemaker expects them (or the provided dir)
    output_data_dir = config["output_data_dir"]
    # train it
    trainer = Trainer(max_epochs=num_epochs, gpus=num_gpus, deterministic=True,
                      logger=tb_logger, callbacks=callback_list,
                      default_root_dir=output_data_dir)
    trainer.fit(model, train_dataloader=dataloader_train, val_dataloaders=dataloader_val)
    if not hyperparam_tuning:
        with open(os.path.join(config["model_dir"], "final_p_to_scalar.pth"), "wb") as f:
            torch.save(model.p_to_scalar.state_dict(), f)
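To work out whether the problem is in my Lightning code or in Tune’s scheduling itself, the next thing I plan to try is a bare-bones trainable with no Lightning involved, keeping the same resources_per_trial (a minimal sketch; the dummy_train name and the random metric are just for illustration):

import random
from ray import tune

def dummy_train(config):
    # report a fake metric a few times so the scheduler has something to act on
    for _ in range(10):
        tune.report(accuracy=random.random())

tune.run(
    dummy_train,
    resources_per_trial={"cpu": 1, "gpu": 1},  # same request as the real run
    metric="accuracy",
    mode="max",
    num_samples=2,
)

If even this stays PENDING with "gpu": 1 but runs with "gpu": 0, that would point at the GPU request rather than at my train() function.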
Any ideas why no training jobs are starting?