Ray Train can't run on Kaggle

As the picture above shows, when I try to run Ray Train on Kaggle I get a startup error with the message: “ERROR services.py:1330 – Failed to start the dashboard, return code -11.” The message also tells me to check ‘dashboard.log’ or ‘dashboard.err’ for further details. Am I missing a step in my setup, or is Ray incompatible with Kaggle? How can I disable the Ray dashboard to bypass this issue, and make sure my Ray application starts on Kaggle without errors like this?

btw, the log is:

(just 2 rows, and there’s nothing in the err logs)


Is it failing at the ray.init() line?

Hello @matthewdeng, I have the same issue when trying to use Ray on Kaggle. To answer your question: yes, it is failing at the ray.init() line. Even if you pass include_dashboard=False to ray.init(), it still tries to start the dashboard and throws the same error mentioned by @man_Iron. The error also crashes the notebook, so you have to re-run every cell you had executed before (i.e. all variables are lost). Finally, note that the error is only thrown when a GPU accelerator (P-100 or T4 x2) is added to the session. In a pure CPU session, ray.init() does not throw any error.

To reproduce, log in to a Kaggle account, add a GPU accelerator to your session, and run the following 2 lines:

import ray
ray.init()

You can even add the include_dashboard=False option and it will still throw the error mentioned by @man_Iron.
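
Since the failure takes the whole notebook down, one way to reproduce it without losing session state may be to run the two lines in a child process and look at its exit status. The sketch below is only an illustration of that idea: `run_probe`, `describe_returncode` and `RAY_PROBE` are hypothetical names, not part of Ray, and I have not confirmed that this avoids the crash on Kaggle. It also shows how to read a code like -11: Python's `subprocess` reports a process killed by signal N as return code -N, and signal 11 is SIGSEGV on Linux.

```python
import signal
import subprocess
import sys

# The two lines from the post, with the dashboard option that was tried.
# Hypothetical constant; pass it to run_probe() in a session where Ray is installed.
RAY_PROBE = "import ray; ray.init(include_dashboard=False); ray.shutdown()"


def run_probe(code, timeout=300):
    """Run `code` in a fresh Python process and return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def describe_returncode(returncode):
    """Translate a subprocess return code into a short description."""
    if returncode == 0:
        return "exited cleanly"
    if returncode < 0:
        # subprocess reports death by signal N as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"
```

In a GPU session, `rc, err = run_probe(RAY_PROBE)` followed by `print(describe_returncode(rc))` and `print(err)` should surface the dashboard error in `err` while the notebook kernel itself keeps its variables, assuming the crash stays confined to the child process.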

When I run those 2 lines I get the same error as @man_Iron, except that today, the first time I connected the P-100, I got the following, more informative error. I could not reproduce it a second time.

Note: Changing the environment to the latest one does not change anything.

Thank you in advance for your help :sweat_smile:


Looks like the issue is with the grpcio package. I solved it by running:

!pip install grpcio==1.62.2
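
To check whether the pin actually took effect in the session, something like the following could be run in a cell before calling ray.init(). This is a minimal sketch using only the standard library; `installed_version`, `matches_pin` and `PINNED_GRPCIO` are hypothetical names, and the only version known to work from this thread is 1.62.2.

```python
from importlib import metadata

PINNED_GRPCIO = "1.62.2"  # version reported to work in this thread


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned=PINNED_GRPCIO):
    """True only when the installed version equals the pinned one exactly."""
    return installed is not None and installed == pinned


current = installed_version("grpcio")
print(f"grpcio installed: {current!r}, matches pin: {matches_pin(current)}")
```

If grpcio had already been imported in the session before the pin was installed, restarting the kernel is probably needed for the new version to be picked up; that is an assumption on my part, not something confirmed in this thread.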

I tried installing grpcio, and while I'm no longer getting the same dashboard failure as @man_Iron, I have been stuck at this for a while now:

2024-05-15 15:32:19,943	INFO worker.py:1540 -- Connecting to existing Ray cluster at address: