Insanely Slow Start-Up Time

Hi,

I am trying to train a basic PyTorch CNN model to ‘denoise’ the MNIST dataset. However, I am noticing that most of the time my CPU is barely being used while RAM usage keeps fluctuating. I suspect it keeps ending processes to start new ones, and the time to create a new process is awfully slow.

Is there anything I can do to speed up that creation time?

My Code: https://pastebin.com/iEkkWrN4

P.S.
I do not get any errors running this other than missing CUDA drivers (I have an AMD graphics card, so that’s expected).

Are you seeing this issue with Ray 1.0.1? It is probably related to this: https://github.com/ray-project/ray/issues/12052, and it has been fixed in the nightly.

It sounds similar. How would I go about getting the nightly release properly? I had installed ray[tune] with ‘pip install’ … I’d appreciate it if you pointed me in the right direction to try out the fix!

You can follow the instructions here: https://docs.ray.io/en/latest/installation.html#latest-snapshots-nightlies

Please let me know if it doesn’t work!

So it didn’t work. However, I reached out to the Ray Slack community, and they got it fixed for me -> the command that worked for me was:
pip install --user -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-1.1.0.dev0-cp38-cp38-win_amd64.whl

Yes, this is how you install nightly. @sangcho, do you think we need to clarify the docs for how to install nightly?

So I had some time to actually try running Ray Tune with the nightly version, and it’s even worse now. It still takes a very long time to start, and roughly at the point when it would begin its first trial, the program just exits with code 3. (I’m running this through PyCharm, Python 3.8.5.)

However, simply running the program from cmd seemed to work, kind of. The main issue still persists: it spends most of its time seemingly preparing workers to run rather than actually running them…

I am still very much a beginner with this whole thing. Here are some logged messages that may help you understand why it’s taking so long:

If I’m somehow mismanaging the memory or something, please let me know how to speed it up! (And a possible fix for PyCharm not being able to run the code while cmd is fine…)

Hmm, I don’t see the difference between his command and the doc. Do you mean the nightly install instructions don’t work?

I see. It would be nice if you could create an issue on Ray’s GitHub page with more details. For example: what’s your setup? What’s your workload? How slow is your process startup? etc.

I have actually been in touch quite a bit with Richard Liaw over Slack, and we have started a different thread: OOM command not allowed when used memory > max memory

If those could be merged somehow, that’d be perfect, as they are related but not strictly about the same thing, so I don’t know if that’s a valid move. However, I can no longer respond in that specific thread for 15 hours due to being a new member, so we have put further discussion on hold.

So with Richard Liaw’s help we narrowed the issue down to:

tune.run(tune.with_parameters(train, data=[X_2, original]))

which was actually bugged: it passed my whole dataset along with the function through Redis, causing a massive slowdown. Instead, we had to refactor that out and use ray.put and ray.get to pass my dataset. So if anyone has this issue at the moment:

TL;DR

tune.with_parameters is bugged; do not use it. Use ray.put and ray.get to pass your dataset!
