How severe does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I’m writing an ML application in which I want to perform a hyperparameter optimization over a collection of models that may alternate between PyTorch and TensorFlow. TensorFlow unfortunately does not have a mechanism for freeing its GPU memory (that I know of; if there is one I’ve missed, please tell me! It would make my life much easier), so I work around this by running the training in a separately launched process and letting the GPU memory be released when that process exits.
Now, the issue is that I can’t use tune.report within the launched process. Tune looks for a _session object internally, and when it doesn’t find one it knows the call is coming from outside the original trial process. This is likely because I launch the process using the ‘spawn’ method instead of ‘fork’, since I’m interested in having this work on Windows as well (the ‘fork’ method isn’t available on Windows).
What is the proper way to do this reporting from another ‘spawned’ process? Can I pass a tune reporter as an argument to the spawned process?
Are you performing communication between the spawned process and the process that’s launching it?
Another idea I had is to execute the process as a Ray Task! Here’s an example of how you can ensure that the GPU memory is cleared after executing the task.
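A minimal sketch of that idea (the training body is a placeholder; `num_gpus` and `max_calls` are standard `ray.remote` options, and `max_calls=1` is the documented way to have the worker process exit, and therefore release GPU memory, after a single task):

```python
import ray

ray.init()

# num_gpus=1 reserves a GPU for the task; max_calls=1 tells Ray to shut the
# worker process down after one invocation, so any GPU memory TensorFlow
# holds onto is released when that process exits.
@ray.remote(num_gpus=1, max_calls=1)
def train_one_model(config):
    # ... build and fit the TensorFlow (or PyTorch) model here (placeholder) ...
    loss = 0.123
    # Return plain Python data, not framework objects, so nothing pins the GPU.
    return {"loss": loss}

result = ray.get(train_one_model.remote({"lr": 1e-3}))
print(result)
```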
Yes, when launching the process, I create a queue which is responsible for sending results back to the primary thread.
I have considered writing another queue to send the ‘report’ results in a similar way; however, it’s a bit of work and would tie parts of my library to Ray Tune, which I find undesirable. (I’m interested in my library being general purpose.)
Can I wrap a trainable function with ray.remote and max_calls=1 and have tune properly execute it? I’m probably going to just give it a try and see what happens.
I tried decorating my function trainable with @ray.remote(max_calls=1) as suggested.
I needed to pass the trainable function as trainable_function.remote, as it doesn’t have a plain call method. This worked; however, at the end of the first trial I got the error:
ValueError: Invalid return or yield value. Either return/yield a single number or a dictionary object in your trainable function.
I think this is because the result of a task is an object reference, from which we still need to fetch the actual value (e.g. with ray.get()).
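One way to avoid both the GPU leak and the ObjectRef error (a sketch, not tested here; run_training is a made-up name, and the legacy tune.run/tune.report API used elsewhere in this thread is assumed) is to keep the trainable a plain function and launch the GPU work as a nested task:

```python
import ray
from ray import tune

# GPU work runs in a task; max_calls=1 makes the worker exit afterwards,
# releasing whatever GPU memory the framework allocated.
@ray.remote(num_gpus=1, max_calls=1)
def run_training(config):
    # ... framework-specific training (placeholder) ...
    return {"loss": config["lr"] * 2}

# The trainable itself stays a plain function, so Tune's session machinery
# (the internal _session lookup) still works and tune.report() is valid here.
def trainable_function(config):
    metrics = ray.get(run_training.remote(config))  # fetch the real dict
    tune.report(**metrics)

tune.run(trainable_function, config={"lr": tune.grid_search([1e-3, 1e-2])})
```

Note that the trial itself reserves no GPU here; only the inner task does, so the GPU is freed as soon as that worker exits.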
Hey just to clarify - is the trainable_function the “primary thread” that’s launching the other processes? If so, you should be able to read from the queue in trainable_function and pass the results read from the queue to tune.report().
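As a rough sketch of that suggestion (gpu_worker and the metric values are placeholders; the ‘spawn’ start method matches what the original post uses):

```python
import multiprocessing as mp
from ray import tune

def gpu_worker(config, queue):
    # Runs in a fresh 'spawn' process; GPU memory is released when it exits.
    for epoch in range(config["epochs"]):
        loss = 1.0 / (epoch + 1)  # placeholder metric
        queue.put({"epoch": epoch, "loss": loss})
    queue.put(None)  # sentinel: training finished

def trainable_function(config):
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=gpu_worker, args=(config, queue))
    proc.start()
    while True:
        item = queue.get()
        if item is None:
            break
        tune.report(**item)  # forward the worker's metrics to Tune
    proc.join()
```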
Yes it’s the ‘primary thread’ which launches the other process.
Yeah, I may wind up doing that in the end, but again, it’s a lot of work to ensure it works correctly, and it would create a permanent dependency on ray.tune for my library.
I would really prefer if there was a way to do the GPU stuff within the trainable_function, and then have ray launch a new process for the next sample from the search space. The way it works now, it seems like ray launches a single process for each ‘worker’ and those processes persist until the search finishes.
I would really prefer if there was a way to do the GPU stuff within the trainable_function, and then have ray launch a new process for the next sample from the search space.
Actually, I believe this should work. Tune will create a remote Actor process for each Trial, and it will run the trainable_function within. At the end of the Trial, the process will be terminated and the GPU should be cleaned up.
The way it works now, it seems like ray launches a single process for each ‘worker’ and those processes persist until the search finishes.
Do you have a reproduction for this? This sounds possible if you’re running the GPU training directly in Ray Tasks (GPU Support — Ray 2.8.0), but should not be the case for Actors (which Tune uses).
@matthewdeng How are ray actors spawned? Do actors share python objects?
The way I’m managing GPU utilization is with a ‘context’ object that is set at the module level. What’s happening is that I set this object within the trainable function, but for the second (and subsequent) trials the module-level Python object for that context is already defined, and my library thinks a context is already set. For now, I rely on only acquiring a ‘context’ when I know no other method will need a different context for the duration of the process’s execution. I typically enforce this by spawning a new process to run the code that acquires the context it needs. That way, when the process exits, the original process never has the module-level context object defined, and I can go acquire a new context if needed.
What this implies to me is that Python object state is somehow persisting between trials, which is why I’m asking these questions about how actors and tasks work.
Ah, if you directly pass this context object into the definition of the Trainable, then it will actually serialize the context object and deserialize it in the Actor process that’s running the Trainable. Mutating the context will not be reflected in the Trainer’s copy.
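To illustrate the serialization point with a toy example (Context and TrialActor are made-up names for this sketch, not Tune internals):

```python
import ray

ray.init()

class Context:
    def __init__(self):
        self.device = None

@ray.remote
class TrialActor:
    def __init__(self, ctx):
        self.ctx = ctx  # a deserialized copy of the driver's object

    def acquire(self):
        self.ctx.device = "gpu:0"
        return self.ctx.device

driver_ctx = Context()
actor = TrialActor.remote(driver_ctx)
print(ray.get(actor.acquire.remote()))  # "gpu:0" inside the actor
print(driver_ctx.device)                # still None in the driver process
```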
For folks coming across this thread: I was launching Ray with the local_mode=True option, which apparently changes how Ray works in a number of important ways. When I removed local_mode, Ray now launches separate processes as I expect.
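For anyone who wants to check this in their own script, the difference is just the init call (a minimal sketch):

```python
import ray

# local_mode=True runs tasks and actors serially inside the driver process,
# so module-level state (like my context object) leaks across trials.
# ray.init(local_mode=True)   # debugging only

# Default mode: actors get their own worker processes, one per trial.
ray.init()
```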
Thanks for the update @krafczyk - just for completeness’ sake, the local_mode option is used almost exclusively for testing purposes, and even then it doesn’t implement the full Ray API. It’s not intended for any kind of actual workload. See also Starting Ray — Ray 2.8.0:
This feature is maintained solely to help with debugging, so it’s possible you may encounter some issues. If you do, please file an issue.