Hello,
I have a question about using global variables in Ray remote tasks. Consider the following situation: I need to launch an external process to do some calculations and want to parallelize this with Ray. To improve performance, it makes sense to keep the process attached to a specific Ray worker rather than restarting it for every task. For example, using Excel via COM on Windows:
import ray
import win32com.client as win32
def f(x):
    global excel
    if 'excel' not in globals():
        excel = win32.Dispatch('Excel.Application')
        print('Launched Excel')
    else:
        print('Reusing Excel')
    return excel.Evaluate(str(x) + '*2')
ray.init(num_cpus=1)
fRemote = ray.remote(f)
futures = [fRemote.remote(x) for x in range(5)]
print(ray.get(futures))
In this example I use a global variable to store a reference to the Excel COM instance, which avoids restarting Excel on every function call. This works as expected and produces the following output:
(f pid=63264) Launched Excel
(f pid=63264) Reusing Excel
(f pid=63264) Reusing Excel
(f pid=63264) Reusing Excel
[0.0, 2.0, 4.0, 6.0, 8.0]
(f pid=63264) Reusing Excel
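For readers without Excel or Windows, the caching pattern in f above can be sketched without Ray or COM. This is only an illustration of the lazy global initialization: `FakeEngine` and the global name `engine` are hypothetical stand-ins for the Excel COM object and are not part of my actual setup.

```python
class FakeEngine:
    """Hypothetical stand-in for an expensive external process such as Excel."""

    launches = 0  # counts how many times the "process" was started

    def __init__(self):
        FakeEngine.launches += 1

    def evaluate(self, expression):
        # Only handles simple 'a*b' expressions like the ones used in this post.
        left, right = expression.split('*')
        return float(left) * float(right)


def f(x):
    global engine
    # Create the engine on the first call only; later calls reuse it.
    if 'engine' not in globals():
        engine = FakeEngine()
        print('Launched engine')
    else:
        print('Reusing engine')
    return engine.evaluate(str(x) + '*2')


results = [f(x) for x in range(5)]
print(results)
```

Run serially in a single process, the engine is created once and then reused for the remaining calls, which is the behavior I want to get on each Ray worker.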
So far so good. But in my actual situation, the function to run in parallel is stored as an attribute on a class instance and is called through a wrapper function, as follows:
class MyProblem:
    def __init__(self):
        self.func = f

def evalFunc(problem, x):
    return problem.func(x)
ray.init(num_cpus=1)
evalFuncRemote = ray.remote(evalFunc)
problemRemote = ray.put(MyProblem())
futures = [evalFuncRemote.remote(problemRemote,x) for x in range(5)]
print(ray.get(futures))
I would expect the same behavior as before. However, the output is now:
[0.0, 2.0, 4.0, 6.0, 8.0]
(evalFunc pid=6832) Launched Excel
(evalFunc pid=6832) Launched Excel
(evalFunc pid=6832) Launched Excel
(evalFunc pid=6832) Launched Excel
(evalFunc pid=6832) Launched Excel
So Excel is started on every function call, which hurts performance significantly (especially with applications other than Excel whose startup time exceeds 10 s).
However, just running everything serially with
problem = MyProblem()
print([evalFunc(problem, x) for x in range(5)])
results in the expected behavior again:
Launched Excel
Reusing Excel
Reusing Excel
Reusing Excel
Reusing Excel
[0.0, 2.0, 4.0, 6.0, 8.0]
Could anybody advise on how to keep the Excel instance alive across several tasks on the same worker in this example? FYI: I’m using Windows 10 with Python 3.11.4 and Ray 2.6.3.
Thank you very much!
Dominik