Global variables to maintain a worker-specific state

kaktus018 · September 26, 2023, 9:07am

Hello,

I have a question regarding the use of global variables in Ray remote tasks. Consider the following situation: I need to launch an external process to do some calculations and want to parallelize it with Ray. To improve performance, it would be wise to keep this process attached to a specific Ray worker instead of restarting it for every task. For example, using Excel via COM on Windows:

import ray
import win32com.client as win32

def f(x):
    global excel
    if 'excel' not in globals():
        excel = win32.Dispatch('Excel.Application')
        print('Launched Excel')
    else:
        print('Reusing Excel')
    return excel.Evaluate(str(x) + '*2')

ray.init(num_cpus=1)
fRemote = ray.remote(f)
futures = [fRemote.remote(x) for x in range(5)]
print(ray.get(futures))

In this example I use a global variable to store a reference to the Excel COM instance and thus avoid restarting Excel in every function call. This works as expected and gives as output:

(f pid=63264) Launched Excel
(f pid=63264) Reusing Excel
(f pid=63264) Reusing Excel
(f pid=63264) Reusing Excel
[0.0, 2.0, 4.0, 6.0, 8.0]
(f pid=63264) Reusing Excel

So far so good. But I now have the situation that the function to execute in parallel is actually a class method which is called by a wrapper function in the following way:

class MyProblem:
    def __init__(self):
        self.func = f

def evalFunc(problem,x):
    return problem.func(x)

ray.init(num_cpus=1)
evalFuncRemote = ray.remote(evalFunc)
problemRemote = ray.put(MyProblem())
futures = [evalFuncRemote.remote(problemRemote,x) for x in range(5)]
print(ray.get(futures))

Now I would expect the same behavior as before. However, the output is now

[0.0, 2.0, 4.0, 6.0, 8.0]
(evalFunc pid=6832) Launched Excel
(evalFunc pid=6832) Launched Excel
(evalFunc pid=6832) Launched Excel
(evalFunc pid=6832) Launched Excel
(evalFunc pid=6832) Launched Excel

So Excel is started on each function call and this impacts performance significantly (especially when not using Excel but other applications with a start up time of > 10 s).
However, just running everything serially with

problem = MyProblem()
print([evalFunc(problem, x) for x in range(5)])

results in the expected behavior again:

Launched Excel
Reusing Excel
Reusing Excel
Reusing Excel
Reusing Excel
[0.0, 2.0, 4.0, 6.0, 8.0]

Could anybody give advice on how to maintain the instance of Excel in this example across several tasks on the same worker? FYI: I’m using Windows 10 with Python 3.11.4 and ray 2.6.3.

Thank you very much!
Dominik

sangcho · September 27, 2023, 12:11am

Using global variables with Ray is not recommended. For details, see "Anti-pattern: Using global variables to share state between tasks and actors" in the Ray documentation (Ray 3.0.0.dev0).

Instead, you can use a Ray actor if you’d like to keep state (such as the Excel instance):

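import ray
import win32com.client as win32

ray.init(num_cpus=1)

@ray.remote
class ProblemActor:
    def __init__(self):
        # Launch Excel once; the instance lives as long as the actor.
        self.excel = win32.Dispatch('Excel.Application')

    def f(self, x):
        return self.excel.Evaluate(str(x) + '*2')

NUM_ACTORS = 1
actors = [ProblemActor.remote() for _ in range(NUM_ACTORS)]
# Call the actor method with an example argument.
ray.get(actors[0].f.remote(3))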

kaktus018 · September 27, 2023, 6:50am

Hi sangcho,
thanks for your reply! I understand that actors are capable of this. However, as mentioned in https://docs.ray.io/en/latest/ray-core/actors.html, I would now have to manually create actors according to the number of CPUs available and also manually assign tasks to each actor, which is a burden, especially when the number of CPUs changes (e.g. when another node connects during runtime). That’s the beauty of tasks: workers are started automatically and I don’t have to worry about scheduling, I just have to create jobs. Or am I missing something here?
Also, the anti-pattern you mentioned refers to sharing state between processes. That’s not what I want to do. I want to preserve a state (variable) within a process across different tasks. I found this seven-year-old documentation of reusable variables: https://github.com/ray-project/ray-legacy/blob/master/doc/reusable-variables.md
That would be similar to what I need, but I don’t know if such a structure is still implemented in Ray (ray.Reusable does not exist in Ray Core).
Any other ideas?
Thanks!

kaktus018 · September 27, 2023, 8:21am

I tried out another way to achieve persistent variables in a worker process. Apart from global variables, variables in modules are also preserved across function calls in Python (unless the module is explicitly reloaded). So I created a simple module called “process_persistent” and placed it in the Lib folder of my Python installation. It simply contains a dict “data” to store arbitrary data and a cleanUp function to release the stored objects if needed:

data = {}

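def cleanUp():
    # Drop all stored references so the objects can be garbage-collected
    # ("del d" in a loop would only unbind the loop variable).
    data.clear()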
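
If the stored objects need explicit teardown rather than just being dereferenced, one option is to register the cleanup with `atexit`. The following is a minimal sketch under assumptions: `FakeResource` and its `close()` method are hypothetical stand-ins (an Excel COM object would presumably need something like `Quit()` instead), and `atexit` handlers only run on a normal interpreter exit, so whether they fire in a Ray worker depends on how that process is terminated.

```python
import atexit

# Module-level store, analogous to process_persistent.data.
data = {}

def clean_up():
    """Close every stored resource that offers close(), then empty the store."""
    for resource in list(data.values()):
        close = getattr(resource, "close", None)  # close() is an assumed interface
        if callable(close):
            close()
    data.clear()

# Run clean_up automatically on a normal interpreter shutdown.
atexit.register(clean_up)

class FakeResource:
    """Hypothetical stand-in for an expensive external process."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

res = FakeResource()
data["demo"] = res
clean_up()  # can also be called explicitly
print(res.closed, data)
```

Calling `clean_up()` a second time (as `atexit` will do at shutdown) is harmless here because the store is already empty.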

Now if I use the data variable of this module instead of a global variable, everything works fine with:

import ray
import win32com.client as win32

def f(x):
    import process_persistent as persistent
    if 'excel' not in persistent.data:
        persistent.data['excel'] = win32.Dispatch('Excel.Application')
        print('Launched Excel')
    else:
        print('Reusing Excel')
    return persistent.data['excel'].Evaluate(str(x) + '*2')
    
class MyProblem:
    def __init__(self):
        self.func = f

def evalFunc(problem,x):
    return problem.func(x)

ray.init(num_cpus=1)
evalFuncRemote = ray.remote(evalFunc)
problem = MyProblem()
problemRemote = ray.put(problem)
futures = [evalFuncRemote.remote(problemRemote,x) for x in range(5)]
print(ray.get(futures))

This gives the output:

[0.0, 2.0, 4.0, 6.0, 8.0]
(evalFunc pid=40496) Launched Excel
(evalFunc pid=40496) Reusing Excel 
(evalFunc pid=40496) Reusing Excel 
(evalFunc pid=40496) Reusing Excel 
(evalFunc pid=40496) Reusing Excel
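
A plausible explanation for the difference, offered here as an assumption rather than something confirmed in this thread: Ray serializes functions defined in the driver script by value, so each time the `MyProblem` object is deserialized for a task, `f` may be rebuilt with a fresh globals dict and the `excel` global is lost, whereas an imported module is cached in `sys.modules` for the lifetime of the worker process. The sketch below imitates this without Ray or Excel; `demo_persistent`, `rebuild`, and the string handle are hypothetical stand-ins.

```python
import builtins
import sys
import types

# Hypothetical in-memory module playing the role of process_persistent.
store = types.ModuleType("demo_persistent")
store.data = {}
sys.modules["demo_persistent"] = store

def with_global(x, log):
    global resource
    if "resource" not in globals():
        resource = "expensive handle"  # stand-in for win32.Dispatch(...)
        log.append("launched")
    else:
        log.append("reused")
    return x * 2

def with_module(x, log):
    import demo_persistent as persistent  # resolved from the sys.modules cache
    if "resource" not in persistent.data:
        persistent.data["resource"] = "expensive handle"
        log.append("launched")
    else:
        log.append("reused")
    return x * 2

def rebuild(func):
    """Copy func onto a brand-new globals dict, imitating a function
    that is deserialized by value for every task (an assumption)."""
    fresh_globals = {"__builtins__": builtins}
    return types.FunctionType(func.__code__, fresh_globals, func.__name__)

global_log, module_log = [], []
for x in range(3):
    rebuild(with_global)(x, global_log)
    rebuild(with_module)(x, module_log)

print(global_log)  # ['launched', 'launched', 'launched']
print(module_log)  # ['launched', 'reused', 'reused']
```

The global-based version loses its state with every rebuilt function, while the module-based version finds the same `data` dict each time, which matches the two Ray outputs shown in this thread.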

So the issue seems to be resolved for me. Any other ideas and comments are highly appreciated before I mark this as resolved.

Thanks again!
