How severely does this issue affect your experience of using Ray?
- Medium: It contributes to a significant difficulty in completing my task, but I can work around it.
Description of the issue.
Hello, I’m currently trying to set up a training pipeline using Ray Data and Ray Train.
The setup is the following:
- The preprocessing part is developed using Torch and OnnxScript
- The model training part is done in Torch
- We use read_parquet followed by map_batches to run the preprocessing
- We use iter_torch_batches to feed the model training loop (a rough sketch of this layout follows below)
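For reference, here is roughly what the pipeline looks like; the path, column names, and batch size are placeholders, not our real values:

```python
import ray

# Rough sketch of the pipeline; path and batch size are placeholders.
ds = ray.data.read_parquet("s3://bucket/training-data/")

def preprocess(batch):
    # Torch + ONNX preprocessing goes here (detailed below).
    return batch

ds = ds.map_batches(preprocess)

# The preprocessed dataset then feeds the Torch training loop.
for batch in ds.iter_torch_batches(batch_size=1024):
    ...  # training step
```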
My preprocessing is done in two parts, one in Torch and one in ONNX. For the latter, I need to instantiate an onnxruntime session on the Ray workers. I first tried specifying a function in map_batches, but that uses Ray tasks and re-creates the ONNX session for every batch, making the computation slow (creating an onnxruntime session is costly).
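Here is roughly what that function-based (task) version looks like; the ONNX model path and column name are placeholders:

```python
import onnxruntime as ort

# Function-based version: map_batches runs this as Ray tasks, so the session
# is re-created for every batch ("preprocess.onnx" and "features" are placeholders).
def onnx_preprocess(batch):
    sess = ort.InferenceSession("preprocess.onnx")  # costly, done once per batch
    input_name = sess.get_inputs()[0].name
    batch["features"] = sess.run(None, {input_name: batch["features"]})[0]
    return batch

ds = ds.map_batches(onnx_preprocess)
```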
I then switched to specifying a class in map_batches in order to use Ray actors, but I can’t even match the performance of the Ray tasks. With this approach I have to specify the concurrency and the number of CPUs manually, and I can’t seem to find an adequate configuration.
Since the actor instantiates the session only once at startup, this approach should give better computation times (and the documentation recommends using Ray actors for stateful computations).
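And here is roughly the class-based (actor) version I tried; again the model path, column name, and the concurrency/num_cpus values are placeholders:

```python
import onnxruntime as ort

class OnnxPreprocessor:
    def __init__(self):
        # The session is created only once, when the actor starts.
        self.sess = ort.InferenceSession("preprocess.onnx")
        self.input_name = self.sess.get_inputs()[0].name

    def __call__(self, batch):
        batch["features"] = self.sess.run(None, {self.input_name: batch["features"]})[0]
        return batch

# Passing a class makes map_batches use actors, but concurrency and num_cpus
# must be chosen manually (the values below are just an example).
ds = ds.map_batches(
    OnnxPreprocessor,
    concurrency=8,
    num_cpus=1,
)
```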
Are there any guidelines on how to size Ray actors when using them in map_batches?
How are tasks sized when using them in map_batches? It looks like Ray automatically sizes the workers used for the tasks, and that works pretty well. I saw in the documentation that it starts one worker per available CPU, but trying that setup with actors didn’t work well.
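For completeness, this is approximately the actor sizing I tried in order to mimic the one-worker-per-CPU behaviour of tasks (a rough reconstruction; exact values depend on the cluster):

```python
import ray

# One actor per available CPU, to mimic how tasks seem to be scheduled.
num_cpus_total = int(ray.cluster_resources().get("CPU", 0))

ds = ds.map_batches(
    OnnxPreprocessor,
    concurrency=num_cpus_total,  # one actor per CPU in the cluster
    num_cpus=1,                  # each actor reserves one CPU
)
```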