This Trainer runs the fit method of the given estimator in a non-distributed manner on a single Ray Actor.
I don’t think it can run distributed training; maybe that’s just a limitation of the framework.
You can wrap your training function to make it distributed using the DataParallelTrainer.
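Roughly, the pattern looks like this (a minimal sketch against the Ray 2.x API; the training-loop body is a placeholder for your own code):

```python
# Sketch: DataParallelTrainer runs the same function on N workers.
from ray.train.data_parallel_trainer import DataParallelTrainer
from ray.air.config import ScalingConfig

def train_loop_per_worker(config):
    # Your framework-specific training code goes here; each of the
    # num_workers workers executes this function.
    ...

trainer = DataParallelTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4),
)
result = trainer.fit()
```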
Is there a reason why you must use scikit-learn? XGBoost-Ray allows you to do distributed training. Why not use that option?
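For example, with Ray Train's XGBoostTrainer (a sketch only; the path, label column, and params are made up for illustration):

```python
# Sketch: distributed XGBoost training via Ray Train (Ray 2.x).
import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

train_ds = ray.data.read_parquet("gs://my-bucket/train/")  # hypothetical path

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=4),
    label_column="target",                    # hypothetical label column
    params={"objective": "binary:logistic"},  # standard XGBoost params
    datasets={"train": train_ds},
)
result = trainer.fit()
```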
I still have difficulties understanding the Ray object store. Since every node has its own object store, how does it help reduce data transfer? Does it only help between different processes within the same node? I read about serialisation between different nodes here, but it’s still not very clear to me. Maybe you could recommend some study resources?
Ray can take advantage of data locality. Worker processes running your tasks or training on the same Ray cluster node can access shared memory with zero-copy reads. That is, there is no inter-process data transfer on a node; everything is read from shared memory.
If a task or worker on node N needs data that is stored on node NN, the object is copied once from NN to N. Subsequently, any workers running on N have zero-copy access to it. This eliminates data transfer between processes on the same node, at the one-time cost of transferring the object between nodes.
That’s how distributed shared-memory works.
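You can see the single-node, zero-copy behaviour in a small sketch (assuming a reachable Ray cluster; the array size is arbitrary):

```python
# Sketch: the array is stored once in the node's object store and
# read zero-copy by tasks scheduled on the same node.
import numpy as np
import ray

ray.init()

big = np.zeros((10_000, 10_000))   # large read-only array
ref = ray.put(big)                 # one copy in the local object store

@ray.remote
def total(arr):
    # On the same node, `arr` is a read-only view backed by shared
    # memory; Ray does not copy it into the worker process.
    return arr.sum()

# Both tasks read the same shared-memory copy when scheduled here.
print(ray.get([total.remote(ref), total.remote(ref)]))
```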
And about option 2, is there a best format to use when exporting from BigQuery, for both Ray tasks and Ray Train?
Ray Data supports many formats. Parquet seems to be the most common, thanks to its compression, performance, and support for predicate pushdown.
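For instance, a sketch of reading a hypothetical Parquet export with column pruning and a pushed-down row filter (path and column names are made up):

```python
# Sketch: Ray Data reading Parquet; `columns` prunes columns and
# `filter` (a PyArrow expression) is pushed down to the scan.
import pyarrow.dataset as pds
import ray

ds = ray.data.read_parquet(
    "gs://my-bucket/bq-export/",       # e.g. a BigQuery EXPORT DATA target
    columns=["user_id", "amount"],     # read only these columns
    filter=pds.field("amount") > 0,    # skip non-matching rows at scan time
)
print(ds.schema())
```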
As I read here about pre-processing, it suggests only using Ray Data for last-mile pre-processing. Right now we have our ETL pipeline set up (basic transforms, cleansing, simple calculations), but not the pre-processing or the feature engineering.
Ray Data is not optimal for ETL or building data pipelines. You can do efficient last-mile transformations or common preprocessing using its support for preprocessors.
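A minimal sketch with one of the built-in preprocessors (the path and column names are hypothetical):

```python
# Sketch: last-mile scaling of numeric columns with a built-in
# Ray Data preprocessor.
import ray
from ray.data.preprocessors import StandardScaler

ds = ray.data.read_parquet("gs://my-bucket/bq-export/")
scaler = StandardScaler(columns=["amount", "balance"])
ds = scaler.fit_transform(ds)   # compute stats once, then transform
```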
Is it recommended to make the pre-processing and training two separate steps in Ray? Maybe the preprocessing is its own remote task and training a different one, since they can require different amounts of resources? Is there an end-to-end example going from (last-mile) preprocessing to training?
Here are some examples of Ray Data preprocessing.
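And here is a rough end-to-end sketch chaining a last-mile preprocessor into a distributed trainer (all paths, columns, and params are hypothetical). Note that Ray Data executes the transform in its own CPU tasks, while the trainer’s workers get their resources from the ScalingConfig, so the two stages scale independently:

```python
# Sketch: preprocess with Ray Data, then train with Ray Train (Ray 2.x).
import ray
from ray.data.preprocessors import StandardScaler
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

# Last-mile preprocessing on the dataset (runs as Ray Data tasks).
train_ds = ray.data.read_parquet("gs://my-bucket/train/")
scaler = StandardScaler(columns=["amount", "balance"])
train_ds = scaler.fit_transform(train_ds)

# Distributed training with its own, separately sized worker pool.
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=4),
    label_column="target",
    params={"objective": "binary:logistic"},
    datasets={"train": train_ds},
)
result = trainer.fit()
```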
I hope these resources give you ample fodder.
hth
Jules