What is the correct technique for incorporating access to large machine learning data sets in a Ray Tune Tuner() object?

I’m attempting to use Ray Tune to optimize the hyperparameters of a pytorch model which draws its training data from one of the standard data sets included with torchvision.

My first attempt was to create, from outside of the Tuner() object, a reference or handle object pointing to the data set, and then pass the handle in to the trainable function, where I intended it to be reformulated into a pytorch DataLoader() iterator. (For brevity here, I leave out all of the code that would normally be needed to define the model architecture and train a model instance, and instead I just report a random number as though it were the minimized loss value that would normally result at the end of the training schedule.)
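For context, the part I'm omitting would look roughly like the sketch below (the function name realtrain is just a hypothetical placeholder, and the model, loss, and optimizer details are elided):

from torch.utils.data import DataLoader

def realtrain(config, data_handle):
    # Wrap the dataset handle in a batched DataLoader, using the sampled
    # batch_size hyperparameter from the config dictionary
    loader = DataLoader(data_handle['train'], batch_size=config['batch_size'],
                        shuffle=True)
    final_loss = 0.0  # placeholder; in reality this comes from the training loop
    for X, y in loader:
        # ... forward pass, compute loss, backward pass, optimizer step ...
        pass
    # Report the final loss value at the end of the training schedule
    train.report({'loss': final_loss})

The stripped-down example that actually reproduces the problem is: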

from torchvision import datasets
from torchvision.transforms import ToTensor
from functools import partial
import ray
from ray import train, tune
import numpy as np

# Trivial Ray Tune practice example showing a failed attempt at passing
# data into a trainable function.
#
# Expected result: Tuner() object should take 10 random draws from the
# allowed set of hyperparameter values defined in the config dictionary,
# and report a random number for the training loss value associated with
# each.
# 
# Actual result: tuner.fit() throws an exception, apparently because passing
# in the data in this way results in the trainable function's memory footprint
# becoming too large.  (The expected result is obtained if I comment out
# one of the lines and replace it with a substitute that doesn't pass in any
# data.)

print('Ray version number {0}'.format(ray.__version__))

dthan = dict()
for splitid, trnopt in zip(['train', 'test'], [True, False]):
    # Within the pytorch framework, I think this creates some kind of initial
    # "handle" which can be used to facilitate further data access
    dthan[splitid] = datasets.FashionMNIST(root="data", train=trnopt,
        download=True, transform=ToTensor())

# Dummy "trainable" function, to be passed in to the Tuner(); just returns
# a random number rather than actually training anything.  (In a real
# hyperparameter tuning exercise, the data_handle would be used to create
# a batched pytorch DataLoader(), and we'd use this as an iterator to
# feed training data into the neural network model, ultimately resulting
# in some "loss" value that we report at the end of the training schedule.)
def dummytrain(config, data_handle):
    train.report({'loss': np.random.uniform()})

# Hyperparameter search space
config = {
    "lr": tune.loguniform(1e-7, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16, 32, 64, 128])
}

# Just accept the default options for a search algorithm and scheduler
tuner = tune.Tuner(
    # Create a wrapped version of dummytrain with one of the input parameters
    # (data_handle) already pre-defined, so that ray tune only needs to pass
    # in the config value for each finalized instance of this function
    partial(dummytrain, data_handle=dthan),
    # If you comment out the above line and uncomment this one, then ray tune
    # behaves as expected, reporting random numbers for the loss value
    # defined in the trainable function
    #partial(dummytrain, data_handle=None),
    # Take 10 draws from the search space
    tune_config=tune.TuneConfig(num_samples=10),
    # Pass in the search space
    param_space=config
)
# Attempt to tune the hyperparameters
tuner.fit()

This results in a cascade of exceptions, the second one apparently raised while attempting to handle the first:

Python 3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %run raytest.py
Ray version number 2.8.0
2023-12-01 21:58:20,424	INFO worker.py:1673 -- Started a local Ray instance.
2023-12-01 21:58:23,044	INFO tune.py:220 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2023-12-01 21:58:23,047	INFO tune.py:595 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
╭───────────────────────────────────────────────────────────────────╮
│ Configuration for experiment     dummytrain_2023-12-01_21-58-18   │
├───────────────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator            │
│ Scheduler                        FIFOScheduler                    │
│ Number of trials                 10                               │
╰───────────────────────────────────────────────────────────────────╯

View detailed results here: /Users/stachyra/ray_results/dummytrain_2023-12-01_21-58-18
To visualize your results with TensorBoard, run: `tensorboard --logdir /Users/stachyra/ray_results/dummytrain_2023-12-01_21-58-18`
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/tune.py:1007, in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, storage_path, storage_filesystem, search_alg, scheduler, checkpoint_config, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, reuse_actors, raise_on_failed_trial, callbacks, max_concurrent_trials, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, chdir_to_trial_dir, local_dir, _experiment_checkpoint_dir, _remote, _remote_string_queue, _entrypoint)
   1006 while not runner.is_finished() and not experiment_interrupted_event.is_set():
-> 1007     runner.step()
   1008     if has_verbosity(Verbosity.V1_EXPERIMENT):

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py:731, in TuneController.step(self)
    730 # Handle one event
--> 731 if not self._actor_manager.next(timeout=0.1):
    732     # If there are no actors running, warn about potentially
    733     # insufficient resources
    734     if not self._actor_manager.num_live_actors:

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/air/execution/_internal/actor_manager.py:191, in RayActorManager.next(self, timeout)
    190 # We always try to start actors as this won't trigger an event callback
--> 191 self._try_start_actors()
    193 # If an actor was killed, this was our event, and we return.

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/air/execution/_internal/actor_manager.py:361, in RayActorManager._try_start_actors(self, max_actors)
    360 # Start Ray actor
--> 361 actor = remote_actor_cls.remote(**kwargs)
    363 # Track

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/actor.py:686, in ActorClass.options.<locals>.ActorOptionWrapper.remote(self, *args, **kwargs)
    685 def remote(self, *args, **kwargs):
--> 686     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/_private/auto_init_hook.py:24, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
     23 auto_init_ray()
---> 24 return fn(*args, **kwargs)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/util/tracing/tracing_helper.py:388, in _tracing_actor_creation.<locals>._invocation_actor_class_remote_span(self, args, kwargs, *_args, **_kwargs)
    387     assert "_ray_trace_ctx" not in kwargs
--> 388     return method(self, args, kwargs, *_args, **_kwargs)
    390 class_name = self.__ray_metadata__.class_name

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/actor.py:889, in ActorClass._remote(self, args, kwargs, **actor_options)
    885     # After serialize / deserialize modified class, the __module__
    886     # of modified class will be ray.cloudpickle.cloudpickle.
    887     # So, here pass actor_creation_function_descriptor to make
    888     # sure export actor class correct.
--> 889     worker.function_actor_manager.export_actor_class(
    890         meta.modified_class,
    891         meta.actor_creation_function_descriptor,
    892         meta.method_meta.methods.keys(),
    893     )
    895 resources = ray._private.utils.resources_from_ray_options(actor_options)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/_private/function_manager.py:531, in FunctionActorManager.export_actor_class(self, Class, actor_creation_function_descriptor, actor_method_names)
    522 actor_class_info = {
    523     "class_name": actor_creation_function_descriptor.class_name.split(".")[-1],
    524     "module": actor_creation_function_descriptor.module_name,
   (...)
    528     "actor_method_names": json.dumps(list(actor_method_names)),
    529 }
--> 531 check_oversized_function(
    532     actor_class_info["class"],
    533     actor_class_info["class_name"],
    534     "actor",
    535     self._worker,
    536 )
    538 self._worker.gcs_client.internal_kv_put(
    539     key, pickle.dumps(actor_class_info), True, KV_NAMESPACE_FUNCTION_TABLE
    540 )

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/_private/utils.py:755, in check_oversized_function(pickled, name, obj_type, worker)
    744 error = (
    745     "The {} {} is too large ({} MiB > FUNCTION_SIZE_ERROR_THRESHOLD={}"
    746     " MiB). Check that its definition is not implicitly capturing a "
   (...)
    753     ray_constants.FUNCTION_SIZE_ERROR_THRESHOLD // (1024 * 1024),
    754 )
--> 755 raise ValueError(error)

ValueError: The actor ImplicitFunc is too large (105 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
File ~/deep_learning/neuralnetwork_exercises/raytest.py:62
     47 tuner = tune.Tuner(
     48     # Create a wrapped version of dummytrain with one of the input parameters
     49     # (data_handle) already pre-defined, so that ray tune only needs to pass
   (...)
     59     param_space=config
     60 )
     61 # Attempt to tune the hyperparameters
---> 62 tuner.fit()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/tuner.py:364, in Tuner.fit(self)
    362 if not self._is_ray_client:
    363     try:
--> 364         return self._local_tuner.fit()
    365     except TuneError as e:
    366         raise TuneError(
    367             _TUNER_FAILED_MSG.format(
    368                 path=self._local_tuner.get_experiment_checkpoint_dir()
    369             )
    370         ) from e

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:526, in TunerInternal.fit(self)
    524 param_space = copy.deepcopy(self.param_space)
    525 if not self._is_restored:
--> 526     analysis = self._fit_internal(trainable, param_space)
    527 else:
    528     analysis = self._fit_resume(trainable, param_space)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py:645, in TunerInternal._fit_internal(self, trainable, param_space)
    632 """Fitting for a fresh Tuner."""
    633 args = {
    634     **self._get_tune_run_arguments(trainable),
    635     **dict(
   (...)
    643     **self._tuner_kwargs,
    644 }
--> 645 analysis = run(
    646     **args,
    647 )
    648 self.clear_remote_string_queue()
    649 return analysis

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/tune.py:1014, in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, storage_path, storage_filesystem, search_alg, scheduler, checkpoint_config, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, reuse_actors, raise_on_failed_trial, callbacks, max_concurrent_trials, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, chdir_to_trial_dir, local_dir, _experiment_checkpoint_dir, _remote, _remote_string_queue, _entrypoint)
   1012             _report_air_progress(runner, air_progress_reporter)
   1013 except Exception:
-> 1014     runner.cleanup()
   1015     raise
   1017 tune_taken = time.time() - tune_start

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py:2025, in TuneController.cleanup(self)
   2023 def cleanup(self):
   2024     """Cleanup trials and callbacks."""
-> 2025     self._cleanup_trials()
   2026     self.end_experiment_callbacks()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py:845, in TuneController._cleanup_trials(self)
    840     trial = self._actor_to_trial[tracked_actor]
    841     logger.debug(
    842         f"Scheduling trial stop at end of experiment (trial {trial}): "
    843         f"{tracked_actor}"
    844     )
--> 845     self._schedule_trial_stop(trial)
    847 # Clean up cached actors now
    848 self._cleanup_cached_actors(force_all=True)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py:1455, in TuneController._schedule_trial_stop(self, trial, exception)
   1451 self._actor_to_trial.pop(tracked_actor)
   1453 trial.set_ray_actor(None)
-> 1455 self._remove_actor(tracked_actor=tracked_actor)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py:864, in TuneController._remove_actor(self, tracked_actor)
    863 def _remove_actor(self, tracked_actor: TrackedActor):
--> 864     stop_future = self._actor_manager.schedule_actor_task(
    865         tracked_actor, "stop", _return_future=True
    866     )
    867     now = time.monotonic()
    869     if self._actor_manager.remove_actor(
    870         tracked_actor, kill=False, stop_future=stop_future
    871     ):
    872         # If the actor was previously alive, track

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ray/air/execution/_internal/actor_manager.py:725, in RayActorManager.schedule_actor_task(self, tracked_actor, method_name, args, kwargs, on_result, on_error, _return_future)
    722 if tracked_actor not in self._live_actors_to_ray_actors_resources:
    723     # Actor is not started, yet
    724     if tracked_actor not in self._pending_actors_to_attrs:
--> 725         raise ValueError(
    726             f"Tracked actor is not managed by this event manager: "
    727             f"{tracked_actor}"
    728         )
    730     # Cache tasks for future execution
    731     self._pending_actors_to_enqueued_actor_tasks[tracked_actor].append(
    732         (tracked_actor_task, method_name, args, kwargs)
    733     )

ValueError: Tracked actor is not managed by this event manager: <TrackedActor 311442843179311338062397268191606075770>

In [2]:

The error message from the first exception seems to suggest that I should not pass data into the trainable function this way, because it results in a function that occupies too large a memory footprint. The error message furthermore suggests using ray.put() as some kind of workaround:

ValueError: The actor ImplicitFunc is too large (105 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.

My question: In general, what is the recommended technique for passing access to training data into a “trainable” function, while avoiding this kind of error condition? And does that technique actually involve using ray.put(), or should I ignore that part of the error message as irrelevant and/or misleading in this case?
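For example, based on the tip in the error message and my reading of the Ray docs, I imagine the workaround would look something like one of the two sketches below, but I haven't verified that either is the intended pattern (in particular, tune.with_parameters() is just my guess at the relevant utility):

# Guess 1: put the dataset dictionary into the Ray object store myself and
# pass only the small ObjectRef into the trainable, fetching the data back
# with ray.get() inside the function
dthan_ref = ray.put(dthan)

def dummytrain_ref(config, data_ref):
    data_handle = ray.get(data_ref)   # retrieve datasets from the object store
    train.report({'loss': np.random.uniform()})

tuner = tune.Tuner(
    partial(dummytrain_ref, data_ref=dthan_ref),
    tune_config=tune.TuneConfig(num_samples=10),
    param_space=config
)

# Guess 2: let Ray Tune handle the object store bookkeeping via
# tune.with_parameters(), which as I understand it stores the keyword
# arguments in the object store on my behalf
tuner = tune.Tuner(
    tune.with_parameters(dummytrain, data_handle=dthan),
    tune_config=tune.TuneConfig(num_samples=10),
    param_space=config
)

Is one of these the recommended approach, or is there a different idiom I should be using instead?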