Thanks for the reply, @sven1977!
I tried version 2.0.0dev on Colab, but now I get the following error when running the training:
== Status ==
Memory usage on this node: 1.8/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.45 GiB heap, 0.0/3.72 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /root/ray_results/experiment_full_circle_elevator_nosincos_date_29-06-2021_time_11-28-44_seed_4_NUM_EPISODES_TRAINING_15000_restored
Number of trials: 1/1 (1 ERROR)
Trial name                      status    loc
my_train_fn_None_299a9_00000    ERROR

Number of errored trials: 1
Trial name                      # failures   error file
my_train_fn_None_299a9_00000    1            /root/ray_results/experiment_full_circle_elevator_nosincos_date_29-06-2021_time_11-28-44_seed_4_NUM_EPISODES_TRAINING_15000_restored/my_train_fn_None_299a9_00000_0_2021-06-29_11-28-44/error.txt
(pid=2819) WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=2819) Instructions for updating:
(pid=2819) non-resource variables are not supported in the long term
(pid=2819) WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_probability/python/internal/variadic_reduce.py:115: calling function (from tensorflow.python.eager.def_function) with experimental_compile is deprecated and will be removed in a future version.
(pid=2819) Instructions for updating:
(pid=2819) experimental_compile is deprecated, use jit_compile instead
(pid=2819) 2021-06-29 11:28:48,822 WARNING util.py:53 -- Install gputil for GPU system monitoring.
(pid=2819) 2021-06-29 11:28:49,527 ERROR worker.py:406 -- SystemExit was raised from the worker
(pid=2819) Traceback (most recent call last):
(pid=2819)   File "python/ray/_raylet.pyx", line 491, in ray._raylet.execute_task
(pid=2819)   File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
(pid=2819)   File "python/ray/_raylet.pyx", line 444, in ray._raylet.execute_task.function_executor
(pid=2819)   File "/usr/local/lib/python3.7/dist-packages/ray/function_manager.py", line 556, in actor_method_executor
(pid=2819)     return method(__ray_actor, *args, **kwargs)
(pid=2819)   File "/usr/local/lib/python3.7/dist-packages/ray/actor.py", line 988, in __ray_terminate__
(pid=2819)     ray.actor.exit_actor()
(pid=2819)   File "/usr/local/lib/python3.7/dist-packages/ray/actor.py", line 1064, in exit_actor
(pid=2819)     raise exit
(pid=2819) SystemExit: 0
(pid=2819)
(pid=2819) During handling of the above exception, another exception occurred:
(pid=2819)
(pid=2819) Traceback (most recent call last):
(pid=2819)   File "python/ray/_raylet.pyx", line 591, in ray._raylet.task_execution_handler
(pid=2819)   File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
(pid=2819)   File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
(pid=2819)   File "python/ray/includes/libcoreworker.pxi", line 33, in ray._raylet.ProfileEvent.__exit__
(pid=2819)   File "/usr/lib/python3.7/traceback.py", line 167, in format_exc
(pid=2819)     return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
(pid=2819)   File "/usr/lib/python3.7/traceback.py", line 121, in format_exception
(pid=2819)     type(value), value, tb, limit=limit).format(chain=chain))
(pid=2819)   File "/usr/lib/python3.7/traceback.py", line 508, in __init__
(pid=2819)     capture_locals=capture_locals)
(pid=2819)   File "/usr/lib/python3.7/traceback.py", line 359, in extract
(pid=2819)     linecache.checkcache(filename)
(pid=2819)   File "/usr/lib/python3.7/linecache.py", line 74, in checkcache
(pid=2819)     stat = os.stat(fullname)
(pid=2819)   File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 403, in sigterm_handler
(pid=2819)     sys.exit(1)
(pid=2819) SystemExit: 1
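As far as I can tell, these SystemExit tracebacks are just the worker being shut down after the actual failure (exit_actor() and the SIGTERM handler), not the root cause itself. On the driver, tune.run() then raises: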
---------------------------------------------------------------------------
TuneError                                 Traceback (most recent call last)
<ipython-input-20-5ba34e9666d1> in <module>()
      6                     restore=checkpoint_path,
      7                     name=experiment_name,
----> 8                     config=config
      9                     )
     10

/usr/local/lib/python3.7/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint, _remote)
    541     if incomplete_trials:
    542         if raise_on_failed_trial and not state[signal.SIGINT]:
--> 543             raise TuneError("Trials did not complete", incomplete_trials)
    544         else:
    545             logger.error("Trials did not complete: %s", incomplete_trials)

TuneError: ('Trials did not complete', [my_train_fn_None_299a9_00000])
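The TuneError on the driver only says that the trial did not complete; the actual stack trace should be in the error.txt listed in the table above, which I read in the notebook like this (path copied verbatim from the status output):

# Dump the per-trial error file that Tune lists under "error file".
error_file = (
    "/root/ray_results/experiment_full_circle_elevator_nosincos_date_29-06-2021"
    "_time_11-28-44_seed_4_NUM_EPISODES_TRAINING_15000_restored/"
    "my_train_fn_None_299a9_00000_0_2021-06-29_11-28-44/error.txt"
)
with open(error_file) as f:
    print(f.read())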
My code:

analysis = tune.run(
    my_train_fn,
    restore=checkpoint_path,
    name=experiment_name,
    config=config,
)
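For context, my_train_fn follows the standard Tune function-trainable pattern. Below is only a minimal sketch of that shape (the step.txt file name and the dummy counter loop are made up for illustration; my real function builds an RLlib trainer from config):

import os

from ray import tune


def my_train_fn(config, checkpoint_dir=None):
    # Hypothetical minimal shape of the trainable; the real my_train_fn
    # builds an RLlib trainer from `config` instead of this dummy loop.
    start = 0
    if checkpoint_dir:
        # Tune passes checkpoint_dir when the trial is restored.
        with open(os.path.join(checkpoint_dir, "step.txt")) as f:
            start = int(f.read())
    for step in range(start, 10):
        # Write a checkpoint so that restore= has something to load.
        with tune.checkpoint_dir(step=step) as cp_dir:
            with open(os.path.join(cp_dir, "step.txt"), "w") as f:
                f.write(str(step))
        tune.report(training_iteration=step)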