How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I ran a large-scale training job using the client-server setup with an external environment. I’m a little old-school, so I just created the trainer directly and saved checkpoints in the training loop instead of using tune:
trainer = A2CTrainer(config=config)
# Serving and training loop.
i = 0
while True:
    results = trainer.train()
    # Export only the weights of policy "3" after every training iteration.
    trainer.export_policy_checkpoint(
        f"checkpoint_directory/checkpoint_{i}",
        policy_id="3"
    )
    i += 1
Each checkpoint directory looks like this:
ls -al
total 2008
drwx--S--- 2 rusu1 4096 Jul 26 20:37 ./
drwx--S--- 48 rusu1 249856 Jul 26 20:37 ../
-rw------- 1 rusu1 231 Jul 26 20:37 checkpoint
-rw------- 1 rusu1 814172 Jul 26 20:37 model.data-00000-of-00001
-rw------- 1 rusu1 1084 Jul 26 20:37 model.index
-rw------- 1 rusu1 963545 Jul 26 20:37 model.meta
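For reference, these four files match the layout that a TensorFlow Saver writes: a `checkpoint` state file plus `model.index`, `model.meta`, and one or more `model.data-*` shards. Below is a minimal sketch, using only the standard library, for confirming that an exported directory is complete before trying to load it. The helper name `summarize_checkpoint_dir` and the default prefix `model` are assumptions made for illustration; neither is part of Ray or TensorFlow.

```python
import os
import tempfile


def summarize_checkpoint_dir(directory, prefix="model"):
    """Report which TensorFlow-style checkpoint files exist in `directory`.

    Hypothetical helper: it only inspects file names, it does not read
    or validate the contents of the files.
    """
    names = sorted(os.listdir(directory))
    return {
        "state_file": "checkpoint" in names,
        "index": f"{prefix}.index" in names,
        "meta_graph": f"{prefix}.meta" in names,
        "data_shards": [n for n in names if n.startswith(f"{prefix}.data-")],
    }


if __name__ == "__main__":
    # Recreate the listing above with empty placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("checkpoint", "model.data-00000-of-00001",
                     "model.index", "model.meta"):
            open(os.path.join(tmp, name), "wb").close()
        print(summarize_checkpoint_dir(tmp))
```

Note that nothing in this listing carries a `.tune_metadata` suffix, so the summary says nothing about whether a tune-style restore would work.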
This ran for about 12 hours, and I am very pleased with the training results. Now I want to do some post-processing, including running a few episodes so I can visualize what the agent learned. I ran this overnight and the compute nodes I had open have since closed, so I no longer have the server process running. This shouldn’t be an issue since I have the checkpoints saved to disk. Following this inference example, I have the following in my post-processing script:
sim = MyExternalEnvSim()
ray.init()
trainer = A2CTrainer(config=config)
checkpoint = 'checkpoint_directory/checkpoint_45/model'
trainer.restore(checkpoint)
# Inference loop below
...
The external sim launches well and the trainer is created without an issue. However, when it attempts to restore from the checkpoint, I get the following error:
Traceback (most recent call last):
  File "my_script.py", line 127, in <module>
    trainer.restore(checkpoint)
  File "/venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 515, in restore
    with open(checkpoint_path + ".tune_metadata", "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoint_directory/checkpoint_45/model.tune_metadata'
So I guess that restore automatically looks for a .tune_metadata file. However, I didn’t run with tune, so I don’t have one of those. Is there a way to load the checkpointed files in the format that I have them?
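To make the failure concrete: going by the traceback, restore opens `checkpoint_path + ".tune_metadata"` before doing anything else. The sketch below mirrors only that path construction so a script can check for the sidecar file up front instead of failing inside restore. The function names are hypothetical, and the check reflects just the one line visible in the traceback, not everything restore may require.

```python
import os


def tune_metadata_path(checkpoint_path):
    """Build the sidecar path the same way the traceback line does."""
    return checkpoint_path + ".tune_metadata"


def has_tune_metadata(checkpoint_path):
    """Return True if the metadata file that restore tries to open exists."""
    return os.path.isfile(tune_metadata_path(checkpoint_path))


if __name__ == "__main__":
    checkpoint = "checkpoint_directory/checkpoint_45/model"
    if not has_tune_metadata(checkpoint):
        print(f"Missing {tune_metadata_path(checkpoint)}; "
              "this looks like a policy export, not a full trainer checkpoint.")
```

For the directories listed above this check would come back False, since each one contains only the four exported model files.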