Resume Training: Load and Retrain Keras Models

Introduction
Step-by-Step Guide
Code Example
Additional Notes
Summary
Conclusion
References

Introduction

In the realm of machine learning, the ability to save and load trained models is paramount. This allows us to preserve the knowledge gained during training and reuse models without retraining from scratch. This article provides a concise guide on how to save, load, and resume training Keras models using the HDF5 format, a popular choice for storing large binary objects. We'll cover the essential code snippets and highlight crucial considerations for a smooth and effective workflow.

Step-by-Step Guide

Save your model:
```
model.save('my_model.h5') 
```
This saves the model architecture, weights, and optimizer state.

Load your model:

from keras.models import load_model
model = load_model('my_model.h5')

Compile the model (if needed): If you didn't save the optimizer state or are changing optimizers, you'll need to compile:
```
model.compile(loss='...', optimizer='...', metrics=['...']) 
```
Continue training:
```
model.fit(new_data, new_labels, epochs=..., initial_epoch=previous_epochs)
```
- Replace '...' with your data, loss function, optimizer, and metrics.
- initial_epoch should be set to the epoch you left off on.

Important Considerations:

Optimizer state: Ensure the optimizer state is saved and loaded correctly for seamless continuation.
Learning rate: You might need to adjust the learning rate when resuming training, especially if you're adding new data.
Overfitting: Be cautious of overfitting to the new data if it's significantly different from the original training set.

Code Example

This Python code demonstrates how to save a trained Keras model to a file and then load it back to continue training or perform evaluation. It defines a simple convolutional neural network for image classification using the MNIST dataset. The code trains the model for a few epochs, saves it, and then loads it again to resume training from where it left off. Finally, it evaluates the model's performance on a test set. This approach is useful for breaking down training into multiple sessions, experimenting with different hyperparameters, or sharing trained models.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- 1. Define and Train Initial Model ---

# Create a simple model (replace with your own)
model = keras.Sequential(
    [
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ]
)

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Load sample data (MNIST)
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Train for a few epochs
model.fit(x_train, y_train, epochs=5) 

# --- 2. Save the Model ---
model.save('my_model.h5') 

# --- 3. Load the Model and Resume Training (Simulating a New Session) ---

# Load the saved model
loaded_model = keras.models.load_model('my_model.h5')

# (Optional) Compile if needed (e.g., changing optimizer)
# loaded_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Continue training from where you left off
# - Assume you trained for 5 epochs previously
# - Adjust learning rate if necessary
# - Be mindful of overfitting on new data
loaded_model.fit(x_train, y_train, epochs=10, initial_epoch=5) 

# --- 4. Evaluate the Model ---
loss, accuracy = loaded_model.evaluate(x_test, y_test)
print(f"Loss: {loss}, Accuracy: {accuracy}")

Explanation:

Model Definition and Initial Training:
- We define a simple Convolutional Neural Network (CNN) for image classification using MNIST.
- The model is compiled with an optimizer, loss function, and metrics.
- We load the MNIST dataset and perform initial training for 5 epochs.
Saving the Model:
- The model.save('my_model.h5') line saves the entire model (architecture, weights, optimizer state) to the 'my_model.h5' file.
Loading and Resuming Training:
- We load the saved model using keras.models.load_model().
- The initial_epoch=5 argument in model.fit() ensures training continues from the 6th epoch.
- Important: If you change the optimizer or its parameters after loading, you need to recompile the model using model.compile().
Evaluation:
- Finally, we evaluate the model's performance on the test set.

Key Points:

Seamless Continuation: Loading the model allows you to resume training without starting from scratch, preserving the learned weights and optimizer state.
Learning Rate Adjustment: When resuming training, especially with new data, consider adjusting the learning rate to fine-tune the model and avoid instability.
Overfitting: Be cautious of overfitting if the new data you're training on is significantly different from the original training set. Use techniques like regularization and validation to mitigate this.

Additional Notes

File Size: HDF5 files can become quite large, especially for complex models. Consider using compression (e.g., model.save('my_model.h5', save_format='h5', compression='gzip')) if storage is a concern.
Custom Objects: If your model uses custom layers, loss functions, or metrics, you need to define them again and pass them as a dictionary to load_model using the custom_objects argument.
Framework Compatibility: While HDF5 is widely supported, ensure compatibility if you plan to load the model in a different framework (e.g., PyTorch).
Model Versioning: It's good practice to version your saved models (e.g., 'my_model_v1.h5', 'my_model_v2.h5') to keep track of different training stages or experiments.
Checkpoint During Training: For long training processes, consider saving checkpoints periodically to avoid losing progress if training is interrupted. Keras provides callbacks for this purpose (e.g., ModelCheckpoint).
Data Preprocessing: Remember that any data preprocessing steps applied during training should also be applied to new data before feeding it to the loaded model.
Transfer Learning: Loading a pre-trained model and continuing training on a different but related task is a form of transfer learning. This can often lead to faster convergence and better performance.
Fine-tuning: When using a pre-trained model for transfer learning, you might want to freeze the weights of some layers (especially early layers) and only train the later layers to adapt to the new task.
Deployment: Saved models can be deployed for inference in various environments, including web applications, mobile devices, and embedded systems.

Summary

This table summarizes the key steps and considerations for saving, loading, and resuming training of Keras models:

Task	Code	Explanation	Considerations
Save Model	`model.save('my_model.h5')`	Saves the model architecture, weights, and optimizer state to a file named 'my_model.h5'.
Load Model	`python from keras.models import load_model model = load_model('my_model.h5')`	Loads the saved model from the 'my_model.h5' file.
Compile Model (if needed)	`model.compile(loss='...', optimizer='...', metrics=['...'])`	Compiles the model with the specified loss function, optimizer, and metrics. This is necessary if the optimizer state wasn't saved or if you're changing optimizers.
Resume Training	`model.fit(new_data, new_labels, epochs=..., initial_epoch=previous_epochs)`	Continues training the model from where it left off.	- Optimizer state: Ensure it's saved and loaded correctly. - Learning rate: Adjust if needed, especially with new data. - Overfitting: Be cautious when adding significantly different data.

Key takeaway: This process allows you to save your trained model and resume training later without starting from scratch, saving time and resources.

Conclusion

Saving, loading, and resuming training of Keras models is crucial for efficient machine learning workflows. By leveraging the HDF5 format, we can store a model's complete state, including its architecture, weights, and optimizer configuration. This enables us to reuse trained models, avoiding redundant computations and facilitating collaborative development. When resuming training, it's essential to consider the optimizer state, adjust the learning rate appropriately, and be mindful of potential overfitting, especially when introducing new data. This approach streamlines the model development process, allowing us to build upon previous successes and achieve better performance.

References

How to continue training? - Usage & Issues - Image.sc Forum | I trained a StarDist network and I would like to continue training with new dataset, but I don’t now how to do it.
python - Keras: How to save model and continue training? - Stack ... | Jul 29, 2017 ... However, when I load the model and try training it again, it starts all over as if it hasn't been trained before. The loss doesn't start from ...
Loading a saved model for continue training - PyTorch Forums | I tried to find a solution to that in other threads but I cannot find a problem like mine. I am training a feed-forward NN and once trained save it using: torch.save(model.state_dict(),model_name) Then I get some more data points and I want to retrain the model on the new set, so I load the model using: model.load_state_dict(torch.load(‘file_with_model’)) When i start training the model again, the error increases a lot. To check if it was a problem of the new points or the way I’m loading t...
python - Keras: Unable to continue training of loaded model - Stack ... | Mar 20, 2018 ... It looks very much like the problem with optimizer. For example if learning rate is not saved properly at the value it ended up during last ...
MLFlow: How to load results from model and continue training | I'd like to continue / finetune training of an existing keras/tensorflow model. We use MLFlow to store the model. How can I load the wieght from an existing model to the model and continue "fit" preferable with a different learning rate. Just loading the model fails with some error, PyFuncModel doe...
Loading-saved-keras-model-and-continue-training-from-last-epoch ... | Loading saved keras model and continue training from last epoch - sanpreet/Loading-saved-keras-model-and-continue-training-from-last-epoch
Save and load models | TensorFlow Core | Apr 3, 2024 ... keras—a high-level API to build and train models in TensorFlow. The new, high-level .keras format used in this tutorial is recommended for ...
Cannot resume training using model.save and load_model() · Issue ... | System Information Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes OS Platform and Distribution: CentOS Linux 7.6.1810 TensorFlow installed from ...
train - Is there a way to incorporate new data into an already trained ... | Jun 22, 2018 ... In keras, you can save your model using model.save and then load that model using model.load . If you call .fit again on the model that ...