šŸ¶
Tensorflow

Understanding global_step in TensorFlow

By Ondřej Dolanský on 12/15/2024

Understand the crucial role of global_step in TensorFlow for tracking training iterations, optimizing learning rates, and restoring models for consistent, efficient training.


Introduction

In TensorFlow, keeping track of your training progress is essential for effective model development. A key component of this is the global_step, a variable that acts as a counter for the total number of training steps taken. This simple counter plays a vital role in various aspects of training, from monitoring progress to implementing sophisticated learning rate schedules.

Step-by-Step Guide

In TensorFlow, the global_step is a variable that keeps track of the total number of training steps taken.

Think of it like a counter that increments every time your model processes a batch of data.

global_step = tf.Variable(0, name="global_step", trainable=False)

It's useful for things like:

  • Tracking progress: You can see how far along your training is.
  • Saving checkpoints: You can save your model at specific steps.
  • Learning rate decay: You can adjust the learning rate based on the step (see the decay sketch after this list).
  • TensorBoard visualization: You can log summaries and metrics against the global step to visualize training progress.
with summary_writer.as_default():
    tf.summary.scalar('loss', loss, step=global_step)
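
For example, a step-driven learning-rate decay can be wired up with a Keras schedule. Below is a minimal sketch, assuming tf.keras.optimizers.schedules.ExponentialDecay; the decay numbers are illustrative, not taken from this article:

import tensorflow as tf

global_step = tf.Variable(0, name="global_step", trainable=False, dtype=tf.int64)

# Decay the learning rate by 4% every 1,000 steps
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.96)

# Either let the optimizer drive the schedule from its own step counter...
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# ...or evaluate the schedule yourself, driven by your global_step
current_lr = lr_schedule(global_step)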

While TensorFlow 1.x had a built-in tf.train.get_global_step(), in TensorFlow 2.0 and later, it's recommended to manage your own global_step variable.

You can manually increment it within your training loop.

global_step.assign_add(1)
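
As an aside, if you train with a Keras optimizer you may not need a separate counter at all: every tf.keras optimizer maintains an iterations variable that apply_gradients increments once per step, so it can double as the global step.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# optimizer.iterations is an integer tf.Variable that apply_gradients
# increments automatically on every optimization step
step = optimizer.iterations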

Even though you manage it yourself, the global_step remains a crucial aspect of tracking and managing your training process in TensorFlow.

Code Example

This Python code defines and trains a simple feedforward neural network using TensorFlow: a two-layer model, an Adam optimizer, and a mean squared error loss. The training loop iterates over epochs and batches of a small synthetic dataset (included so the example runs as-is), computes the loss and gradients, applies the gradients to update the model weights, and increments a global step variable after every batch. That counter is then used to log the loss and mean absolute error to TensorBoard, print periodic progress updates, and save checkpoints at fixed intervals, tying summaries and checkpoints to specific training iterations.

import tensorflow as tf

# Define your model
model = tf.keras.models.Sequential([
  tf.keras.layers.Dense(10, activation='relu', input_shape=(100,)),
  tf.keras.layers.Dense(1)
])

# Define your optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# Define your loss function
loss_fn = tf.keras.losses.MeanSquaredError()

# Create a small synthetic dataset so the example runs end to end
x_train = tf.random.normal((1000, 100))
y_train = tf.random.normal((1000, 1))
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)

# Initialize the global step variable (non-trainable so the optimizer ignores it)
global_step = tf.Variable(0, name="global_step", trainable=False, dtype=tf.int64)

# Create a TensorBoard summary writer
summary_writer = tf.summary.create_file_writer('logs/training')

# Create the checkpoint object once, so its save counter persists across saves
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer, global_step=global_step)

# Training loop
for epoch in range(10):
  for batch_idx, (x_batch, y_batch) in enumerate(dataset):
    with tf.GradientTape() as tape:
      predictions = model(x_batch, training=True)
      loss = loss_fn(y_batch, predictions)

    # Calculate gradients
    gradients = tape.gradient(loss, model.trainable_variables)

    # Apply gradients and update model weights
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Increment the global step
    global_step.assign_add(1)

    # Log metrics and summaries
    with summary_writer.as_default():
      tf.summary.scalar('loss', loss, step=global_step)
      tf.summary.scalar('mean_absolute_error',
                        tf.reduce_mean(tf.keras.metrics.mean_absolute_error(y_batch, predictions)),
                        step=global_step)

    # Print progress
    if global_step % 100 == 0:
      print(f'Epoch: {epoch}, Global Step: {global_step.numpy()}, Loss: {loss.numpy()}')

    # Example: Save a checkpoint every 500 steps
    if global_step % 500 == 0:
      checkpoint.save(file_prefix='./checkpoints/my_checkpoint')

# Close the summary writer
summary_writer.close()

Explanation:

  1. Initialization:

    • We define the model, optimizer, loss function, and a small synthetic dataset.
    • The global_step variable is initialized to 0 and marked non-trainable.
    • A TensorBoard summary writer and a single tf.train.Checkpoint object are created.
  2. Training Loop:

    • We iterate through epochs and batches of data.
    • Inside the loop, we perform the following:
      • Calculate the loss and gradients.
      • Apply gradients to update model weights.
      • Increment global_step using assign_add(1) after each batch.
      • Log the loss and metrics to TensorBoard, associating them with the current global_step.
      • Print progress updates based on the global_step.
      • Save checkpoints at specific global_step intervals.
  3. TensorBoard Visualization:

    • Run tensorboard --logdir=logs/training in your terminal to visualize the training progress, including the loss and metrics plotted against the global_step.
  4. Checkpoint Saving:

    • The code saves a checkpoint every 500 steps through the single Checkpoint object created before the loop. This allows you to resume training from a specific point later.

Key Points:

  • In TensorFlow 2.0+, you manage the global_step manually.
  • Increment it consistently after each batch of data is processed.
  • Use the global_step for logging, checkpointing, and potentially learning rate scheduling.
  • Visualize training progress effectively using TensorBoard by associating summaries with the global_step.

Additional Notes

  • Relationship with epochs: While an epoch represents a full pass through your entire training dataset, the global_step increments with each batch processed. Therefore, the global_step provides a more granular view of the training progress, especially when dealing with large datasets that require multiple epochs.
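
To make that relationship concrete, here is some quick illustrative arithmetic (the dataset size and batch size are made up for this example):

num_examples = 1000
batch_size = 32
steps_per_epoch = -(-num_examples // batch_size)  # ceiling division -> 32 steps
# After 10 full epochs the counter reads:
global_step_after_10_epochs = 10 * steps_per_epoch  # 320 steps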

  • Customization: You can initialize the global_step to a value other than 0 if you're resuming training from a previously saved checkpoint (although restoring a tf.train.Checkpoint that tracks global_step sets the counter for you). This ensures continuity in your training process.
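
A minimal resume sketch, reusing the model, optimizer, and checkpoint path from the code example above; restoring the checkpoint brings global_step back along with the weights, so the counter continues where it left off:

import tensorflow as tf

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer, global_step=global_step)
latest = tf.train.latest_checkpoint('./checkpoints')
if latest:
    checkpoint.restore(latest)
    print(f'Resuming from global step {int(global_step)}')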

  • Alternatives in Keras: If you're using Keras for model training, you might not directly interact with the global_step variable. Keras provides callbacks like ModelCheckpoint and TensorBoard that internally handle the global_step for saving checkpoints and logging summaries.
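
For instance, an equivalent fit-based setup might look like the sketch below (it assumes the same model and dataset as the example above; the file paths are arbitrary):

model.compile(optimizer='adam', loss='mse', metrics=['mean_absolute_error'])
model.fit(dataset, epochs=10, callbacks=[
    tf.keras.callbacks.ModelCheckpoint('./checkpoints/keras_ckpt', save_weights_only=True),
    tf.keras.callbacks.TensorBoard(log_dir='logs/keras'),
])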

  • Debugging and analysis: By analyzing the metrics and summaries logged against the global_step in TensorBoard, you can gain insights into your model's training behavior, identify potential issues like overfitting or slow convergence, and make informed decisions about hyperparameter tuning.

Summary

Feature                  Description
What it is               A tf.Variable acting as a counter for training steps (batches processed).
How to create it         global_step = tf.Variable(0, name="global_step", trainable=False)
How to increment it      global_step.assign_add(1) within your training loop.
Key uses
    - Tracking progress: monitor how far training has progressed.
    - Saving checkpoints: save the model at specific steps.
    - Learning rate decay: adjust the learning rate dynamically.
    - TensorBoard visualization: log summaries and metrics against the step.
TensorFlow 1.x vs 2.x
    - 1.x: used tf.train.get_global_step().
    - 2.x onwards: manually manage the global_step variable.

Key takeaway: Even though manually managed in TensorFlow 2.0+, global_step remains essential for tracking, managing, and visualizing your training process.

Conclusion

The global_step in TensorFlow, while a simple concept, is a fundamental tool for managing and understanding your model's training process. It acts as a universal counter, tracking the number of training steps taken, and seamlessly integrates with other TensorFlow components for checkpointing, learning rate decay, and visualization in TensorBoard. Whether you're working on a single machine or in a distributed training environment, effectively using and monitoring the global_step can significantly improve your model development workflow and lead to more informed training decisions.
