Learn how to troubleshoot and fix the frustrating "CUDA out of memory" error in PyTorch, even when your GPU seems to have plenty of free memory available.
Training deep learning models often requires significant GPU memory, and running out of CUDA memory is a common issue. This article provides a comprehensive guide with twelve practical solutions to troubleshoot and resolve "CUDA out of memory" errors during your training process.
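Understand the Error: "CUDA out of memory" means an allocation request on the GPU could not be satisfied. During training, the model weights, gradients, optimizer state, activations, and the current input batch all compete for GPU memory. The error can also appear when free memory exists but is fragmented or held by another process.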
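To get a feel for the numbers, here is a back-of-the-envelope sketch. The function name and the 100M-parameter model are made up for illustration, and the estimate is only a lower bound: it counts weights, gradients, and optimizer buffers (two per parameter, as with Adam) and ignores activations, the CUDA context, and allocator overhead.

```python
def estimate_training_memory_mib(num_params, bytes_per_param=4, optimizer_states=2):
    """Rough lower bound in MiB: weights + gradients + optimizer buffers.

    Ignores activations, the CUDA context, and allocator overhead.
    """
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer buffers
    return num_params * bytes_per_param * copies / 2**20


# Hypothetical 100M-parameter model trained in fp32 with an Adam-style optimizer
print(f"{estimate_training_memory_mib(100_000_000):.0f} MiB")  # prints "1526 MiB"
```

Activations usually come on top of this and grow with batch size, which is why reducing the batch size is often the first thing to try.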
Check GPU Memory:
import torch
print(torch.cuda.memory_summary())
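If the full summary is more than you need, a smaller report can be easier to read. The sketch below uses a hypothetical helper, gpu_memory_report, and assumes a PyTorch version that provides torch.cuda.mem_get_info. Comparing "reserved" with "allocated" helps explain an out-of-memory error on a GPU that seems to have free memory: PyTorch's caching allocator may hold memory that no tensor is currently using.

```python
import torch


def gpu_memory_report(device_index=0):
    """Return allocator and driver memory figures in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mib = 2**20
    # Free and total memory as seen by the CUDA driver (covers all processes)
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    return {
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,  # live tensors
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,  # held by the allocator
        "free_mib": free_bytes / mib,
        "total_mib": total_bytes / mib,
    }


print(gpu_memory_report())
```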
Reduce Batch Size:
batch_size = 32 # Try 16, 8, 4, etc.
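Instead of lowering the value by hand, you can search for a batch size that fits. This is a minimal sketch, not a library feature: find_working_batch_size and the try_step callback are hypothetical names, and the string check relies on PyTorch reporting CUDA OOM as a RuntimeError whose message contains "out of memory".

```python
import torch


def find_working_batch_size(try_step, start=32, minimum=1):
    """Halve the batch size until try_step(batch_size) runs without an OOM error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as error:
            if "out of memory" not in str(error).lower():
                raise  # unrelated error: do not hide it
        # Outside the except block, so the traceback no longer pins tensors
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        batch_size //= 2
    raise RuntimeError("no batch size fits in memory, even the minimum")
```

Here try_step would run one forward and backward pass on a batch of the given size. Once a size works, keep it fixed for the rest of training.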
Use Smaller Data: Load only a portion of your dataset into memory.
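One way to work with a portion of a dataset is torch.utils.data.Subset. The sketch below uses a random stand-in dataset with assumed MNIST-like shapes. This mainly helps if you were moving the whole dataset to the GPU up front; when batches are transferred one at a time, the batch size matters more for GPU memory than the dataset size.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in dataset (hypothetical): 1,000 samples of 1x28x28 "images" with 10 classes
full_dataset = TensorDataset(
    torch.randn(1000, 1, 28, 28),
    torch.randint(0, 10, (1000,)),
)

# Keep only the first 100 samples (10% of the data)
subset = Subset(full_dataset, range(100))
loader = DataLoader(subset, batch_size=16, shuffle=True)

print(len(full_dataset), len(subset), len(loader))
```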
Use a Smaller Model: Choose a model with fewer parameters.
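Data Loading Optimization: Keep the dataset on the CPU and move one batch at a time to the GPU. num_workers and pin_memory speed up loading and host-to-GPU transfer; they do not lower GPU memory use by themselves, and each worker uses additional host RAM.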
torch.utils.data.DataLoader(..., num_workers=4, pin_memory=True)
Gradient Accumulation: Simulate larger batch sizes with limited memory.
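A minimal sketch of gradient accumulation, using a toy linear model and random data as placeholders: gradients from several small batches are summed before a single optimizer step, so only one small batch of activations is in memory at a time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4  # effective batch = 4 micro-batches of 8 samples = 32
micro_batches = [(torch.randn(8, 20), torch.randint(0, 2, (8,))) for _ in range(8)]

optimizer_steps = 0
optimizer.zero_grad()
for step, (data, target) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(data), target)
    # Scale so the summed gradients match the mean over the effective batch
    (loss / accumulation_steps).backward()
    if step % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)  # 8 micro-batches / 4 = 2 optimizer steps
```

Dividing the loss by accumulation_steps keeps the gradient scale comparable to one large batch, so the learning rate does not need to change.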
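Mixed Precision Training (fp16): Use half-precision floats for most operations. Note that model.half() casts every parameter to fp16, which is pure half precision and can hurt numerical stability; torch.autocast combined with gradient scaling keeps the weights in fp32 and is the usual mixed-precision route.
model.half()  # Convert all parameters and buffers to fp16 (inputs must be fp16 too)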
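Here is a hedged sketch of one mixed-precision training step with torch.autocast and a gradient scaler, using a toy model and random data as placeholders. It falls back to a plain fp32 step when no GPU is present. torch.cuda.amp.GradScaler is used for compatibility with older releases; recent PyTorch versions prefer torch.amp.GradScaler("cuda") and may print a deprecation warning.

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = nn.Linear(20, 2).to(device)  # weights stay in fp32
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
# Loss scaling guards fp16 gradients against underflow; disabled without CUDA
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

data = torch.randn(8, 20, device=device)
target = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
amp_dtype = torch.float16 if use_cuda else torch.bfloat16
# autocast runs eligible ops in reduced precision and keeps sensitive ops in fp32
with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_cuda):
    loss = loss_fn(model(data), target)

scaler.scale(loss).backward()  # scale the loss, then backpropagate
scaler.step(optimizer)  # unscales gradients; skips the step on inf/NaN
scaler.update()  # adjust the scale factor for the next iteration
```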
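Clear Unused Variables: Delete references to tensors you no longer need, then release PyTorch's cached blocks. Note that empty_cache() cannot free memory that live tensors still occupy.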
del variable
torch.cuda.empty_cache()
Use Gradient Checkpointing: Trade computation time for memory.
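A minimal sketch of gradient checkpointing with torch.utils.checkpoint, assuming a PyTorch version that accepts the use_reentrant argument. CheckpointedMLP is a made-up toy model: the intermediate activations of block1 are not stored during the forward pass and are recomputed during the backward pass.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    """Hypothetical two-block MLP; block1 activations are recomputed in backward."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU()
        )
        self.block2 = nn.Linear(64, 10)

    def forward(self, x):
        # Only the input and output of block1 are kept for the backward pass
        x = checkpoint(self.block1, x, use_reentrant=False)
        return self.block2(x)


model = CheckpointedMLP()
out = model(torch.randn(4, 32))
out.sum().backward()  # block1's forward runs a second time here
```

The saving grows with the depth of the checkpointed block; for a model this small it is negligible, and the extra forward pass only costs time.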
Use a GPU with More Memory: If possible, upgrade your hardware.
Monitor Memory Usage: Use tools like nvidia-smi to track GPU memory.
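If you want the nvidia-smi numbers inside a script, for example to log them once per epoch, a sketch like the following can work. query_gpu_memory is a hypothetical helper; it returns None when nvidia-smi is not installed or fails.

```python
import shutil
import subprocess


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...] per GPU, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    rows = []
    for line in result.stdout.strip().splitlines():
        try:
            used, total = (int(field) for field in line.split(","))
        except ValueError:
            continue  # skip rows the driver reports as unsupported
        rows.append((used, total))
    return rows


print(query_gpu_memory())
```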
This Python code defines and trains a simple convolutional neural network (CNN) to classify handwritten digits from the MNIST dataset using PyTorch. It includes data loading, model definition, training loop, and optional techniques for performance optimization like GPU utilization, gradient accumulation, and mixed precision training. The code also provides comments on monitoring memory usage and implementing gradient checkpointing for advanced users.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

# --- Hyperparameters ---
batch_size = 32  # Reduce this if you run out of memory
learning_rate = 0.001
epochs = 10

# --- Data Loading ---
train_dataset = datasets.MNIST(
    root="./data",
    train=True,
    download=True,
    transform=transforms.ToTensor(),
)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,  # Use multiple workers for data loading
    pin_memory=True,  # Pin memory for faster transfer to GPU
)

# --- Model Definition ---
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2)
        # 28x28 input: the conv (padding=1) keeps 28x28, the 2x2 pool gives 14x14
        self.fc = nn.Linear(16 * 14 * 14, 10)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = x.view(-1, 16 * 14 * 14)
        x = self.fc(x)
        return x

# --- Device Selection ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- Model, Optimizer, Loss ---
model = SimpleCNN().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

# --- Training Loop ---
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # --- Data to Device ---
        data, target = data.to(device), target.to(device)

        # --- Forward Pass ---
        output = model(data)
        loss = loss_fn(output, target)

        # --- Backward Pass and Optimization ---
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # --- Print Progress (Optional) ---
        if batch_idx % 100 == 0:
            print(
                f"Epoch: {epoch+1}/{epochs}, Batch: {batch_idx}/{len(train_loader)}, Loss: {loss.item():.4f}"
            )

        # --- Clear Unused Variables (Optional) ---
        del data, target, output, loss

        # --- Gradient Accumulation (Example) ---
        # Replace the unconditional zero_grad()/step() above with:
        # if (batch_idx + 1) % 2 == 0:  # Accumulate gradients over 2 batches
        #     optimizer.step()
        #     optimizer.zero_grad()

# --- Mixed Precision Training (fp16) ---
# model.half()  # Convert model to fp16
# ... (Modify optimizer and data types accordingly)

# --- Gradient Checkpointing (Advanced) ---
# Use libraries like torch.utils.checkpoint

# --- Monitor Memory Usage ---
# Use nvidia-smi command in the terminal
Explanation:
Batch size: Lower batch_size if you run out of memory.
Data loading: Set num_workers to load data in parallel. pin_memory=True speeds up data transfer to the GPU.
Device placement: Move each batch (data, target) to the selected device (CPU or GPU).
Clearing variables: Delete tensors you no longer need (del) and call torch.cuda.empty_cache() within the loop to free up GPU memory.
Monitoring: Run nvidia-smi in a separate terminal to track how your changes affect memory consumption.
Remember to experiment with these techniques one at a time and find the best combination for your specific model and dataset.
General Tips:
Understanding the Error Message:
The "CUDA out of memory" error message often provides additional details, such as the size of the failed allocation and how much memory PyTorch has allocated and reserved. If reserved memory is much larger than allocated memory, fragmentation is a likely cause; try setting the PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:VALUE environment variable (experiment with different values).
Advanced Techniques:
Use DistributedDataParallel to split the workload across multiple GPUs.
Use the PyTorch profiler (torch.profiler) or other GPU profiling tools (e.g., NVIDIA Nsight Systems) to identify memory bottlenecks in your code.
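The PYTORCH_CUDA_ALLOC_CONF setting mentioned above can also be applied from inside a script rather than the shell. This is a small sketch: the value 128 is purely illustrative, not a recommendation, and the variable should be in place before CUDA is initialized, so setting it before importing torch is the safe choice.

```python
import os

# Set the allocator option before CUDA is initialized; 128 is an illustrative value.
# setdefault leaves any value already exported in the shell untouched.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # noqa: E402  (imported after the environment variable on purpose)

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Smaller values limit how large a cached block the allocator may split, which can reduce fragmentation at some cost in allocation speed. Try a few values and watch the reserved versus allocated figures.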
This guide summarizes common solutions for the "CUDA out of memory" error, which occurs when your GPU lacks sufficient memory for training data and models.
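| Solution | Description |
|---|---|
| Check GPU memory | Print torch.cuda.memory_summary() to see current usage. |
| Reduce batch size | Try 16, 8, 4, etc. |
| Use smaller data | Load only a portion of your dataset into memory. |
| Use a smaller model | Choose a model with fewer parameters. |
| Optimize data loading | Set num_workers and pin_memory=True on the DataLoader. |
| Gradient accumulation | Simulate larger batch sizes with limited memory. |
| Mixed precision training (fp16) | Use half-precision floats. |
| Clear unused variables | Delete unused tensors (del) and call torch.cuda.empty_cache(). |
| Gradient checkpointing | Trade computation time for memory. |
| Use a GPU with more memory | If possible, upgrade your hardware. |
| Monitor memory usage | Use tools like nvidia-smi to track GPU memory. |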
By implementing these strategies, you can effectively manage GPU memory and overcome "CUDA out of memory" errors. Remember that the most effective approach often involves a combination of these solutions, tailored to your specific deep learning task. Experiment, monitor your memory usage, and iterate to find the optimal configuration for your training process.