
Upgrade Failed: Operation in Progress Error (Fix)

By Jan on 02/04/2025

Learn how to fix the "UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress" error and successfully complete your Helm upgrade.


Introduction

The error message "UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress" is a common issue encountered when working with Helm, Kubernetes' package manager. It typically arises when a previous Helm operation is interrupted or fails to complete successfully, leaving the release stuck in an inconsistent, pending state. This guide provides a step-by-step approach to troubleshooting and resolving the error so that Helm operations in your Kubernetes environment run smoothly.
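
Before changing anything, it helps to confirm that the release really is stuck. A minimal check, assuming Helm 3 and a release named my-release (adjust the name and namespace to your environment):

# Show the current state of the release; a stuck release typically reports
# a status such as pending-install, pending-upgrade, or pending-rollback.
helm status my-release

# List any releases that are still in a pending state (Helm 3).
helm list --pending --all-namespaces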

Step-by-Step Guide

The error "UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress" typically occurs when a Helm operation is interrupted or doesn't complete successfully. Here's a breakdown of how to troubleshoot and resolve this issue:

  1. Check Helm History:

    helm history <RELEASE_NAME>

    This command shows the revision history of your release. Look for any pending or failed operations.

  2. Rollback to a Previous Revision: If a previous operation is stuck, try rolling back:

    helm rollback <RELEASE_NAME> <REVISION_NUMBER>

    Replace <REVISION_NUMBER> with the desired revision from the helm history output.

  3. Verify and Resolve Underlying Issues:

    • Resource Conflicts: Check if there are conflicting resources in your cluster.
    • Network Problems: Ensure stable network connectivity between your machine and the Kubernetes cluster.
    • Insufficient Resources: Verify your cluster has enough resources (CPU, memory) to complete the operation.
  4. Manual Cleanup (Use with Caution): If the above steps don't work, you might need to manually clean up resources. Proceed with extreme caution, as this can lead to data loss.

    • Identify Stuck Resources: Use kubectl get pods and other relevant kubectl commands to find resources associated with the failed release.
    • Delete Stuck Resources: Carefully delete the identified resources using kubectl delete (see the example after this list).
  5. Prevent Future Occurrences:

    • Stable Network: Maintain a reliable network connection during Helm operations.
    • Sufficient Resources: Ensure your cluster has adequate resources.
    • Avoid Interruptions: Try not to interrupt Helm processes.
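
As a rough illustration of step 4, assuming the chart uses the common app.kubernetes.io/instance label and the release is named my-release (both are assumptions; check how your chart actually labels its resources), the lookup and cleanup might look like this:

# Find resources that belong to the failed release via its instance label.
kubectl get pods,services,deployments -l 'app.kubernetes.io/instance=my-release'

# Review the output carefully, then delete only resources that are genuinely
# stuck. Deleting the wrong object can cause data loss.
kubectl delete pod <STUCK_POD_NAME>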

If the issue persists, consult the documentation for your specific Helm version and Kubernetes distribution.

Code Example

This bash script helps troubleshoot Helm upgrade failures. It first checks the Helm history for failed or pending operations and allows rolling back to a previous revision. If no issues are found in the history, it guides users to manually inspect their Kubernetes cluster for resource conflicts, network problems, or insufficient resources. As a last resort, it provides instructions for manually cleaning up stuck resources, emphasizing caution. The script concludes with tips on preventing similar issues in the future, such as ensuring stable network connectivity, providing sufficient cluster resources, and avoiding interrupting Helm processes.

This script demonstrates how to troubleshoot the "UPGRADE FAILED" error in Helm.

Note: This script is for illustrative purposes only. Always back up your data and proceed with caution, especially when manually deleting resources.

#!/bin/bash

# Set the release name
RELEASE_NAME="my-release"

# 1. Check Helm History
echo "Checking Helm history for release: $RELEASE_NAME"
helm history "$RELEASE_NAME"

# Prompt user to continue if a failed or pending operation is found
read -p "Do you see any failed or pending operations in the history? (y/n) " -n 1 -r
echo    # (move to next line)
if [[ $REPLY =~ ^[Yy]$ ]]
then
    # 2. Rollback to a Previous Revision (User Input Required)
    read -p "Enter the revision number to rollback to: " REVISION_NUMBER
    echo "Rolling back $RELEASE_NAME to revision $REVISION_NUMBER..."
    helm rollback "$RELEASE_NAME" "$REVISION_NUMBER"

    # Check if rollback was successful
    if [[ $? -eq 0 ]]; then
        echo "Rollback successful. Check your application."
    else
        echo "Rollback failed. Proceeding to further troubleshooting."
    fi
else
    echo "No failed or pending operations found in history. Proceeding to resource checks."
fi

# 3. Verify and Resolve Underlying Issues (Manual Inspection Required)
echo "
# --- Manual Inspection Required ---
# Check for:
#   - Resource Conflicts: kubectl get pods, kubectl get events
#   - Network Problems: Connectivity between your machine and the cluster
#   - Insufficient Resources: kubectl describe nodes, kubectl top nodes
# ---
"

# 4. Manual Cleanup (Use with Extreme Caution - User Input Required)
read -p "Have you identified and resolved any underlying issues? (y/n) " -n 1 -r
echo    # (move to next line)
if [[ $REPLY =~ ^[Nn]$ ]]
then
    echo "
    # --- Manual Cleanup (Proceed with Extreme Caution) ---
    # 1. Identify Stuck Resources:
    #    - kubectl get pods -l 'app.kubernetes.io/instance=$RELEASE_NAME'
    #    - kubectl get services -l 'app.kubernetes.io/instance=$RELEASE_NAME'
    #    - ... other relevant kubectl commands ...
    # 2. Delete Stuck Resources (Carefully!):
    #    - kubectl delete <resource_type> <resource_name>
    # ---
    "
fi

# 5. Prevent Future Occurrences
echo "
# --- Prevent Future Occurrences ---
# - Ensure stable network connectivity during Helm operations.
# - Provide sufficient resources for your cluster.
# - Avoid interrupting Helm processes.
# ---
"

Explanation:

  1. Check Helm History: The script starts by fetching the release history using helm history.
  2. Rollback (Optional): If the user identifies a failed operation in the history, they can choose to rollback to a previous revision.
  3. Manual Inspection: The script prompts the user to manually inspect the cluster for resource conflicts, network issues, or insufficient resources.
  4. Manual Cleanup (Caution): If necessary, the script provides guidance on manually cleaning up stuck resources. This step requires extreme caution and should only be performed after careful consideration.
  5. Prevent Future Occurrences: The script concludes by reminding the user of best practices to prevent similar issues in the future.

This script provides a starting point for troubleshooting the "UPGRADE FAILED" error. Remember to adapt it to your specific environment and use caution when performing any manual operations.
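
To try the script end to end, one option is to save it locally and run it interactively (the filename below is only an illustration; helm and kubectl must already be configured for your cluster):

# Make the script executable and run it.
chmod +x helm-troubleshoot.sh
./helm-troubleshoot.sh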

Additional Notes

  • Helm Version Compatibility: Ensure that your Helm client and Tiller server (if using Helm 2) versions are compatible. Version mismatches can lead to unexpected behavior.
  • Tiller Issues (Helm 2): If using Helm 2, problems with the Tiller server (e.g., crashes, unresponsiveness) can cause this error. Check the status of Tiller and restart it if necessary.
  • Parallel Operations: Avoid running multiple Helm operations on the same release concurrently. This can lead to conflicts and unpredictable results. Use tools like helmfile or CI/CD pipelines to manage sequential deployments.
  • helm list --uninstalled: In some cases, releases marked as uninstalled might still hold locks. Use helm list --uninstalled (Helm 3) to check for such releases and remove them: helm delete --purge <RELEASE_NAME> on Helm 2, or helm uninstall <RELEASE_NAME> on Helm 3 (which removes the release history by default).
  • Kubernetes Events: Inspect Kubernetes events (kubectl get events) for more context on why the Helm operation might have failed. Events often provide valuable clues about resource conflicts, pod failures, or other issues.
  • Debug Helm: Use the --debug flag with Helm commands (e.g., helm upgrade --debug) to get more verbose output, which can help pinpoint the root cause of the problem.
  • Temporary Glitches: Sometimes, the issue might be due to temporary network glitches or resource constraints. Retrying the Helm operation after a short delay might resolve the problem.
  • Clean Up After Manual Deletion: If you manually delete resources, ensure that Helm's state is consistent. You might need to remove the release from Helm's records (helm delete --purge <RELEASE_NAME> on Helm 2, helm uninstall <RELEASE_NAME> on Helm 3), even if the underlying resources are already deleted; see the sketch after these notes.
  • Community Resources: If you're still facing issues, don't hesitate to seek help from the Kubernetes and Helm communities. Online forums, Stack Overflow, and the official documentation are valuable resources for finding solutions.
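
As a rough sketch of that cleanup note, assuming Helm 3 with its default Secret-backed storage driver (Helm 2 kept this state in Tiller's ConfigMaps instead), you can check whether a stale release record is still present and remove it:

# Helm 3 stores release records as Secrets in the release's namespace.
kubectl get secrets -n <NAMESPACE> -l 'owner=helm,name=<RELEASE_NAME>'

# Once you are sure nothing else depends on it, remove the release record.
helm uninstall <RELEASE_NAME> -n <NAMESPACE>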

Summary

This table summarizes how to troubleshoot the Helm error "UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress":

Step | Description | Command | Caution
1. Check Helm History | Identify pending or failed operations for the release. | helm history <RELEASE_NAME> |
2. Rollback to Previous Revision | Revert to a working state if a previous operation is stuck. | helm rollback <RELEASE_NAME> <REVISION_NUMBER> |
3. Verify and Resolve Underlying Issues | Investigate and address potential root causes. | |
  • Resource Conflicts | Check for conflicting resources in the cluster. | kubectl get ... |
  • Network Problems | Ensure stable network connectivity to the cluster. | |
  • Insufficient Resources | Verify adequate cluster resources (CPU, memory). | |
4. Manual Cleanup | Delete stuck resources associated with the failed release. | kubectl delete ... | Proceed with extreme caution! Data loss possible.
5. Prevent Future Occurrences | Take preventative measures to avoid similar errors. | |
  • Stable Network | Maintain a reliable network connection during operations. | |
  • Sufficient Resources | Ensure the cluster has enough resources. | |
  • Avoid Interruptions | Avoid interrupting Helm processes. | |

Note: If the issue persists, consult the documentation for your specific Helm version and Kubernetes distribution.

Conclusion

In conclusion, encountering the "UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress" error in Helm can be disruptive but is usually resolvable. By systematically checking Helm history, attempting rollbacks, and verifying your Kubernetes cluster's state, you can often pinpoint the issue. Remember to proceed with caution, especially when manually manipulating resources. Prioritizing a stable network, sufficient resources, and uninterrupted Helm operations will minimize the likelihood of encountering this error in the future. If problems persist, leverage the wealth of knowledge available in the Kubernetes and Helm communities and their respective documentation.
