Kubernetes Nodes Not Ready: Troubleshooting Guide

By Jan on 02/09/2025

Learn the common causes and troubleshooting steps for Kubernetes nodes stuck in 'Not Ready' state, covering network issues, resource constraints, kubelet problems, and more.

Introduction

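A node is marked NotReady when its Ready condition is False or Unknown, typically because the kubelet has stopped reporting a healthy status to the control plane. Causes range from network failures and resource exhaustion to kubelet or container runtime faults. This guide takes a systematic approach to finding and resolving them: checking node status, inspecting logs, verifying resources, and more.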

Step-by-Step Guide

  1. Check node status: Start by examining the status of your nodes using:

    kubectl describe nodes

    Pay close attention to the "Conditions", "Capacity", and "Allocatable" sections for any error messages or resource constraints.

  2. Inspect kubelet logs: The kubelet is responsible for node health. Access its logs on the problematic node with:

    journalctl -u kubelet

    Look for errors related to container runtime, network issues, or resource exhaustion.

  3. Verify resource availability: Ensure sufficient CPU, memory, and disk space on the node. Use tools like top, free, and df to monitor resource usage.

  4. Check network connectivity: Confirm that the node can reach the control plane (in particular the API server, commonly on port 6443) and the other nodes in the cluster. Use ping, telnet, or curl to test connectivity.

  5. Inspect container runtime: If you're using Docker, check its status and logs:

    systemctl status docker
    journalctl -u docker

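    For other container runtimes such as containerd or CRI-O (the norm since dockershim was removed in Kubernetes 1.24), substitute the service name, for example systemctl status containerd, and consult the runtime's documentation.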

  6. Review recent changes: Consider any recent changes to the cluster, such as deployments, configuration updates, or network modifications. Revert changes if necessary.

  7. Consult Kubernetes events: Examine Kubernetes events for clues related to the node's NotReady state:

    kubectl get events --all-namespaces

  8. Check control plane health: Ensure the Kubernetes control plane components (API server, scheduler, controller manager) are functioning correctly.

  9. Restart kubelet: If other steps don't reveal the issue, restarting the kubelet service on the affected node might help:

    systemctl restart kubelet

  10. Consult cloud provider documentation: For cloud-based Kubernetes services like EKS or AKS, refer to their documentation for specific troubleshooting steps related to node health.
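The "Conditions" section from step 1 can also be checked mechanically. The following is a minimal sketch, not a definitive tool: the function names unhealthy_conditions and node_conditions are made up for illustration, and it assumes that Ready is the only condition where True is the healthy value, which holds for the built-in conditions but may not hold for custom ones.

```shell
#!/bin/bash

# Print conditions whose status differs from the healthy value.
# Reads "Type=Status" lines on stdin, e.g. "Ready=True".
# Assumption: Ready should be True, every other condition should be False.
unhealthy_conditions() {
  while IFS='=' read -r type status; do
    [ -z "$type" ] && continue
    if [ "$type" = "Ready" ]; then
      expected="True"
    else
      expected="False"
    fi
    if [ "$status" != "$expected" ]; then
      echo "$type is $status (expected $expected)"
    fi
  done
}

# Fetch the conditions of one node in "Type=Status" form.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

To use it, run node_conditions your-node-name | unhealthy_conditions. Empty output means every condition has its healthy value.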

Code Example

The following Bash script runs an example command for each troubleshooting step outlined above, for a single node: node status, kubelet and container runtime logs, resource availability (CPU, memory, disk), network connectivity, Kubernetes events, control plane health, and a kubelet restart. Steps 6 and 10 are manual and appear only as reminders. Replace the placeholders with actual values, and note that restarting the kubelet requires root privileges.

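#!/bin/bash
# Run on the affected node, as a user with kubectl access to the cluster.

# Placeholders: replace with your actual values
NODE_NAME="your-node-name"
API_HOST="your-api-server-host"   # API server address
API_PORT="6443"                   # common default; adjust if different
RUNTIME="docker"                  # or containerd, crio

# 1. Check node status
echo "1. Checking node status..."
kubectl describe node "$NODE_NAME"

# 2. Inspect kubelet logs (last hour only, no pager)
echo "2. Inspecting kubelet logs..."
journalctl -u kubelet --no-pager --since "1 hour ago"

# 3. Verify resource availability
echo "3. Verifying resource availability..."
echo "CPU usage:"
top -b -n 1 | head -n 5
echo "Memory usage:"
free -h
echo "Disk space:"
df -h
echo "Inode usage:"
df -i

# 4. Check network connectivity to the API server
echo "4. Checking network connectivity..."
ping -c 4 "$API_HOST"
# Non-interactive TCP check (telnet would wait for input)
timeout 5 bash -c "</dev/tcp/$API_HOST/$API_PORT" \
  && echo "port $API_PORT reachable" || echo "port $API_PORT unreachable"
# Any HTTP status code here (even 401/403) shows the API server answered
curl -k -sS -o /dev/null -w '%{http_code}\n' --max-time 5 "https://$API_HOST:$API_PORT/healthz"

# 5. Inspect container runtime
echo "5. Inspecting container runtime ($RUNTIME)..."
systemctl status "$RUNTIME" --no-pager
journalctl -u "$RUNTIME" --no-pager --since "1 hour ago"

# 6. Review recent changes (manual inspection required)
echo "6. Review recent changes (manual inspection required)"

# 7. Consult Kubernetes events for this node
echo "7. Consulting Kubernetes events..."
kubectl get events --all-namespaces \
  --field-selector "involvedObject.kind=Node,involvedObject.name=$NODE_NAME"

# 8. Check control plane health (componentstatuses is deprecated since v1.19)
echo "8. Checking control plane health..."
kubectl get --raw='/readyz?verbose'
kubectl get pods -n kube-system

# 9. Restart kubelet (requires root privileges); ask first, since this changes node state
read -r -p "9. Restart kubelet now? [y/N] " answer
if [ "$answer" = "y" ]; then
  sudo systemctl restart kubelet
fi

# 10. Consult cloud provider documentation (if applicable)
echo "10. Consult cloud provider documentation (if applicable)"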

Please note:

  • This script provides a basic framework. You might need to adjust commands and parameters based on your specific environment and configuration.
  • Some commands require root privileges.
  • Replace placeholders like your-node-name with actual values.
  • This script is for informational purposes only. Back up your data and proceed with caution when making changes to your Kubernetes cluster.

Additional Notes

General Tips:

  • Be systematic: Follow the steps in order, but don't be afraid to investigate clues as they appear.
  • Gather context: Note the time the issue started, recent changes, and any relevant error messages.
  • Check for patterns: Are multiple nodes affected? Do they share any similarities?
  • Use kubectl wisely: kubectl get nodes -o wide provides more details about nodes. kubectl describe node <node-name> offers in-depth information.
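To check whether several nodes are affected at once, the output of kubectl get nodes can be filtered. This is a small sketch under stated assumptions: not_ready_nodes is a hypothetical helper name, and the filter assumes the STATUS column is the second field and begins with Ready for healthy nodes (for example Ready or Ready,SchedulingDisabled).

```shell
#!/bin/bash

# Print the names of nodes whose STATUS column does not begin with "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}
```

A typical call would be kubectl get nodes --no-headers | not_ready_nodes. If several names come back, compare them with kubectl get nodes -o wide for shared traits such as zone, OS image, or kubelet version.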

Specific Points:

  1. Node Status:

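    • Conditions: On a healthy node, Ready is True and NetworkUnavailable, MemoryPressure, DiskPressure, and PIDPressure are False. A Ready status of False or Unknown, or any pressure condition reporting True, points to the problem area.
    • Capacity vs. Allocatable: Allocatable is Capacity minus the resources reserved for the system and the kubelet, so some gap is normal. An unusually large gap can indicate overly aggressive resource reservations.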
  2. Kubelet Logs:

    • Log levels: Increase verbosity if needed (--v=5 for more detail).
    • Time filters: Focus on logs around the time the issue began.
  3. Resource Availability:

    • Don't forget inodes: Exhausted inodes can cause issues even with free disk space.
    • Monitor trends: A gradual increase in resource usage can point to leaks.
  4. Network Connectivity:

    • DNS resolution: Ensure the node can resolve cluster DNS names.
    • Firewall rules: Verify no firewall rules are blocking communication.
  5. Container Runtime:

    • Runtime specific: Commands and logs vary (e.g., containerd, CRI-O).
    • Image pulls: Check if nodes can pull images from the registry.
  6. Recent Changes:

    • Roll back cautiously: Reverting changes can have unintended consequences.
    • Configuration management: Use tools like Git to track cluster changes.
  7. Kubernetes Events:

    • Filter effectively: Use -n <namespace> to focus on specific namespaces, or --field-selector involvedObject.kind=Node to show only node events.
    • Event reasons: Pay attention to the "Reason" field for insights.
  8. Control Plane Health:

    • API server availability: The cluster's heart - ensure it's reachable.
    • etcd health: The cluster's data store - problems here are critical.
  9. Restart Kubelet:

    • Last resort: Restarting can mask underlying issues.
    • Drain node first: Safely evict pods before restarting to avoid disruption.
  10. Cloud Provider Documentation:

    • Cloud-specific issues: Network plugins, load balancers, etc., can cause problems.
    • Support channels: Utilize cloud provider support for complex issues.
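For the resource availability point, disk space and inode usage can be scanned with the same filter. The sketch below is illustrative only: over_threshold is a made-up helper name, the 90 percent default is an arbitrary choice rather than a kubelet setting, and mount points containing spaces are not handled.

```shell
#!/bin/bash

# Print mount points whose usage percentage is at or above a threshold.
# Reads `df -P` (disk space) or `df -Pi` (inodes) output on stdin.
# $1 is the threshold in percent (default 90, an arbitrary choice).
over_threshold() {
  awk -v limit="${1:-90}" 'NR > 1 {
    pct = $5
    sub(/%$/, "", pct)
    # Skip entries such as "-" that report no percentage
    if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6 " " $5
  }'
}
```

For example, df -Pi | over_threshold 90 lists filesystems that are close to running out of inodes, and df -P | over_threshold 85 does the same for disk space with a lower limit.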

Remember: Troubleshooting is a process of elimination. By systematically investigating each area, you'll increase your chances of identifying and resolving the root cause of the "NotReady" state.

Summary

This guide provides a 10-step approach to troubleshoot a Kubernetes node stuck in the "NotReady" state:

1. Node Status Inspection: Begin by examining the node's status using kubectl describe nodes, focusing on "Conditions", "Capacity", and "Allocatable" for errors or resource limitations.

2. Kubelet Log Analysis: Analyze the kubelet logs on the problematic node using journalctl -u kubelet, looking for errors related to container runtime, network, or resource exhaustion.

3. Resource Availability Verification: Ensure sufficient CPU, memory, and disk space on the node using tools like top, free, and df.

4. Network Connectivity Testing: Confirm network connectivity between the node, the control plane (API server), and the other cluster nodes using ping, telnet, or curl.

5. Container Runtime Inspection: Check the status and logs of your container runtime (e.g., Docker) using commands like systemctl status docker and journalctl -u docker.

6. Recent Change Review: Analyze recent cluster changes, such as deployments, configuration updates, or network modifications, and revert if necessary.

7. Kubernetes Event Examination: Review Kubernetes events for clues related to the "NotReady" state using kubectl get events --all-namespaces.

8. Control Plane Health Check: Ensure the Kubernetes control plane components (API server, scheduler, controller manager) are functioning correctly.

9. Kubelet Restart: If other steps fail, restart the kubelet service on the affected node with systemctl restart kubelet.

10. Cloud Provider Documentation: For cloud-based Kubernetes services, consult their documentation for specific troubleshooting steps related to node health.

Conclusion

By following this systematic approach, you can effectively diagnose and resolve issues causing Kubernetes nodes to be stuck in the NotReady state, ensuring a healthy and operational cluster. Remember to consult your cloud provider's documentation for specific guidance and support when working with managed Kubernetes services. Regular monitoring and proactive maintenance practices can help prevent many common node health problems.
