Learn the common causes and troubleshooting steps for Kubernetes nodes stuck in the NotReady state, covering network issues, resource constraints, kubelet problems, and more.
Troubleshooting Kubernetes nodes stuck in a NotReady state can be challenging. This guide provides a systematic approach to identify and resolve common issues. We'll cover checking node status, inspecting logs, verifying resources, and more.
Check node status: Start by examining the status of your nodes using:
kubectl describe nodes
Pay close attention to the "Conditions", "Capacity", and "Allocatable" sections for any error messages or resource constraints.
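On a larger cluster it can help to narrow the list down to the affected nodes first. The following is a minimal sketch, assuming the default column layout of kubectl get nodes (name in the first column, status in the second); the function name not_ready_nodes is only illustrative.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column starts with "NotReady".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A status such as "NotReady,SchedulingDisabled" is matched as well.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Typical use on a machine with cluster access:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints can then be passed to kubectl describe node for a closer look at its conditions.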
Inspect kubelet logs: The kubelet reports the node's status to the control plane, so its logs are usually the most direct source of clues. Access them on the problematic node with:
journalctl -u kubelet
Look for errors related to container runtime, network issues, or resource exhaustion.
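Kubelet logs are verbose, so filtering for error-level entries is often a quicker first pass. The sketch below assumes the kubelet writes its default text log format, where each message starts with a severity letter followed by the date (for example E0105 for an error on January 5); it will not match if the kubelet is configured for JSON logging. The function name kubelet_errors is illustrative.

```shell
#!/usr/bin/env bash
# Keep only error (E) and fatal (F) lines from kubelet text logs.
# Reads log lines on stdin, e.g. from journalctl.
kubelet_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+'
}

# Typical use on the affected node:
#   journalctl -u kubelet --no-pager --since "1 hour ago" | kubelet_errors | tail -n 50
```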
Verify resource availability: Ensure sufficient CPU, memory, and disk space on the node. Use tools like top, free, and df to monitor resource usage.
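Disk usage is easy to overlook in df output, so a small filter can flag filesystems that are nearly full. This is a sketch that assumes the POSIX df -P layout (usage percentage in the fifth column, mount point in the sixth) and mount points without spaces. The 90% default is an assumption based on the kubelet's default hard eviction threshold of less than 10% free space on the node filesystem; adjust it if your cluster configures eviction differently.

```shell
#!/usr/bin/env bash
# Print mount points whose usage is at or above a threshold percentage.
# Reads `df -P` output on stdin; $1 is the threshold (default: 90).
disks_over() {
  local threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the trailing percent sign
    if (use + 0 >= t) print $6, $5
  }'
}

# Typical use on the affected node:
#   df -P | disks_over 90
```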
Check network connectivity: Confirm that the node can communicate with the control plane (master) node and other nodes in the cluster. Use ping, telnet, or curl to test connectivity.
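Because telnet is interactive, a non-interactive port check is easier to script. The sketch below uses Bash's built-in /dev/tcp redirection together with the coreutils timeout command. The ports in the usage comment are assumptions: 6443 is the common default for the API server and 10250 for the kubelet, but your cluster may use others. The function name port_open and the host placeholders are illustrative.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 3 seconds,
# non-zero otherwise. Requires bash and the coreutils `timeout` command.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Typical use (host names are placeholders):
#   port_open your-api-server-host 6443  && echo "API server reachable"
#   port_open your-other-node-host 10250 && echo "kubelet reachable"
```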
Inspect container runtime: If you're using Docker, check its status and logs:
systemctl status docker
journalctl -u docker
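For other container runtimes such as containerd or CRI-O, run the same commands against the corresponding systemd unit (containerd or crio), and consult their documentation for runtime-specific checks.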
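If you are unsure which runtime a node uses, the node object records it in status.nodeInfo.containerRuntimeVersion as a string such as containerd://1.7.2. The following sketch maps that string to the systemd unit to inspect. The unit names are the usual package defaults and are an assumption about your installation; the function name runtime_unit is illustrative.

```shell
#!/usr/bin/env bash
# Map a containerRuntimeVersion string to the usual systemd unit name.
# Prints "unknown" and returns 1 for runtimes it does not recognize.
runtime_unit() {
  case "$1" in
    containerd://*) echo "containerd" ;;
    docker://*)     echo "docker" ;;
    cri-o://*)      echo "crio" ;;
    *)              echo "unknown"; return 1 ;;
  esac
}

# Typical use:
#   rt=$(kubectl get node "$NODE_NAME" -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}')
#   unit=$(runtime_unit "$rt") && systemctl status "$unit" --no-pager
```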
Review recent changes: Consider any recent changes to the cluster, such as deployments, configuration updates, or network modifications. Revert changes if necessary.
Consult Kubernetes events: Examine Kubernetes events for clues related to the node's NotReady state:
kubectl get events --all-namespaces
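Listing every event in the cluster can be noisy. As a sketch, events can be narrowed to a single node with a field selector on the involved object. The function names here are illustrative, and sorting by .lastTimestamp is a convenience that assumes the events carry that field.

```shell
#!/usr/bin/env bash
# Build a field selector that matches events whose involved object
# is the Node with the given name.
node_event_selector() {
  printf 'involvedObject.kind=Node,involvedObject.name=%s' "$1"
}

# List events for one node, oldest first (requires cluster access).
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "$(node_event_selector "$1")" \
    --sort-by=.lastTimestamp
}

# Typical use:
#   node_events your-node-name
```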
Check control plane health: Ensure the Kubernetes control plane components (API server, scheduler, controller manager) are functioning correctly.
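One way to check the API server is its /readyz endpoint, which with the verbose parameter lists each check on a line prefixed by [+] (passed) or [-] (failed). The sketch below keeps only the failed checks. It assumes anonymous access to the health endpoints is enabled (the default, though many clusters disable it) and that the API server listens on port 6443; the host name is a placeholder and failed_checks is an illustrative name.

```shell
#!/usr/bin/env bash
# Keep only the failed checks from verbose /readyz (or /livez) output.
# Reads the endpoint's response body on stdin.
failed_checks() {
  awk 'index($0, "[-]") == 1'
}

# Typical use (-k skips TLS verification, so use it for diagnosis only):
#   curl -sk "https://your-api-server-host:6443/readyz?verbose" | failed_checks
```

No output means every listed check passed; the scheduler and controller manager are not covered by this endpoint and need to be checked separately.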
Restart kubelet: If other steps don't reveal the issue, restarting the kubelet service on the affected node might help:
systemctl restart kubelet
Consult cloud provider documentation: For cloud-based Kubernetes services like EKS or AKS, refer to their documentation for specific troubleshooting steps related to node health.
The following Bash script walks through each troubleshooting step outlined above: node status, kubelet and Docker logs, resource availability (CPU, memory, disk), network connectivity, Kubernetes events, control plane health, and a kubelet restart. Run it on the affected node with kubectl configured for the cluster. Replace the placeholders with actual values, and exercise caution because some commands require root privileges.
#!/bin/bash
# Node name (replace with the actual node name)
NODE_NAME="your-node-name"
# 1. Check node status
echo "1. Checking node status..."
kubectl describe nodes "$NODE_NAME"
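# 2. Inspect kubelet logs (run on the affected node)
echo "2. Inspecting kubelet logs..."
# Show recent entries without a pager; filtering by node name would hide most errors
journalctl -u kubelet --no-pager --since "1 hour ago" | tail -n 100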
# 3. Verify resource availability
echo "3. Verifying resource availability..."
echo "CPU usage:"
top -b -n 1 | head -n 5
echo "Memory usage:"
free -h
echo "Disk space:"
df -h
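# 4. Check network connectivity
echo "4. Checking network connectivity..."
# API server host (placeholder; replace with your control plane endpoint)
API_SERVER="your-api-server-host"
ping -c 4 "$API_SERVER"
# telnet is interactive and would block the script, so curl is used instead.
# -k skips TLS verification; 6443 is the common default API server port.
curl -k --max-time 5 "https://$API_SERVER:6443/healthz"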
# 5. Inspect container runtime (Docker example; substitute containerd or crio as needed)
echo "5. Inspecting container runtime (Docker)..."
systemctl status docker --no-pager
journalctl -u docker --no-pager -n 100
# 6. Review recent changes (manual inspection required)
echo "6. Review recent changes (manual inspection required)"
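# 7. Consult Kubernetes events
echo "7. Consulting Kubernetes events..."
# A field selector matches only events about this node, unlike grep on the node name
kubectl get events --all-namespaces --field-selector "involvedObject.kind=Node,involvedObject.name=$NODE_NAME"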
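# 8. Check control plane health (requires permission to read the API server health endpoints)
echo "8. Checking control plane health..."
# componentstatuses has been deprecated since Kubernetes v1.19; query the readiness endpoint instead
kubectl get --raw='/readyz?verbose'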
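# 9. Restart kubelet (requires root privileges; ask first because a restart is disruptive)
echo "9. Restart kubelet (requires root privileges)"
read -r -p "Restart kubelet on this node now? [y/N] " answer
if [ "$answer" = "y" ]; then
  sudo systemctl restart kubelet
fi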
# 10. Consult cloud provider documentation (if applicable)
echo "10. Consult cloud provider documentation (if applicable)"
Please note: Replace your-node-name with actual values.
General Tips: kubectl get nodes -o wide provides more details about nodes. kubectl describe node <node-name> offers in-depth information.
Specific Points:
Node Status: Check the node conditions, in particular Ready, NetworkUnavailable, MemoryPressure, and DiskPressure.
Kubelet Logs: Increase the kubelet log verbosity if the default output is not enough (for example, --v=5 for more detail).
Kubernetes Events: Use -n <namespace> to focus on specific namespaces.
Remember: Troubleshooting is a process of elimination. By systematically investigating each area, you'll increase your chances of identifying and resolving the root cause of the "NotReady" state.
This guide provides a 10-step approach to troubleshoot a Kubernetes node stuck in the "NotReady" state:
1. Node Status Inspection: Begin by examining the node's status using kubectl describe nodes, focusing on "Conditions", "Capacity", and "Allocatable" for errors or resource limitations.
2. Kubelet Log Analysis: Analyze the kubelet logs on the problematic node using journalctl -u kubelet, looking for errors related to container runtime, network, or resource exhaustion.
3. Resource Availability Verification: Ensure sufficient CPU, memory, and disk space on the node using tools like top, free, and df.
4. Network Connectivity Testing: Confirm network connectivity between the node, the control plane (master) node, and other cluster nodes using ping, telnet, or curl.
5. Container Runtime Inspection: Check the status and logs of your container runtime (e.g., Docker) using commands like systemctl status docker and journalctl -u docker.
6. Recent Change Review: Analyze recent cluster changes, such as deployments, configuration updates, or network modifications, and revert if necessary.
7. Kubernetes Event Examination: Review Kubernetes events for clues related to the "NotReady" state using kubectl get events --all-namespaces.
8. Control Plane Health Check: Ensure the Kubernetes control plane components (API server, scheduler, controller manager) are functioning correctly.
9. Kubelet Restart: If other steps fail, restart the kubelet service on the affected node with systemctl restart kubelet.
10. Cloud Provider Documentation: For cloud-based Kubernetes services, consult their documentation for specific troubleshooting steps related to node health.
By following this systematic approach, you can effectively diagnose and resolve issues causing Kubernetes nodes to be stuck in the NotReady state, ensuring a healthy and operational cluster. Remember to consult your cloud provider's documentation for specific guidance and support when working with managed Kubernetes services. Regular monitoring and proactive maintenance practices can help prevent many common node health problems.