Troubleshooting a Broken Kubernetes Cluster
Introduction
Every Kubernetes administrator will likely need to deal with a broken cluster at some point, whether a single node or the entire cluster is down. In this lab, you will be able to practice your troubleshooting skills. You will be presented with a broken Kubernetes cluster and asked to use your investigative skills to identify the problem and fix it.
Solution
Log in to the control plane node server using the credentials provided:
ssh cloud_user@<PUBLIC_IP_ADDRESS>
Determine What is Wrong with the Cluster
- Find out which node is having a problem by running `kubectl get nodes`, and identify any node in the NotReady state.
- Get more information on the node with `kubectl describe node <NODE_NAME>`.
- Look at the Conditions section of the node information to find out what is affecting the node's status and causing it to fail.
Log in to the worker 2 node server using the credentials provided:
ssh cloud_user@<PUBLIC_IP_ADDRESS>
- Look at the kubelet logs on the worker 2 node with `sudo journalctl -u kubelet`.
- Jump to the end of the log by pressing Shift + G and note the error messages stating that kubelet has stopped.
- Check the status of the kubelet service with `sudo systemctl status kubelet`, and note whether the service is running.
Fix the Problem
- To fix the problem, we need to not only start the kubelet service but also enable it, so that it continues to run if the server restarts in the future. Run `clear` to clear the status output, then enable and start kubelet with `sudo systemctl enable kubelet`, followed by `sudo systemctl start kubelet`.
- Check that kubelet is active with `sudo systemctl status kubelet`, and confirm the service is listed as active (running).
- Return to the control plane node.
- Check that all nodes are now in a Ready status with `kubectl get nodes`. You may have to wait and rerun the command a few times before all nodes appear as Ready.
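The node check above can also be scripted. Below is a minimal sketch that filters `kubectl get nodes` output for nodes that are not Ready; the sample output is illustrative, not captured from the lab cluster — on a live cluster you would pipe the real command output into the function instead.

```shell
# Filter nodes whose STATUS column is NotReady out of
# `kubectl get nodes` output, printing only the node name.
not_ready_nodes() {
  awk 'NR > 1 && $2 == "NotReady" { print $1 }'
}

# Illustrative sample output (node names and versions are assumptions);
# on a real cluster: kubectl get nodes | not_ready_nodes
sample_output='NAME            STATUS     ROLES           AGE   VERSION
control-plane   Ready      control-plane   10d   v1.27.0
worker-1        Ready      <none>          10d   v1.27.0
worker-2        NotReady   <none>          10d   v1.27.0'

printf '%s\n' "$sample_output" | not_ready_nodes
```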
Troubleshooting a Broken Kubernetes Application
Introduction
Kubernetes administrators need to be able to fix issues with applications running in a cluster. This lab will allow you to test your skills when it comes to fixing broken Kubernetes applications. You will be presented with a cluster running a broken application and asked to identify and correct the problem.
Solution
Log in to the server using the credentials provided:
ssh cloud_user@<PUBLIC_IP_ADDRESS>
Identify What is Wrong with the Application
- Examine the `web-consumer` deployment, which resides in the `web` namespace, and its Pods by running `kubectl get deployment -n web web-consumer`.
- Get more information with `kubectl describe deployment -n web web-consumer`, including the container specifications and the labels the Pods are using.
- Look more closely at the Pods with `kubectl get pods -n web` to see whether both Pods are up and running.
- Get more information about the Pods with `kubectl describe pod -n web <POD_NAME>`, and evaluate any warning messages that may come up.
- Look at the logs of the `busybox` container with `kubectl logs -n web <POD_NAME> -c busybox`, and determine what may be going wrong by reading the output.
- Take a closer look at the Pod itself with `kubectl get pod -n web <POD_NAME> -o yaml` to get the data in YAML format.
- Determine which command is causing the errors (in this case, the `while true; do curl auth-db; sleep 5; done` command).
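Rather than scanning the full YAML by eye, the container command can be pulled out with a small filter. The spec fragment below is an illustrative reconstruction of what this lab's Pod spec might contain; on a live cluster you would pipe `kubectl get pod -n web <POD_NAME> -o yaml` into the filter instead.

```shell
# Illustrative Pod spec fragment (an assumption, not captured from
# the lab cluster).
sample_spec='containers:
- name: busybox
  command:
  - sh
  - -c
  - while true; do curl auth-db; sleep 5; done'

# Print everything from the "command:" key onward, which is the
# part of the spec we need to inspect and later fix.
printf '%s\n' "$sample_spec" | sed -n '/command:/,$p'
```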
Fix the Problem
- Take a closer look at the service with `kubectl get svc -n web auth-db`.
- Locate where the service is by running `kubectl get namespaces` and finding the one other non-default namespace, called `data`.
- Run `kubectl get svc -n data` and find the `auth-db` service in this namespace, rather than the `web` namespace.
- Start resolving the issue with `kubectl edit deployment -n web web-consumer`.
- In the `spec` section, scroll down to the Pod template and locate the `while true; do curl auth-db; sleep 5; done` command.
- Change the command to `while true; do curl auth-db.data.svc.cluster.local; sleep 5; done` to give the fully qualified domain name of the service. This allows the `web-consumer` deployment's Pods to communicate with the service successfully.
- Save the file and exit by pressing the Esc key and typing `:wq`.
- Run `kubectl get pods -n web` to ensure that the old Pods have terminated and the new Pods are running successfully.
- Check the log of one of the new Pods with `kubectl logs -n web <POD_NAME> -c busybox`. This time the Pod should be able to communicate successfully with the service.
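The fix relies on Kubernetes service DNS naming: a bare service name such as auth-db only resolves from Pods in the same namespace, while the fully qualified form <service>.<namespace>.svc.<cluster-domain> works from any namespace. A minimal helper sketch, assuming the default cluster.local cluster domain:

```shell
# Build the fully qualified DNS name for a Kubernetes Service:
#   <service>.<namespace>.svc.<cluster-domain>
# "cluster.local" is the default cluster domain; yours may differ.
service_fqdn() {
  service="$1"
  namespace="$2"
  domain="${3:-cluster.local}"
  printf '%s.%s.svc.%s\n' "$service" "$namespace" "$domain"
}

# The name used in the corrected curl loop for this lab:
service_fqdn auth-db data
```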