Troubleshooting a Broken Kubernetes Cluster
Introduction
Every Kubernetes administrator will likely need to deal with a broken cluster at some point, whether a single node or the entire cluster is down. In this lab, you will be able to practice your troubleshooting skills. You will be presented with a broken Kubernetes cluster and asked to use your investigative skills to identify the problem and fix it.
Solution
Log in to the control plane node server using the credentials provided:
ssh cloud_user@<PUBLIC_IP_ADDRESS>
Determine What is Wrong with the Cluster
- Find out which node is having a problem by using kubectl get nodes. Identify whether a node is in the NotReady state.
- Get more information on the node by using kubectl describe node <NODE_NAME>.
- Look for the Conditions section of the node information and find out what is affecting the node's status, causing it to fail. (See the sample output below.)
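For reference, the output should look roughly like the following on a cluster where the worker 2 node's kubelet has stopped (the node names, ages, and versions here are illustrative):

kubectl get nodes

NAME          STATUS     ROLES           AGE   VERSION
k8s-control   Ready      control-plane   90d   v1.27.0
k8s-worker1   Ready      <none>          90d   v1.27.0
k8s-worker2   NotReady   <none>          90d   v1.27.0

In kubectl describe node k8s-worker2, the Conditions section will show the Ready condition with a status of Unknown and a message such as "Kubelet stopped posting node status."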
Log in to the worker 2 node server using the credentials provided:
ssh cloud_user@<PUBLIC_IP_ADDRESS>
- Look at the kubelet logs on the worker 2 node by using sudo journalctl -u kubelet.
- Go to the end of the log by pressing Shift + G and note the error messages stating that kubelet has stopped.
- Look at the status of the kubelet service by using sudo systemctl status kubelet, and note whether the service is running, as in the sample output below.
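The exact log lines vary, but when the kubelet service has been stopped and disabled, the status output looks something like this (paths and truncated details are illustrative):

sudo systemctl status kubelet

○ kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; disabled; ...)
     Active: inactive (dead)

Active: inactive (dead), together with disabled in the Loaded line, tells you that kubelet is not running and will not start automatically after a reboot.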
Fix the Problem
- In order to fix the problem, we need to not only start the kubelet service but also enable it, to ensure that it continues to run if the server restarts in the future. Use clear to clear the old status output from the screen, then enable and start kubelet by using sudo systemctl enable kubelet, followed by sudo systemctl start kubelet.
- Check whether kubelet is active by using sudo systemctl status kubelet, and note whether the service is listed as active (running). The full sequence should look something like the example below.
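A minimal sketch of the fix and its verification (the status output shown is illustrative):

clear
sudo systemctl enable kubelet
sudo systemctl start kubelet
sudo systemctl status kubelet

● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; ...)
     Active: active (running)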
- Return to the control plane node.
- Check whether all nodes are now in a Ready status by using kubectl get nodes. You may have to wait and run the command a few times before all nodes appear as Ready, as in the sample output below.
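Once kubelet is back up, all nodes should report Ready (again, names and versions are illustrative):

kubectl get nodes

NAME          STATUS   ROLES           AGE   VERSION
k8s-control   Ready    control-plane   90d   v1.27.0
k8s-worker1   Ready    <none>          90d   v1.27.0
k8s-worker2   Ready    <none>          90d   v1.27.0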
Troubleshooting a Broken Kubernetes Application
Introduction
Kubernetes administrators need to be able to fix issues with applications running in a cluster. This lab will allow you to test your skills when it comes to fixing broken Kubernetes applications. You will be presented with a cluster running a broken application and asked to identify and correct the problem.
Solution
Log in to the server by using the credentials provided: ssh cloud_user@<PUBLIC_IP_ADDRESS>.
Identify What is Wrong with the Application
- Examine the web-consumer deployment, which resides in the web namespace, and its Pods by using kubectl get deployment -n web web-consumer.
- Get more information by using kubectl describe deployment -n web web-consumer, including the container specifications and the labels that the Pods are using.
- Look more closely at the Pods by using kubectl get pods -n web to see if both Pods are up and running.
- Get more information about the Pods by using kubectl describe pod -n web <POD_NAME>, and evaluate any warning messages that come up.
- Look at the logs associated with the busybox container by using kubectl logs -n web <POD_NAME> -c busybox.
- Determine what may be going wrong by reading the output from the container logs, which should resemble the sample below.
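The exact wording depends on the curl build in the image, but the container logs should show repeated DNS resolution failures along these lines:

kubectl logs -n web <POD_NAME> -c busybox

curl: (6) Couldn't resolve host 'auth-db'
curl: (6) Couldn't resolve host 'auth-db'
curl: (6) Couldn't resolve host 'auth-db'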
- Take a closer look at the Pod itself by using kubectl get pod -n web <POD_NAME> -o yaml to get the data in YAML format.
- Determine which command is causing the errors (in this case, the while true; do curl auth-db; sleep 5; done command); the relevant part of the container spec is sketched below.
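In the Pod's YAML, the failing command appears in the container spec. A sketch of the relevant portion is below; the image name and surrounding fields are illustrative, not taken from the lab:

containers:
- name: busybox
  image: radial/busyboxplus:curl
  command:
  - sh
  - -c
  - while true; do curl auth-db; sleep 5; done

The command uses the bare Service name auth-db, which cluster DNS can only resolve if the Service lives in the same namespace as the Pod.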
Fix the Problem
- Take a closer look at the service by using kubectl get svc -n web auth-db.
- Locate where the service is by using kubectl get namespaces and finding the one other non-default namespace, called data.
- Run kubectl get svc -n data and find the auth-db service in this namespace, rather than the web namespace. (Sample output below.)
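There is no auth-db Service in the web namespace, but there is one in data. The output should look roughly like this (IP, port, and age are illustrative):

kubectl get svc -n web auth-db

Error from server (NotFound): services "auth-db" not found

kubectl get svc -n data

NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
auth-db   ClusterIP   10.104.32.17   <none>        80/TCP    90d

Because the Service is in another namespace, Pods in web must address it by its fully qualified name, auth-db.data.svc.cluster.local.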
- Start resolving the issue by using kubectl edit deployment -n web web-consumer.
- In the spec section, scroll down to find the Pod template and locate the while true; do curl auth-db; sleep 5; done command.
- Change the command to while true; do curl auth-db.data.svc.cluster.local; sleep 5; done to give the fully qualified domain name of the service. This will allow the web-consumer deployment's Pods to communicate with the service successfully. (A sketch of the edited spec follows this list.)
- Save the file and exit by pressing the Esc key and typing :wq.
- Check kubectl get pods -n web to ensure that the old Pods have terminated and the new Pods are running successfully.
- Check the log of one of the new Pods by using kubectl logs -n web <POD_NAME> -c busybox. This time the Pod should be able to communicate successfully with the service.
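After saving, the container command in the Deployment's Pod template should read as follows (assuming the same sh -c wrapper sketched earlier; only the curl target changes):

command:
- sh
- -c
- while true; do curl auth-db.data.svc.cluster.local; sleep 5; done

Because a Deployment rolls out new Pods whenever its Pod template changes, the old Pods terminate automatically, and the new Pods' logs should show successful responses from auth-db instead of resolution errors.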