Troubleshooting a Broken Kubernetes Cluster
Every Kubernetes administrator will likely need to deal with a broken cluster at some point, whether a single node or the entire cluster is down. In this lab, you will be able to practice your troubleshooting skills. You will be presented with a broken Kubernetes cluster and asked to use your investigative skills to identify the problem and fix it.
Log in to the control plane node server using the credentials provided:
Determine What is Wrong with the Cluster
Find out which node is having a problem by using kubectl get nodes. Identify whether any node is in the NotReady state.
Get more information on the node by using kubectl describe node <NODE_NAME>.
Look at the Conditions section of the output to find out what is affecting the node's status and causing it to fail.
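As a sketch, the first check can also be scripted: the snippet below filters kubectl get nodes output down to only the NotReady nodes. The sample output and node names are placeholders standing in for a live cluster.

```shell
# Filter `kubectl get nodes` output down to NotReady nodes.
# The sample output below stands in for a live cluster (node names are illustrative).
sample='NAME          STATUS     ROLES           AGE   VERSION
k8s-control   Ready      control-plane   10d   v1.27.0
k8s-worker1   Ready      <none>          10d   v1.27.0
k8s-worker2   NotReady   <none>          10d   v1.27.0'

# On a real cluster, pipe kubectl directly:
#   kubectl get nodes | awk 'NR>1 && $2 == "NotReady" {print $1}'
echo "$sample" | awk 'NR>1 && $2 == "NotReady" {print $1}'
# prints: k8s-worker2
```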
Log in to the worker 2 node server using the credentials provided:
Look at the kubelet logs on the worker 2 node by using sudo journalctl -u kubelet.
Go to the end of the log by pressing Shift + G and see the error messages stating that kubelet has stopped.
Check the status of the kubelet service by using sudo systemctl status kubelet, and note whether the service is running.
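The log and status checks above can be condensed; a sketch, assuming a systemd-based host (as kubeadm-built nodes typically are):

```shell
# Show only the most recent kubelet log entries instead of paging with Shift+G.
sudo journalctl -u kubelet --no-pager -n 50

# Check whether the kubelet service is running.
sudo systemctl status kubelet --no-pager
```

These are command fragments for a live node, so there is nothing to run locally; -n 50 simply limits journalctl to the last 50 lines.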
Fix the Problem
- In order to fix the problem, we need to not only start the kubelet service but also enable it, so that it starts again automatically if the server reboots in the future. Use clear to clear the terminal, then enable and start kubelet by using sudo systemctl enable kubelet, followed by sudo systemctl start kubelet.
- Check if kubelet is active by using sudo systemctl status kubelet, and note whether the service is listed as active (running).
- Return to the control plane node.
- Check if all nodes are now in a Ready status by using kubectl get nodes. You may need to run the command a few times before all nodes appear as Ready.
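The fix itself can be sketched as a short sequence to run on the worker node; systemctl also offers a one-step form that enables and starts a unit together:

```shell
# Make the fix stick across reboots, then start the service now.
sudo systemctl enable kubelet
sudo systemctl start kubelet
# Equivalent one-step form: sudo systemctl enable --now kubelet

# Confirm the result: prints "active" when the service is running.
sudo systemctl is-active kubelet
```

These commands assume a systemd host with kubelet installed as a unit, so they are a fragment to run on the worker rather than something runnable here.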
Troubleshooting a Broken Kubernetes Application
Kubernetes administrators need to be able to fix issues with applications running in a cluster. This lab will allow you to test your skills when it comes to fixing broken Kubernetes applications. You will be presented with a cluster running a broken application and asked to identify and correct the problem.
Log in to the server by using the credentials provided:
Identify What is Wrong with the Application
- Examine the web-consumer deployment, which resides in the web namespace, by using kubectl get deployment -n web web-consumer.
- Get more information, including the container specifications and the labels that the Pods are using, by using kubectl describe deployment -n web web-consumer.
- Look more closely at the Pods by using kubectl get pods -n web to see if both Pods are up and running.
- Get more information about the Pods by using kubectl describe pod -n web <POD_NAME>, and evaluate any warning messages that come up.
- Look at the logs associated with the container by using kubectl logs -n web <POD_NAME> -c busybox.
- Determine what may be going wrong by reading the output from the container logs.
- Take a closer look at the Pod itself by using kubectl get pod -n web <POD_NAME> -o yaml to get the data in YAML format.
- Determine which command is causing the errors (in this case, the while true; do curl auth-db; sleep 5; done command).
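In the Pod YAML, that command typically sits under the busybox container spec, roughly as below. This fragment is illustrative; field values such as the image tag may differ in your cluster.

```yaml
# Fragment of `kubectl get pod -n web <POD_NAME> -o yaml` (illustrative).
spec:
  containers:
  - name: busybox
    image: busybox
    command:
    - /bin/sh
    - -c
    - while true; do curl auth-db; sleep 5; done
```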
Fix the Problem
- Take a closer look at the service by using kubectl get svc -n web auth-db, and note that no auth-db service is found in the web namespace.
- Locate where the service is by using kubectl get namespaces and finding the one other non-default namespace, called data. Then use kubectl get svc -n data and find the auth-db service in this namespace, rather than the web namespace.
- Start resolving the issue by using kubectl edit deployment -n web web-consumer.
- In the spec section, scroll down to find the Pod template and locate the while true; do curl auth-db; sleep 5; done command.
- Change the command to while true; do curl auth-db.data.svc.cluster.local; sleep 5; done to give the fully qualified domain name of that service. This will allow the web-consumer deployment's Pods to communicate with the service successfully.
- Save the file and exit (press the ESC key, then type :wq and press Enter). Then use kubectl get pods -n web to ensure that the old Pods have terminated and the new Pods are running successfully.
- Check the log of one of the new Pods by using kubectl logs -n web <POD_NAME> -c busybox. This time the Pod should be able to communicate successfully with the service.
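The fix works because of how Kubernetes names services in DNS: every service is reachable at <service>.<namespace>.svc.<cluster-domain>, where cluster.local is the default cluster domain. A minimal sketch of building that name (assuming the default cluster domain is unchanged):

```shell
# Build the FQDN for a Kubernetes service.
# "cluster.local" is the default cluster domain; adjust if your cluster differs.
service="auth-db"
namespace="data"
fqdn="${service}.${namespace}.svc.cluster.local"
echo "$fqdn"
# prints: auth-db.data.svc.cluster.local
```

Within the data namespace, the short name auth-db would resolve on its own; from another namespace such as web, the Pod must use at least auth-db.data, with the full FQDN being the most explicit form.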