Troubleshooting a Broken Kubernetes Cluster
Every Kubernetes administrator will likely need to deal with a broken cluster at some point, whether a single node or the entire cluster is down. In this lab, you will be able to practice your troubleshooting skills. You will be presented with a broken Kubernetes cluster and asked to use your investigative skills to identify the problem and fix it.
Log in to the control plane node server using the credentials provided:
Determine What is Wrong with the Cluster
Find out which node is having a problem by using kubectl get nodes. Identify whether any node is in the NotReady state.
Get more information on the node by using kubectl describe node <NODE_NAME>.
Look at the Conditions section of the output to find out what is affecting the node's status and causing it to fail.
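As a sketch, the first check can also be scripted: the snippet below filters kubectl get nodes output down to only the NotReady nodes. The sample output and node names are placeholders standing in for a live cluster.

```shell
# Filter `kubectl get nodes` output down to NotReady nodes.
# The sample output below stands in for a live cluster (node names are illustrative).
sample='NAME          STATUS     ROLES           AGE   VERSION
k8s-control   Ready      control-plane   10d   v1.27.0
k8s-worker1   Ready      <none>          10d   v1.27.0
k8s-worker2   NotReady   <none>          10d   v1.27.0'

# On a real cluster, pipe kubectl directly:
#   kubectl get nodes | awk 'NR>1 && $2 == "NotReady" {print $1}'
echo "$sample" | awk 'NR>1 && $2 == "NotReady" {print $1}'
# prints: k8s-worker2
```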
Log in to the worker 2 node server using the credentials provided:
Look at the kubelet logs on the worker 2 node by using sudo journalctl -u kubelet.
Go to the end of the log by pressing Shift + G and see the error messages stating that kubelet has stopped.
Check the status of the kubelet service by using sudo systemctl status kubelet, and note whether the service is running.
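The log and status checks above can be condensed; a sketch, assuming a systemd-based host (as kubeadm-built nodes typically are):

```shell
# Show only the most recent kubelet log entries instead of paging with Shift+G.
sudo journalctl -u kubelet --no-pager -n 50

# Check whether the kubelet service is running.
sudo systemctl status kubelet --no-pager
```

These are command fragments for a live node, so there is nothing to run locally; -n 50 simply limits journalctl to the last 50 lines.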
Fix the Problem
- In order to fix the problem, we need to not only start the kubelet service but also enable it, so that it starts again automatically if the server reboots in the future. Use clear to clear the terminal, then enable and start kubelet by using sudo systemctl enable kubelet, followed by sudo systemctl start kubelet.
- Check if kubelet is active by using sudo systemctl status kubelet, and note whether the service is listed as active (running).
- Return to the control plane node.
- Check if all nodes are now in a Ready status by using kubectl get nodes. You may need to run the command a few times before all nodes appear as Ready.
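The fix itself can be sketched as a short sequence to run on the worker node; systemctl also offers a one-step form that enables and starts a unit together:

```shell
# Make the fix stick across reboots, then start the service now.
sudo systemctl enable kubelet
sudo systemctl start kubelet
# Equivalent one-step form: sudo systemctl enable --now kubelet

# Confirm the result: prints "active" when the service is running.
sudo systemctl is-active kubelet
```

These commands assume a systemd host with kubelet installed as a unit, so they are a fragment to run on the worker rather than something runnable here.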
Troubleshooting a Broken Kubernetes Application
Kubernetes administrators need to be able to fix issues with applications running in a cluster. This lab will allow you to test your skills when it comes to fixing broken Kubernetes applications. You will be presented with a cluster running a broken application and asked to identify and correct the problem.
Log in to the server by using the credentials provided:
Identify What is Wrong with the Application
- Examine the web-consumer deployment, which resides in the web namespace, by using kubectl get deployment -n web web-consumer.
- Get more information, including the container specifications and the labels that the Pods are using, by using kubectl describe deployment -n web web-consumer.
- Look more closely at the Pods by using kubectl get pods -n web to see if both Pods are up and running.
- Get more information about the Pods by using kubectl describe pod -n web <POD_NAME>, and evaluate any warning messages that come up.
- Look at the logs associated with the container by using kubectl logs -n web <POD_NAME> -c busybox.
- Determine what may be going wrong by reading the output from the container logs.
- Take a closer look at the Pod itself by using kubectl get pod -n web <POD_NAME> -o yaml to get the data in YAML format.
- Determine which command is causing the errors (in this case, the while true; do curl auth-db; sleep 5; done command).
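In the Pod YAML, that command typically sits under the busybox container spec, roughly as below. This fragment is illustrative; field values such as the image tag may differ in your cluster.

```yaml
# Fragment of `kubectl get pod -n web <POD_NAME> -o yaml` (illustrative).
spec:
  containers:
  - name: busybox
    image: busybox
    command:
    - /bin/sh
    - -c
    - while true; do curl auth-db; sleep 5; done
```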
Fix the Problem
- Take a closer look at the service by using kubectl get svc -n web auth-db, and note that no auth-db service is found in the web namespace.
- Locate where the service is by using kubectl get namespaces and finding the one other non-default namespace, called data. Then use kubectl get svc -n data and find the auth-db service in this namespace, rather than the web namespace.
- Start resolving the issue by using kubectl edit deployment -n web web-consumer.
- In the spec section, scroll down to find the Pod template and locate the while true; do curl auth-db; sleep 5; done command.
- Change the command to while true; do curl auth-db.data.svc.cluster.local; sleep 5; done to give the fully qualified domain name of that service. This will allow the web-consumer deployment's Pods to communicate with the service successfully.
- Save the file and exit (press the ESC key, then type :wq and press Enter). Then use kubectl get pods -n web to ensure that the old Pods have terminated and the new Pods are running successfully.
- Check the log of one of the new Pods by using kubectl logs -n web <POD_NAME> -c busybox. This time the Pod should be able to communicate successfully with the service.
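The fix works because of how Kubernetes names services in DNS: every service is reachable at <service>.<namespace>.svc.<cluster-domain>, where cluster.local is the default cluster domain. A minimal sketch of building that name (assuming the default cluster domain is unchanged):

```shell
# Build the FQDN for a Kubernetes service.
# "cluster.local" is the default cluster domain; adjust if your cluster differs.
service="auth-db"
namespace="data"
fqdn="${service}.${namespace}.svc.cluster.local"
echo "$fqdn"
# prints: auth-db.data.svc.cluster.local
```

Within the data namespace, the short name auth-db would resolve on its own; from another namespace such as web, the Pod must use at least auth-db.data, with the full FQDN being the most explicit form.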