Debugging Kubernetes


Earlier this morning I was debugging what turned out to be an issue with the node configuration in my local k8s dev cluster, and it got me thinking about how easy it is to debug these things in general ... so here we go!

Background

Last weekend I decided to increase the amount of ephemeral storage available to the pods in my local dev cluster. At the same time I decided to add a new node to the cluster ... which, it turns out, is where the issue actually resided... PEBCAK!!!

Anyway, the project in question was a local Harbor install that I use for developing and debugging apps and services on my homelab.

Now then, Harbor is a complicated beast when deployed via the standard Helm chart, and looks something like this once it's running:

$ kubectl get pods --namespace=harbor
NAME                                 READY   STATUS    RESTARTS  AGE
harbor-core-6c57bb9c78-69w8r         1/1     Running   0         82m
harbor-database-0                    1/1     Running   0         82m
harbor-jobservice-78567ccbb7-gmhtn   1/1     Running   0         82m
harbor-portal-68bcb5dd4c-4xmkw       1/1     Running   0         82m
harbor-redis-0                       1/1     Running   0         82m
harbor-registry-7b8cbc8546-wwk6c     2/2     Running   0         82m
harbor-trivy-0                       1/1     Running   0         82m

In essence, a lot of moving parts that all integrate to provide a rather sweet registry with built-in image scanning.
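
For context, the standard chart lives in the official Harbor Helm repo, and a bare-bones install looks roughly like this (exposure, TLS and persistence values omitted, so adjust for your own setup):

$ helm repo add harbor https://helm.goharbor.io
$ helm repo update
$ helm install harbor harbor/harbor --namespace harbor --create-namespace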

The issue I was facing was that the harbor-core pod could not talk to the harbor-redis pod. To cut a long story short, the issue turned out to be related to firewalld on the new node, because certain necessary ports were not allowed. Specifically, ports 179/tcp (for BGP), 4789/udp and 8472/udp (for VXLAN), 30000-32767/tcp (for NodePort Services), and 10255/tcp and 10250/tcp (for the kubelet APIs) needed to be open. This was a configuration oversight and not an inherent issue with Harbor itself, i.e. a PEBCAK issue.
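
For reference, opening those ports on the new node with firewalld looks something like this (run on the node itself; a sketch rather than a copy of my exact setup):

$ sudo firewall-cmd --permanent --add-port=179/tcp
$ sudo firewall-cmd --permanent --add-port=4789/udp
$ sudo firewall-cmd --permanent --add-port=8472/udp
$ sudo firewall-cmd --permanent --add-port=10250/tcp
$ sudo firewall-cmd --permanent --add-port=10255/tcp
$ sudo firewall-cmd --permanent --add-port=30000-32767/tcp
$ sudo firewall-cmd --reload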

The problem

Okay, so I was faced with the problem:

Back-off restarting failed container core in pod harbor-core...

As we all know, the usual way to debug this is to check out the pod description and logs to see if there's anything obvious going on.

$ kubectl describe pod harbor-core-6c57bb9c78-69w8r
...
...
$ kubectl logs harbor-core-6c57bb9c78-69w8r
...
2024-01-19T13:30:51Z [ERROR] [/lib/cache/cache.go:124]: failed to ping redis://harbor-redis.harbor:6379/0?idle_timeout_seconds=60, retry after 723.423621ms : dial tcp: lookup harbor-redis.harbor: i/o timeout
...

Based on the error above, it's a connection issue between harbor-core and harbor-redis. Now the tricky part, as you probably know, is how to debug a connection issue when there are zero debug tools in 99% of images, i.e. no nslookup, ping, curl, wget, redis-cli, etc.

How on earth to debug this?!?!?!

The solution

Well, after using Docker for the past 5+ years, I've built up a set of tools for debugging stuff like this...

Actually, finding the issue was reasonably straightforward and required me to deploy a new Docker image into the namespace alongside the harbor-core and harbor-redis pods, i.e. a debugger image that contains tools such as ping, curl, telnet, and more importantly redis-cli and nslookup / dig.

Historically, I leveraged an image for all things networky called tutum/dnsutils, which is basically an old, unmaintained Docker image that contains dnsutils (nslookup and a few others).

It was at this point I decided to build a new and improved debug image that I can use in a variety of different scenarios, one that essentially contains a toolkit consisting of the following (a rough Dockerfile sketch follows the list):

  • nslookup / dig - used for validating DNS settings
  • mysql-client - used to connect to MySQL instances
  • redis-cli - used to connect to Redis instances
  • curl / wget - used to test http(s) connectivity
  • telnet - general tool for testing interconnectivity
  • plus a few other useful tools
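
For illustration, a minimal Dockerfile along those lines might look like the following. This is a sketch rather than the actual Dockerfile from the repo, using Alpine package names for the tools above:

FROM alpine:3.19

# bind-tools     -> nslookup / dig
# mysql-client   -> mysql command line client
# redis          -> redis-cli
# busybox-extras -> telnet
RUN apk add --no-cache bind-tools mysql-client redis curl wget busybox-extras

# keep the container alive so it can be exec'd into
CMD ["sleep", "infinity"]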

After about 10 mins I had an image built and pushed to https://hub.docker.com/r/gizzmoasus/debugger that I could pull into my harbor namespace to run some simple debug tests, leveraging the following simple pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: debugger
spec:
  containers:
  - name: debugger
    image: gizzmoasus/debugger:latest
    command:
      - sleep
      - "infinity"
    imagePullPolicy: Always
  restartPolicy: Always
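
Assuming the manifest is saved as debugger.yaml (the filename is just for illustration), it gets applied into the same namespace as Harbor:

$ kubectl apply -f debugger.yaml --namespace=harbor

After which the pod shows up alongside the Harbor pods:
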
$ kubectl get pods --namespace=harbor
NAME                                 READY   STATUS    RESTARTS  AGE
debugger                             1/1     Running   0         121m
harbor-core-6c57bb9c78-69w8r         1/1     Running   0         196m
harbor-database-0                    1/1     Running   0         196m
harbor-jobservice-78567ccbb7-gmhtn   1/1     Running   0         196m
harbor-portal-68bcb5dd4c-4xmkw       1/1     Running   0         196m
harbor-redis-0                       1/1     Running   0         196m
harbor-registry-7b8cbc8546-wwk6c     2/2     Running   0         196m
harbor-trivy-0                       1/1     Running   0         196m

Now that I had the debugger image running in the same namespace as the harbor pods, I could begin the task of figuring out what the problem really was...

Funnily enough, the first command I ran allowed me to identify that the issue was related to DNS ...

$ kubectl exec -it debugger -- nslookup harbor-redis
;; connection timed out; no servers could be reached

command terminated with exit code 1

This is a huuuuge red flag that points to a failure in cluster DNS, so I went to the one place you typically go for stuff like this ... the Kubernetes docs.

The downside to these docs is that they run through the happy path, i.e. look at what you should see (flashbacks to Bullseye ... look at what you could have won).
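
For what it's worth, the first checks those docs suggest can be run straight from the debugger pod, e.g. confirming which resolver the pod is actually using and whether cluster DNS answers for a known-good name:

$ kubectl exec -it debugger -- cat /etc/resolv.conf
$ kubectl exec -it debugger -- nslookup kubernetes.default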

Sooooo the issue is related to DNS, so I switched namespaces and checked that the DNS pods/services were up and running healthily:

$ kubectl get pods,svc --namespace=kube-system | grep dns
NAME                                      READY   STATUS    RESTARTS       AGE
coredns-76f75df574-6wr99                  1/1     Running   0              74h4m
coredns-76f75df574-mjsrh                  1/1     Running   0              74h4m
coredns-76f75df574-qfxp9                  1/1     Running   0              74h4m
coredns-76f75df574-qsqqg                  1/1     Running   0              74h4m

NAME                                                   TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                        AGE
service/kube-dns                                       ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP         17d

Looks like the coredns pods and services are running fine and there is indeed an IP allocated to the service.
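
Another quick check at this point is whether CoreDNS itself is logging errors; the pods carry the standard k8s-app=kube-dns label, so a label selector does the trick:

$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns --tail=50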

Anyway, long story short, I noticed that DNS was working fine on the other nodes and determined it was a firewall issue on the new node I added over the weekend (essentially, by disabling firewalld and refreshing the pod, everything worked fine).
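
The quick-and-dirty confirmation looked roughly like this; the proper fix is opening the ports listed back in the background section rather than leaving firewalld off:

# on the new node
$ sudo systemctl stop firewalld

# back on the workstation, delete the pod so the deployment recreates it
$ kubectl delete pod harbor-core-6c57bb9c78-69w8r --namespace=harbor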

The point of the tale is that having a solid approach and a solid toolchain for debugging is important when it comes to running apps and services in Kubernetes clusters.
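
As an aside, on clusters where ephemeral containers are available, kubectl debug can give you similar tooling without deploying a separate pod, by attaching a debug container that targets the broken one (shown here against the harbor-core pod from earlier):

$ kubectl debug -it harbor-core-6c57bb9c78-69w8r --image=gizzmoasus/debugger:latest --target=core --namespace=harbor

That said, a purpose-built debugger pod keeps the toolkit in one place and works on older clusters too.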

Feel free to check out the Docker Hub repository and suggest additional tools that would be useful to add to this image over at GitHub - GizzmoAsus/debugger: a simple Alpine image with a number of useful tools for debugging.

To use the image, simply check out https://hub.docker.com/r/gizzmoasus/debugger.
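
Or spin it up as a one-off pod without writing any YAML (equivalent to the manifest above):

$ kubectl run debugger --image=gizzmoasus/debugger:latest --restart=Never --command -- sleep infinity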