Jakub Jarosz

Security, Systems & Network Automation

How to break production on Black Friday

2024-11-22 Go

Users won’t do this! No way! Why would they enter such values? It doesn’t make sense! Does it sound familiar? How many times have you said it to yourself?

Well… You are about to see how a missing test can put thousands of production Kubernetes systems in danger.


Black Friday and Cyber Monday are approaching. Preparations for this time of the year are underway. Business and tech departments are working together to boost your company’s sales. System admins, site reliability engineers (SREs), and developers are on high alert.

Everyone watches the dashboards and thinks about the PagerDuty app on their phones. Stay quiet, stay quiet.

So far, so good. Traffic to the services is growing, and sales are growing. Infrastructure and business services are performing as predicted. Your Slack channels are full of “Congrats to the team” and “excellent job, guys” messages. Happy days!

Meanwhile, your boss, encouraged by the higher sales, decides to launch a few more web services to boost sales further. After all, the engineers have tested and verified the new software and found no issues. You are confident that the team can go ahead.

Green light to deploy

You have got the green light to deploy services! 3, 2, 1, start!

You run production and test infrastructure in many Kubernetes clusters. You know that the new services you are about to deploy require API keys and secrets for authentication.

Your cluster handles all incoming traffic using the Kubernetes Ingress Controller. The controller is the gateway. It’s the “brain” that accepts web requests and routes them to services. The services are your company’s business applications.

You rely on the popular NGINX Ingress Controller to handle the traffic. It’s powered by the most popular web server and load balancer in the world – NGINX.

You use custom resource definitions (CRDs) to model virtual servers and routing. You also use CRDs to model and define API keys and secrets required for a few existing and new web services.

Deployment (disaster) time

Most of your Kubernetes application deployments are managed by automated processes. After all, you are a big fan of GitOps for managing cloud-native apps and cloud infrastructure.

But this time you decide to configure the API credentials yourself and automate the process later. After all, it is “just” a few “small” Kubernetes objects.

All it takes to deploy the new objects is:

  • Create and deploy a policy object.
  • Create and deploy API keys and certs.
  • Deploy your new services and VirtualServers.

You create a Secret:

apiVersion: v1
kind: Secret
metadata:
  name: api-key-client-secret
type: nginx.org/apikey
data:
  client1: cGFzc3dvcmQ= # password
  client2: YW5vdGhlci1wYXNzd29yZA== # another-password
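
The values under data are base64-encoded. For example, the two credentials above can be produced with:

echo -n 'password' | base64
echo -n 'another-password' | base64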

Next, you model the Policy and associate it with the Secret.

apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
      header:
      - "X-API"
    clientSecret: api-key-client-secret
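
The third step on the list, deploying the services and VirtualServers, is where this Policy gets used. As a rough sketch, a VirtualServer referencing it could look like the following (the host name, upstream, and service below are illustrative, not the real deployment):

apiVersion: k8s.nginx.org/v1
kind: VirtualServer
metadata:
  name: webapp
spec:
  host: webapp.example.com
  policies:
  - name: api-key-policy
  upstreams:
  - name: webapp
    service: webapp-svc
    port: 80
  routes:
  - path: /
    action:
      pass: webapp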

At this stage, all goes to plan. You check that five replicas of the NGINX Ingress Controller are operational. All looks safe.

kubectl get pods -n nginx-ingress

NAME                             READY   STATUS    RESTARTS   AGE
nginx-ingress-74cfd9c5cf-7j8nb   1/1     Running   0          2d12h
nginx-ingress-74cfd9c5cf-8hhlv   1/1     Running   0          2d12h
nginx-ingress-74cfd9c5cf-bbvjb   1/1     Running   0          2d12h
nginx-ingress-74cfd9c5cf-k949q   1/1     Running   0          31s
nginx-ingress-74cfd9c5cf-sp2jr   1/1     Running   0          31s

You check your Grafana dashboard. There are no problems with CPU or memory.

Next, you deploy the Secret

kubectl apply -f api-key-secret.yaml

and the Policy

kubectl apply -f api-key-policy.yaml

One more check that the objects are present in the cluster, and you are ready for the next step.

kubectl get secrets api-key-client-secret

NAME                    TYPE               DATA   AGE
api-key-client-secret   nginx.org/apikey   2      3d11h

kubectl get policies.k8s.nginx.org api-key-policy

NAME             STATE   AGE
api-key-policy   Valid   138m

Policy is valid. Secret is present. All components are in place. Clusters are operational. Production systems are healthy. Business as usual.

Now it’s time to deploy the services and VirtualServers. But you realise you forgot to apply one more policy. So you edit the policy YAML file:

apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
    clientSecret: api-key-client-secret

You save the file and type in the terminal:

kubectl apply -f api-key-policy.yaml

What the heck?

All of a sudden, you hear and feel your mobile phone. Bzzzzz, Brrrr. All your teammates look at their phones almost at the same time.

It’s the PagerDuty alarm! “Production is DOWN.”

“What the heck!” you scream. I bet the actual expression is more like WTF, or rather FFS, or some equivalent in your mother tongue.

You check the dashboards. Most of them scream “FAILURE” in red. You type in the terminal:

kubectl get pods -n nginx-ingress

and you see

NAME                             READY   STATUS   RESTARTS     AGE
nginx-ingress-74cfd9c5cf-7j8nb   0/1     Error    1 (3s ago)   2d13h
nginx-ingress-74cfd9c5cf-8hhlv   0/1     Error    1 (3s ago)   2d13h
nginx-ingress-74cfd9c5cf-bbvjb   0/1     Error    1 (3s ago)   2d13h
nginx-ingress-74cfd9c5cf-k949q   0/1     Error    1 (3s ago)   30m
nginx-ingress-74cfd9c5cf-sp2jr   0/1     Error    1 (3s ago)   30m

You check one more time.

kubectl get pods -n nginx-ingress

and you see

NAME                             READY   STATUS             RESTARTS     AGE
nginx-ingress-74cfd9c5cf-7j8nb   0/1     CrashLoopBackOff   1 (8s ago)   2d13h
nginx-ingress-74cfd9c5cf-8hhlv   0/1     CrashLoopBackOff   1 (8s ago)   2d13h
nginx-ingress-74cfd9c5cf-bbvjb   0/1     CrashLoopBackOff   1 (8s ago)   2d13h
nginx-ingress-74cfd9c5cf-k949q   0/1     CrashLoopBackOff   1 (8s ago)   30m
nginx-ingress-74cfd9c5cf-sp2jr   0/1     CrashLoopBackOff   1 (8s ago)   30m

Someone from your team shouts, “Check the logs!” The results are terrifying.
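
You grab the logs from one of the crashing pods (any replica reports the same panic), for example:

kubectl logs -n nginx-ingress nginx-ingress-74cfd9c5cf-7j8nb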

E1119 05:49:06.668128       1 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
        goroutine 957 [running]:
        k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2522ec8, 0x39c3d20}, {0x1c76a40, 0x3957b40})
                k8s.io/apimachinery@<version>/pkg/util/runtime/runtime.go:107 +0x98
        k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x2522ec8, 0x39c3d20}, {0x1c76a40, 0x3957b40}, {0x39c3d20, 0x0, 0x4000b93598?})
                k8s.io/apimachinery@<version>/pkg/util/runtime/runtime.go:82 +0x60
        k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4000d4e700?})
                k8s.io/apimachinery@<version>/pkg/util/runtime/runtime.go:59 +0x114
        panic({0x1c76a40?, 0x3957b40?})
                runtime/panic.go:785 +0x124
        github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.validateAPIKey(0x40005c4588, 0x4000900fc0)
                github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:297 +0x2c
        github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.validatePolicySpec(0x4000c285e8, 0x4000900f90, 0x0, 0x0, 0x0)
                github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:76 +0x814
        github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.ValidatePolicy(0x4000c284e0, 0x0, 0x0, 0x0)
                github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:19 +0x74
        github.com/nginxinc/kubernetes-ingress/internal/k8s.(*LoadBalancerController).syncPolicy(0x400016c288, {0x16?, {0x4000b18390?, 0x4000102f08?}})
                github.com/nginxinc/kubernetes-ingress/internal/k8s/policy.go:74 +0x180
        github.com/nginxinc/kubernetes-ingress/internal/k8s.(*LoadBalancerController).sync(0x400016c288, {0x1b1c1e0?, {0x4000b18390?, 0x0?}})
                github.com/nginxinc/kubernetes-ingress/internal/k8s/controller.go:959 +0x478
...
 >
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xa1fc4c]

All replicas are down. No traffic reaches production! The telephone rings - it’s the boss.

Don’t panic

Even if you see the panic in the stack trace, by all means, do not panic!

You check the last Policy you have applied:

apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
    clientSecret: api-key-client-secret

Aha! It’s the missing suppliedIn value! But why on Earth did the controllers crash? All ingress controllers? Shouldn’t the system reject the invalid Policy object?

You check in a hurry.

kubectl get policies.k8s.nginx.org api-key-policy

NAME             STATE   AGE
api-key-policy   Valid   144m

Policy status is VALID!

Once again, you check.

kubectl describe policies.k8s.nginx.org api-key-policy

...
Status:
  Message:  Policy default/api-key-policy was added or updated
  Reason:   AddedOrUpdated
  State:    Valid
Events:     <none>

The system reports the Status and State as Valid! And yet, it’s evident that all your ingress controllers have CRASHED! What is going on?

You fix the Policy, add the missing value, and apply the corrected object.
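
The corrected Policy is simply the original one, with the suppliedIn header restored:

apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
      header:
      - "X-API"
    clientSecret: api-key-client-secret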

After a few seconds, you see the controllers coming back to life.

kubectl get pods -n nginx-ingress

NAME                             READY   STATUS             RESTARTS         AGE
nginx-ingress-74cfd9c5cf-7j8nb   1/1     Running            23 (5m11s ago)   2d23h
nginx-ingress-74cfd9c5cf-8hhlv   1/1     Running            23 (5m10s ago)   2d23h
nginx-ingress-74cfd9c5cf-bbvjb   0/1     CrashLoopBackOff   22 (5m8s ago)    2d23h
nginx-ingress-74cfd9c5cf-k949q   0/1     CrashLoopBackOff   27 (4m18s ago)   10h
nginx-ingress-74cfd9c5cf-sp2jr   1/1     Running            24 (5m8s ago)    10h

The production system recovers and accepts requests again. You saved the business.

But as a professional, you cannot believe what has happened. A small typo in an object that isn’t even associated with the running services can shut the business down!

You know that

With great power comes great responsibility.

Well, as you have just experienced, sometimes great power brings great surprises. Surprises that, in this case, bring production Kubernetes traffic down!

Next week we will analyse what happened and why the NGINX Ingress Controller crashed. We will, with surgical precision, dismantle the function that validates inputs. Finally, we will fix the issue using a test-first approach.
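
In the meantime, here is a deliberately simplified Go sketch of the general failure mode: an optional CRD block left empty in the YAML decodes to a nil pointer, and a validation function dereferences it without a nil check. The types and function below are illustrative stand-ins, not the controller’s actual code.

package main

import "fmt"

// Simplified stand-ins for the Policy CRD types. When the suppliedIn
// block is left empty in the YAML, the corresponding field ends up nil.
type SuppliedIn struct {
    Header []string
    Query  []string
}

type APIKey struct {
    SuppliedIn   *SuppliedIn
    ClientSecret string
}

// validateAPIKey sketches the bug: it reads SuppliedIn.Header and
// SuppliedIn.Query without first checking whether SuppliedIn is nil.
func validateAPIKey(k *APIKey) error {
    // Panics with "invalid memory address or nil pointer dereference"
    // when SuppliedIn is nil.
    if len(k.SuppliedIn.Header) == 0 && len(k.SuppliedIn.Query) == 0 {
        return fmt.Errorf("apiKey must be supplied in a header or a query param")
    }
    return nil
}

func main() {
    // The broken Policy from above: the suppliedIn block left empty,
    // so it decodes to a nil pointer.
    broken := &APIKey{SuppliedIn: nil, ClientSecret: "api-key-client-secret"}
    fmt.Println(validateAPIKey(broken))
}

A guarded version would check k.SuppliedIn for nil before touching its fields and return a validation error instead of panicking.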

If you manage Kubernetes clusters, be aware that this bug affects NGINX Ingress Controller v3.7.0 and earlier! You can reproduce the steps in your own cluster. But, as they say:

Don’t try it at home!

“At work,” I should say!

Until next week!

Jakub