How to Break Production on Black Friday
Users won’t do this! No way! Why would they enter such values? It doesn’t make sense! Does it sound familiar? How many times have you said it to yourself?
Well… You are about to see and learn how a missing test puts thousands of production Kubernetes systems in danger.
Black Friday and Cyber Monday are approaching. Preparations for this time of the year are underway. Business and tech departments are working together to boost your company’s sales. System admins, site reliability engineers (SREs), and developers are on high alert.
Everyone watches dashboards and thinks about the PagerDuty app on their phones. Be quiet, be quiet.
So far, so good. Traffic to services is growing, sales are growing. Infrastructure and business services are performing as predicted. Your Slack channels are full of “Congrats to the team” and “excellent job, guys” messages. Happy days!
Meanwhile, your boss, encouraged by the higher sales, decides to launch a few more web services to boost sales further. After all, the engineers have tested and verified the new software and didn’t find any issues. You are confident that the team can go ahead.
Green light to deploy
You have got the green light to deploy services! 3, 2, 1, start!
You run production and test infrastructure in many Kubernetes clusters. You know that the new services you are about to deploy require API keys and secrets for authentication.
Your cluster handles all incoming traffic using the Kubernetes Ingress Controller. The controller is the gateway. It’s the “brain” that accepts web requests and redirects them to services. The services are your company’s business applications.
You rely on the popular NGINX Ingress Controller to handle the traffic. It’s powered by the most popular web server and load balancer in the world – NGINX.
You use custom resource definitions (CRDs) to model virtual servers and routing. You also use CRDs to model and define API keys and secrets required for a few existing and new web services.
Deployment (disaster) time
Most of your Kubernetes application deployments are managed by automated processes. After all, you are a big fan of GitOps for managing cloud-native apps and the cloud infrastructure.
But this time you decide to configure the API credentials yourself and automate the process later. After all, it is “just” a few “small” Kubernetes objects.
All it takes to deploy the new objects is:
- Create and deploy a policy object.
- Create and deploy API keys and certs.
- Deploy your new services and VirtualServers.
You create a Secret
apiVersion: v1
kind: Secret
metadata:
  name: api-key-client-secret
type: nginx.org/apikey
data:
  client1: cGFzc3dvcmQ= # password
  client2: YW5vdGhlci1wYXNzd29yZA== # another-password
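Values under data in a Kubernetes Secret are base64-encoded, not encrypted. As a quick sanity check, the following Go sketch decodes the two values from the Secret above. The helper name decodeSecretData is made up for this illustration; it is not part of any NGINX or Kubernetes API.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeSecretData is a hypothetical helper for this post. It decodes every
// base64 value of a Secret's data map and returns the plain-text values.
func decodeSecretData(data map[string]string) (map[string]string, error) {
	out := make(map[string]string, len(data))
	for name, enc := range data {
		raw, err := base64.StdEncoding.DecodeString(enc)
		if err != nil {
			return nil, fmt.Errorf("decode %q: %w", name, err)
		}
		out[name] = string(raw)
	}
	return out, nil
}

func main() {
	// The same keys and values as in the Secret manifest above.
	data := map[string]string{
		"client1": "cGFzc3dvcmQ=",
		"client2": "YW5vdGhlci1wYXNzd29yZA==",
	}
	decoded, err := decodeSecretData(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("client1:", decoded["client1"])
	fmt.Println("client2:", decoded["client2"])
}
```

Anyone who can read the Secret can recover the keys this way, which is one more reason to keep access to Secrets tight.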
Next, you model the Policy and associate it with the Secret.
apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
      header:
      - "X-API"
    clientSecret: api-key-client-secret
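From the client's point of view, "supplied in a header" means every request carries the key in the X-API header. The Go sketch below illustrates that idea only. The fakeGateway handler is a hypothetical stand-in written for this post; it does not reproduce how NGINX checks keys, and the status codes it returns are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// fakeGateway is a hypothetical stand-in for the gateway. It accepts a
// request only when the X-API header matches validKey. The 401 response
// is an assumption made for this sketch.
func fakeGateway(validKey string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-API") != validKey {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

// statusFor sends one in-memory request with the given key in the X-API
// header (no header when key is empty) and returns the response status.
func statusFor(h http.Handler, key string) int {
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if key != "" {
		req.Header.Set("X-API", key)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	gw := fakeGateway("password")
	for _, key := range []string{"password", "wrong-key", ""} {
		fmt.Printf("key %q -> HTTP %d\n", key, statusFor(gw, key))
	}
}
```

The header name is the only detail taken from the Policy above; everything else in the sketch is illustrative.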
At this stage, all goes to plan. You check that five replicas of the NGINX Ingress Controller are operational. All looks safe.
kubectl get pods -n nginx-ingress
NAME READY STATUS RESTARTS AGE
nginx-ingress-74cfd9c5cf-7j8nb 1/1 Running 0 2d12h
nginx-ingress-74cfd9c5cf-8hhlv 1/1 Running 0 2d12h
nginx-ingress-74cfd9c5cf-bbvjb 1/1 Running 0 2d12h
nginx-ingress-74cfd9c5cf-k949q 1/1 Running 0 31s
nginx-ingress-74cfd9c5cf-sp2jr 1/1 Running 0 31s
You check your Grafana dashboard. There are no problems with CPU or memory.
Next, you deploy the Secret
kubectl apply -f api-key-secret.yaml
and the Policy
kubectl apply -f api-key-policy.yaml
One more check that the objects are present in the cluster, and you are ready for the next step.
kubectl get secrets api-key-client-secret
NAME TYPE DATA AGE
api-key-client-secret nginx.org/apikey 2 3d11h
kubectl get policies.k8s.nginx.org api-key-policy
NAME STATE AGE
api-key-policy Valid 138m
Policy is valid. Secret is present. All components are in place. Clusters are operational. Production systems are healthy. Business as usual.
Now it’s time to deploy services and VirtualServers. But you forgot to apply one more policy. So, you edit the policy YAML file:
apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
    clientSecret: api-key-client-secret
You save the file and type in the terminal:
kubectl apply -f api-key-policy.yaml
What the heck?
All of a sudden, you hear and feel your mobile phone. Bzzzzz, Brrrr. All your teammates look at their phones almost at the same time.
It’s the PagerDuty alarm! “Production is DOWN.”
What the heck, you scream! I bet the expression is more like WTF or rather FFS or some equivalent in your mother tongue.
You check the dashboards. Most of them scream “FAILURE” in red. You type in the terminal:
kubectl get pods -n nginx-ingress
and you see
NAME READY STATUS RESTARTS AGE
nginx-ingress-74cfd9c5cf-7j8nb 0/1 Error 1 (3s ago) 2d13h
nginx-ingress-74cfd9c5cf-8hhlv 0/1 Error 1 (3s ago) 2d13h
nginx-ingress-74cfd9c5cf-bbvjb 0/1 Error 1 (3s ago) 2d13h
nginx-ingress-74cfd9c5cf-k949q 0/1 Error 1 (3s ago) 30m
nginx-ingress-74cfd9c5cf-sp2jr 0/1 Error 1 (3s ago) 30m
You check one more time.
kubectl get pods -n nginx-ingress
and you see
NAME READY STATUS RESTARTS AGE
nginx-ingress-74cfd9c5cf-7j8nb 0/1 CrashLoopBackOff 1 (8s ago) 2d13h
nginx-ingress-74cfd9c5cf-8hhlv 0/1 CrashLoopBackOff 1 (8s ago) 2d13h
nginx-ingress-74cfd9c5cf-bbvjb 0/1 CrashLoopBackOff 1 (8s ago) 2d13h
nginx-ingress-74cfd9c5cf-k949q 0/1 CrashLoopBackOff 1 (8s ago) 30m
nginx-ingress-74cfd9c5cf-sp2jr 0/1 CrashLoopBackOff 1 (8s ago) 30m
Someone from your team shouts “Check the logs!”. The results are terrifying.
E1119 05:49:06.668128 1 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
goroutine 957 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2522ec8, 0x39c3d20}, {0x1c76a40, 0x3957b40})
k8s.io/[email protected]/pkg/util/runtime/runtime.go:107 +0x98
k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x2522ec8, 0x39c3d20}, {0x1c76a40, 0x3957b40}, {0x39c3d20, 0x0, 0x4000b93598?})
k8s.io/[email protected]/pkg/util/runtime/runtime.go:82 +0x60
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4000d4e700?})
k8s.io/[email protected]/pkg/util/runtime/runtime.go:59 +0x114
panic({0x1c76a40?, 0x3957b40?})
runtime/panic.go:785 +0x124
github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.validateAPIKey(0x40005c4588, 0x4000900fc0)
github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:297 +0x2c
github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.validatePolicySpec(0x4000c285e8, 0x4000900f90, 0x0, 0x0, 0x0)
github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:76 +0x814
github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.ValidatePolicy(0x4000c284e0, 0x0, 0x0, 0x0)
github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:19 +0x74
github.com/nginxinc/kubernetes-ingress/internal/k8s.(*LoadBalancerController).syncPolicy(0x400016c288, {0x16?, {0x4000b18390?, 0x4000102f08?}})
github.com/nginxinc/kubernetes-ingress/internal/k8s/policy.go:74 +0x180
github.com/nginxinc/kubernetes-ingress/internal/k8s.(*LoadBalancerController).sync(0x400016c288, {0x1b1c1e0?, {0x4000b18390?, 0x0?}})
github.com/nginxinc/kubernetes-ingress/internal/k8s/controller.go:959 +0x478
...
>
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xa1fc4c]
All replicas are down. No traffic reaches production! The telephone rings - it’s the boss.
Don’t panic
Even if you see the panic in the stack trace, by all means, do not panic!
You check the last Policy you have applied:
apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
    clientSecret: api-key-client-secret
Aha! It’s the missing suppliedIn value! But why on Earth did the controllers crash? All ingress controllers? Shouldn’t the system reject the invalid Policy object?
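The stack trace points at a nil pointer dereference inside the validation code. One plausible mechanism, sketched below in Go, is a validator that reads fields through a pointer that is nil when the YAML key is left empty. The types APIKey and SuppliedIn and both validators are simplified, hypothetical stand-ins written for this post; they are not the controller's source code.

```go
package main

import (
	"errors"
	"fmt"
)

// SuppliedIn and APIKey are hypothetical, simplified types. The assumption
// is that an empty "suppliedIn:" key in YAML leaves the pointer nil.
type SuppliedIn struct {
	Header []string
	Query  []string
}

type APIKey struct {
	SuppliedIn   *SuppliedIn
	ClientSecret string
}

// validateUnsafe dereferences SuppliedIn without checking it for nil,
// so it panics when the field was omitted.
func validateUnsafe(k *APIKey) error {
	if len(k.SuppliedIn.Header) == 0 && len(k.SuppliedIn.Query) == 0 {
		return errors.New("suppliedIn: at least one header or query is required")
	}
	return nil
}

// validateSafe checks every pointer before use and returns an error
// instead of panicking.
func validateSafe(k *APIKey) error {
	if k == nil {
		return errors.New("apiKey is required")
	}
	if k.SuppliedIn == nil {
		return errors.New("suppliedIn is required")
	}
	if len(k.SuppliedIn.Header) == 0 && len(k.SuppliedIn.Query) == 0 {
		return errors.New("suppliedIn: at least one header or query is required")
	}
	return nil
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if r := recover(); r != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	// Mirrors the edited Policy: clientSecret is set, suppliedIn is empty.
	broken := &APIKey{ClientSecret: "api-key-client-secret"}
	fmt.Println("unsafe validator panics:", panics(func() { _ = validateUnsafe(broken) }))
	fmt.Println("safe validator returns:", validateSafe(broken))
}
```

In a long-running controller, an unrecovered panic like the one in validateUnsafe terminates the whole process. If every replica processes the same object, every replica goes down with it.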
You check in a hurry.
kubectl get policies.k8s.nginx.org api-key-policy
NAME STATE AGE
api-key-policy Valid 144m
The Policy state is Valid!
Once again, you check.
kubectl describe policies.k8s.nginx.org api-key-policy
...
Status:
Message: Policy default/api-key-policy was added or updated
Reason: AddedOrUpdated
State: Valid
Events: <none>
The system reports the State and Status as Valid! It’s evident that all your ingress controllers have CRASHED! What is going on?
You fix the Policy, add the missing value, and apply the corrected object.
After a few seconds, you see the controllers coming back to life.
kubectl get pods -n nginx-ingress
NAME READY STATUS RESTARTS AGE
nginx-ingress-74cfd9c5cf-7j8nb 1/1 Running 23 (5m11s ago) 2d23h
nginx-ingress-74cfd9c5cf-8hhlv 1/1 Running 23 (5m10s ago) 2d23h
nginx-ingress-74cfd9c5cf-bbvjb 0/1 CrashLoopBackOff 22 (5m8s ago) 2d23h
nginx-ingress-74cfd9c5cf-k949q 0/1 CrashLoopBackOff 27 (4m18s ago) 10h
nginx-ingress-74cfd9c5cf-sp2jr 1/1 Running 24 (5m8s ago) 10h
The production system recovers and accepts requests again. You saved the business.
But as a professional, you cannot believe what has happened. A small typo in an object that is not even associated with running services can shut the business down!
You know that
With great power comes great responsibility.
Well, as you have experienced, sometimes the great power brings great surprises. Surprises that in this case bring the production Kubernetes traffic down!
In How to Prevent Panics in Go, we analyse what happened and why the NGINX Ingress Controller crashed. With surgical precision, we dismantle the function that validates inputs.
Finally, in How to Write Better Tests in Go, we start fixing the issues using the test-first approach and the four-question framework.
If you manage Kubernetes clusters, be aware that this bug affects NGINX Ingress Controller v3.7.0 or earlier! You can reproduce the steps in your cluster. But, as they say:
Don’t try it at home!
“At work,” I should say!
Until next week!
Jakub