Jakub Jarosz

Security, Systems & Network Automation

How to prevent panics in Go

2024-12-04 Go

Imagine you are sitting relaxed in a self-driving car. You are taking a sip of your afternoon tea. You are going to your friend’s wedding.

You are getting closer to a crossroad, and your car is about to change direction. Suddenly, the car detects an obstacle on the road—a fallen tree!

How can you not panic?


This is the second of a three-part series about how a missing Go test in the NGINX Ingress Controller puts thousands of Kubernetes clusters at risk.

  1. How to break production on Black Friday.
  2. How to prevent panics in Go.

A lot happened last week! You pushed a small change to the Kubernetes cluster. It broke the production environment. You became “famous” in your company, but you quickly fixed the issue. Now it’s time to analyse the root cause! Welcome to the post-mortem!

Today we will investigate what happened and learn how to avoid panics in the future. Ready? Let’s dive in!

In the first part of the story, you learned the hard way that missing data is very dangerous, especially when changing configuration on production servers.

So, now, when all emotions have settled down, it’s time to reflect on what happened. Think about this as a personal post-mortem.

Creating a policy

We know that one of the applications in your cluster requires authentication using API keys. To configure the authentication, you create the Policy object, for example:

apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
      header:
      - "X-API-Key"
    clientSecret: api-key-client-secret

You specify the policy type (apiKey), the place for the API key (header or query), and the Secret’s name (clientSecret). You create this object (kubectl apply) in the cluster and link it with your application. From this moment, your application is ready to accept authenticated requests.

To investigate the root problem, it will help to see how the Policy object looks in the JSON format. Let’s find out.

You run the kubectl CLI tool, pipe its output to jq, and filter the spec section.

kubectl get policies.k8s.nginx.org api-key-policy -ojson | jq .spec
{
  "apiKey": {
    "clientSecret": "api-key-client-secret",
    "suppliedIn": {
      "header": [
        "X-API-Key"
      ]
    }
  }
}

Validation

Uploading the Policy object to the cluster triggers the validation process. It includes running a few Go functions that check if the uploaded Policy is valid. The functions are part of the NGINX Ingress Controller binary. They live in the validation package.

But what does it mean that the policy is valid?

As we said, the app in your production cluster accepts only authenticated requests. When you create the policy, you specify where in the request the value of the API key is. You can place it in:

  • the request header
  • the URI, as a parameter
  • both the header and the URI

The policy is valid when the field suppliedIn has a correct value. The value, as we already know, must be either the header or query, or both. For example, the following policy is valid. It has values for clientSecret and the suppliedIn field.

{
  "apiKey": {
    "clientSecret": "api-key-client-secret",
    "suppliedIn": {
      "header": [
        "X-API-Key"
      ]
    }
  }
}

The crucial question is how the validation function works.

You must unmarshal the JSON object listed above into a Go data structure, a struct. Then, the validation function takes the struct. It runs the business logic and decides whether to accept the policy or return an error.

The 10,000-foot view of this process looks like this:

  1. You upload the policy object to the Kubernetes cluster.
  2. The NGINX Ingress Controller detects the new object and starts the validation process.
  3. The system converts the Policy object from JSON to a Go struct.
  4. The Go validation functions take the struct and check if the struct is valid or not.

When the policy is valid

If all goes to plan, the policy is valid and you see no errors or warnings in the Ingress Controller logs. Happy days! You can sit back, relax, and enjoy your tea or coffee break.

When things go south

What happens when the policy is invalid? Well, what would you expect in this case?

Adding the invalid policy should not crash the Ingress Controller! You should get a notification about an error. But not a PagerDuty “Production DOWN” alarm on your phone!

The invalid policy you applied looks like this:

apiVersion: k8s.nginx.org/v1
kind: Policy
metadata:
  name: api-key-policy
spec:
  apiKey:
    suppliedIn:
    clientSecret: api-key-client-secret

and in the JSON, like this:

{
  "apiKey": {
    "clientSecret": "api-key-client-secret"
  }
}

As you learned, this JSON object is unmarshalled into a Go struct. Let’s see how structs look and write a short experiment to illustrate the process.

package main

import (
	"encoding/json"
	"fmt"
)

type APIKey struct {
	SuppliedIn   *SuppliedIn `json:"suppliedIn"`
	ClientSecret string      `json:"clientSecret"`
}

type SuppliedIn struct {
	Header []string `json:"header"`
	Query  []string `json:"query"`
}

func main() {
	// apikey represents the invalid key uploaded to the cluster.
	// note missing the `suppliedIn` field!
	apikey := `{"clientSecret":"secret-name"}`

	var key APIKey
	err := json.Unmarshal([]byte(apikey), &key)
	if err != nil {
		panic(err)
	}

	fmt.Printf("%+v\n", key)
}

Let’s run it and see what your API key looks like.

go run policy.go
{SuppliedIn:<nil> ClientSecret:secret-name}

The object is missing the suppliedIn field! json.Unmarshal leaves fields that are absent from the JSON at their zero value, so the SuppliedIn pointer is nil.
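To see why a nil SuppliedIn matters, here is a minimal sketch that extends the experiment. The helpers headerCount and tryHeaderCount are hypothetical names made up for this illustration; they are not part of the NGINX Ingress Controller. The first one reads a field through the pointer without checking it, and the second one uses recover so the demo can print the panic instead of crashing.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// APIKey and SuppliedIn mirror the structs from the experiment above.
type APIKey struct {
	SuppliedIn   *SuppliedIn `json:"suppliedIn"`
	ClientSecret string      `json:"clientSecret"`
}

type SuppliedIn struct {
	Header []string `json:"header"`
	Query  []string `json:"query"`
}

// headerCount is a hypothetical helper. It reads a field through the
// SuppliedIn pointer without checking whether the pointer is nil.
func headerCount(key APIKey) int {
	return len(key.SuppliedIn.Header)
}

// tryHeaderCount calls headerCount and turns a panic into an error,
// so the demo can print the message instead of crashing.
func tryHeaderCount(key APIKey) (n int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return headerCount(key), nil
}

func main() {
	var key APIKey
	if err := json.Unmarshal([]byte(`{"clientSecret":"secret-name"}`), &key); err != nil {
		panic(err)
	}

	fmt.Println(key.SuppliedIn == nil) // the pointer was never set

	_, err := tryHeaderCount(key)
	fmt.Println(err)
}
```

Running it should print true, followed by the same "invalid memory address or nil pointer dereference" message you are about to meet in the stack trace.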

But why does the invalid object crash the NGINX Ingress Controller? Let’s find out!

Finding the needle in the (hay)stack

After the Ingress Controller panics, you can see the following stack trace.

E1119 05:49:06.668128       1 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
        goroutine 957 [running]:
        k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2522ec8, 0x39c3d20}, {0x1c76a40, 0x3957b40})
                k8s.io/apimachinery@<version>/pkg/util/runtime/runtime.go:107 +0x98
        k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x2522ec8, 0x39c3d20}, {0x1c76a40, 0x3957b40}, {0x39c3d20, 0x0, 0x4000b93598?})
                k8s.io/apimachinery@<version>/pkg/util/runtime/runtime.go:82 +0x60
        k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4000d4e700?})
                k8s.io/apimachinery@<version>/pkg/util/runtime/runtime.go:59 +0x114
        panic({0x1c76a40?, 0x3957b40?})
                runtime/panic.go:785 +0x124
        github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.validateAPIKey(0x40005c4588, 0x4000900fc0)
                github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:297 +0x2c
        github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.validatePolicySpec(0x4000c285e8, 0x4000900f90, 0x0, 0x0, 0x0)
                github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:76 +0x814
        github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.ValidatePolicy(0x4000c284e0, 0x0, 0x0, 0x0)
                github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:19 +0x74
        github.com/nginxinc/kubernetes-ingress/internal/k8s.(*LoadBalancerController).syncPolicy(0x400016c288, {0x16?, {0x4000b18390?, 0x4000102f08?}})
                github.com/nginxinc/kubernetes-ingress/internal/k8s/policy.go:74 +0x180
        github.com/nginxinc/kubernetes-ingress/internal/k8s.(*LoadBalancerController).sync(0x400016c288, {0x1b1c1e0?, {0x4000b18390?, 0x0?}})
                github.com/nginxinc/kubernetes-ingress/internal/k8s/controller.go:959 +0x478
...
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xa1fc4c]

Well, that’s a lot of cryptic information! At least for now.

Your job is to digest it and find information relevant to your case. How can you do this?

Let’s check the first line. It reports that a panic occurred and what caused it.

panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\""

The Go runtime tells you it cannot continue running the application. The safest, and only possible, course of action is to crash the program and exit. In other words, the program has a problem and cannot continue to run. It tells you: sorry, I do not know what to do. I am done!

Is it bad or is it good? Well, it depends!

Imagine you are sitting relaxed in a self-driving car. You are taking a sip of your afternoon tea. You are going to your friend’s wedding.

You are getting closer to a crossroad, and your car is about to change direction. Suddenly, the car detects an obstacle on the road - a fallen tree! The car can:

  • Stop before turning. This will stop the journey and you will be late for the party. But you and the passengers will be safe. The only way to continue your trip would be to start the car again and change the route.

  • Continue driving. The car will turn and try to run over the fallen tree, hoping for the best. You could get hurt and the car might get damaged. Depending on the condition of the car, the journey might or might not continue!

So, what is the outcome of the two scenarios?

In the first one, the top priority is the passengers and the car. The car stops (panics). It doesn’t care if you are late to the party, as the primary concern is your safety.

The second option is far more dangerous. If the car doesn’t stop, it may hurt the passengers. Driving on in a damaged car can cause further destruction. Even if the car gets all passengers to the party, it may not be in good shape. The same goes for the passengers.

The Go runtime acts like the self-driving car in the first scenario. It prioritises safety and prevents possible loss of data integrity and other unpredictable damage.

Now, let’s see what the runtime tells you and check two lines of the stack trace.

github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation.validateAPIKey(0x40005c4588, 0x4000900fc0)
github.com/nginxinc/kubernetes-ingress/pkg/apis/configuration/validation/policy.go:297 +0x2c

What can you learn from this part? You see the first clue! On the first line, you see the module path: github.com/nginxinc/kubernetes-ingress. Then comes the path to the package inside that module: pkg/apis/configuration/validation, and the function name: validateAPIKey. The second line tells you the file name and the line number where the problem is: policy.go:297.

The Go runtime points to the file and the line along with the package name. It tells you: hey, this is a good starting point for your investigation.
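If you are curious where those function and file:line pairs come from, the standard runtime package exposes the same information. The sketch below is only an illustration: frame, currentFrame, and validateSomething are hypothetical names invented for this example, and the output format merely imitates the two-line layout of a stack trace entry.

```go
package main

import (
	"fmt"
	"runtime"
)

// frame holds the same three facts a stack trace prints for each call.
type frame struct {
	Function string // package-qualified function name
	File     string // source file path
	Line     int    // line number inside the file
}

// currentFrame reports the location of the function that called it.
func currentFrame() frame {
	pcs := make([]uintptr, 8)
	// Skip two frames: runtime.Callers itself and currentFrame.
	n := runtime.Callers(2, pcs)
	if n == 0 {
		return frame{}
	}
	fr, _ := runtime.CallersFrames(pcs[:n]).Next()
	return frame{Function: fr.Function, File: fr.File, Line: fr.Line}
}

// validateSomething is a hypothetical stand-in for a validation function.
func validateSomething() frame {
	return currentFrame()
}

func main() {
	f := validateSomething()
	// Print the frame in the two-line style used by stack traces.
	fmt.Println(f.Function)
	fmt.Printf("\t%s:%d\n", f.File, f.Line)
}
```

The first printed line names the package and function, and the second names the file and line, just like the validateAPIKey entry above.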

What does this function look like? That’s a lot to unpack!

func validateAPIKey(apiKey *v1.APIKey, fieldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}

	if apiKey.SuppliedIn.Query == nil && apiKey.SuppliedIn.Header == nil {
		msg := "at least one query or header name must be provided"
		allErrs = append(allErrs, field.Required(fieldPath.Child("SuppliedIn"), msg))
	}

	if apiKey.SuppliedIn.Header != nil {
		for _, header := range apiKey.SuppliedIn.Header {
			for _, msg := range validation.IsHTTPHeaderName(header) {
				allErrs = append(allErrs, field.Invalid(fieldPath.Child("suppliedIn.header"), header, msg))
			}
		}
	}

	if apiKey.SuppliedIn.Query != nil {
		for _, query := range apiKey.SuppliedIn.Query {
			if err := ValidateEscapedString(query); err != nil {
				allErrs = append(allErrs, field.Invalid(fieldPath.Child("suppliedIn.query"), query, err.Error()))
			}
		}
	}

	if apiKey.ClientSecret == "" {
		allErrs = append(allErrs, field.Required(fieldPath.Child("clientSecret"), ""))
	}

	allErrs = append(allErrs, validateSecretName(apiKey.ClientSecret, fieldPath.Child("clientSecret"))...)

	return allErrs
}

Let’s clean up the function, remove irrelevant code, and read with care. We are about to discover the problem responsible for breaking your production system!

func validateAPIKey(apiKey *v1.APIKey, fieldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}

	if apiKey.SuppliedIn.Query == nil && apiKey.SuppliedIn.Header == nil {
		// handle errors
	}

	if apiKey.SuppliedIn.Header != nil {
		// handle header validation error
	}

	if apiKey.SuppliedIn.Query != nil {
		// handle query validation error
	}

	if apiKey.ClientSecret == "" {
		// handle client secret validation error
	}
	allErrs = append(allErrs, validateSecretName(apiKey.ClientSecret, fieldPath.Child("clientSecret"))...)
	return allErrs
}

The listing presents trimmed-down validator code. The comments mark the checkpoints that validate specific rules.

What is the rule we check first?

Let’s zoom in and shed some light on the most important line.

func validateAPIKey(apiKey *v1.APIKey, fieldPath *field.Path) field.ErrorList {
	// ...
	if apiKey.SuppliedIn.Query == nil && apiKey.SuppliedIn.Header == nil {
		// ...
	}
	// ...
}

The function asks whether Query and Header are both missing (nil) at the same time. If neither exists, the policy is not valid. End of story!

The validation function assumes that the SuppliedIn struct is not nil, and must have values for either Query or Header.

Assumptions are dangerous things to make, and like all dangerous things to make – bombs, for instance, or strawberry shortcake – if you make even the tiniest mistake you can find yourself in terrible trouble.

– Lemony Snicket, The Austere Academy.

At this moment, you may see the final discovery! The function relies on an assumption and doesn’t check whether the required SuppliedIn field exists. It goes straight to checking the values of the fields Query and Header. The consequence? The Go runtime panics, because it’s not possible to read a field through a nil pointer! What about a unit test, you ask. Well, there is none.
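One common way to remove such an assumption is to check the pointer before touching its fields. The sketch below is not the Ingress Controller’s code and not necessarily how the project fixed it. validateAPIKeySketch is a hypothetical, simplified validator that uses only the standard library and returns a plain error instead of a field.ErrorList.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified copies of the structs used in the experiment.
type SuppliedIn struct {
	Header []string
	Query  []string
}

type APIKey struct {
	SuppliedIn   *SuppliedIn
	ClientSecret string
}

// validateAPIKeySketch is a hypothetical validator. It checks that the
// pointers exist before reading any field through them.
func validateAPIKeySketch(key *APIKey) error {
	if key == nil {
		return errors.New("apiKey is required")
	}
	// Guard first: do not read SuppliedIn's fields until we know it exists.
	if key.SuppliedIn == nil {
		return errors.New("suppliedIn is required")
	}
	if key.SuppliedIn.Query == nil && key.SuppliedIn.Header == nil {
		return errors.New("at least one query or header name must be provided")
	}
	if key.ClientSecret == "" {
		return errors.New("clientSecret is required")
	}
	return nil
}

func main() {
	// The invalid policy from the incident: suppliedIn is missing.
	invalid := &APIKey{ClientSecret: "api-key-client-secret"}
	fmt.Println(validateAPIKeySketch(invalid))

	valid := &APIKey{
		SuppliedIn:   &SuppliedIn{Header: []string{"X-API-Key"}},
		ClientSecret: "api-key-client-secret",
	}
	fmt.Println(validateAPIKeySketch(valid))
}
```

With the guard in place, the invalid policy produces an ordinary error ("suppliedIn is required") instead of a panic, and the valid one returns nil.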

What can we do about this terrible and dangerous situation? This is the work we will do next week! We will roll up our sleeves and design the validator with a Test-First mindset. This is the opposite of the Test Last Development method. Sadly, TLD is quite popular too. Maybe even in your company?

Until next time!