Extended Error Use Case: calm

Go Error Handling - This article is part of a series.

Part 1: A Generic Go Custom Error Wrapper

Part 2: Extended Error Use Case: errclass

Part 3: Extended Error Use Case: errcontext

Part 4: Extended Error Use Case: stacktrace

Part 5: Extended Error Logging

Part 6: This Article

Use Case: General Panic Recovery

The built-in panic function often induces the very feeling it describes. It usually comes with a stacktrace, but not one as nice as the one we built earlier. Let’s make it better.

Problem Description

Like it or not, a panic can happen in your code. A third-party library might not like the input you provided; your code might reach a line you previously thought unreachable; or a developer might make an off-by-one error that ends in a slice index out-of-bounds panic.

If you’ve done your job right, when running real-world complicated services, a panic has roughly the same impact as a hardware failure. It’s something you should be able to handle, but it isn’t as well-tested a pathway. You might even find it extremely difficult to find the panic log after the fact.

Ideal Solution

Don’t panic. But since we can’t ensure that, the next best thing would be to recover from the panic at the right place in the program so we can gracefully shut down everything else.

Let’s suppose we have a function func f() error that might panic. Then in an ideal world we could just call:

err := calm.Unpanic(f)

Where if f didn’t panic, then err is the return value of f, and if it did panic, err is an error that contains all the information we need about the panic itself.

Introducing The `calm` Package

The canonical way to recover from a panic is the built-in recover function (like panic, it is a built-in function rather than a keyword). It seems straightforward, and it mostly is until it isn’t. See the Go by Example page for the short version. The most important point is that recover only works when it is called directly by a deferred function; anywhere else it returns nil and the panic keeps going. So if we follow along with the example, then in order to achieve our goal, we need something like this:

func Unpanic(f func() error) (err error) {
    defer func() {
        if r := recover(); r != nil {
            // do something with r and set err
        }
    }()
    return f()
}

Named return values are sometimes helpful when you have many return values. They are also required when you want a defer statement to alter a return value, as in this case.
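To see why the named result matters, here is a small sketch that uses only the standard library. The names unnamed, named, and mayPanic are made up for this illustration and are not part of calm.

```go
package main

import (
	"errors"
	"fmt"
)

// mayPanic stands in for any function that panics.
func mayPanic() {
	panic("boom")
}

// unnamed recovers from the panic, but its deferred function can only
// assign to a local variable. The actual result stays at its zero value.
func unnamed() error {
	var err error
	defer func() {
		if r := recover(); r != nil {
			err = errors.New("recovered") // too late: nothing reads err after this
		}
	}()
	mayPanic()
	return err // never reached when mayPanic panics
}

// named declares its result as err, so the deferred function writes
// directly to the value the caller receives.
func named() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	mayPanic()
	return nil // never reached when mayPanic panics
}

func main() {
	fmt.Println(unnamed()) // <nil>
	fmt.Println(named())   // recovered: boom
}
```

Both functions survive the panic, but only the one with a named result can tell its caller that anything went wrong.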

For the final error, we want to obtain a stack trace and leverage the errclass.Panic classification that we created previously. We’ll need to turn r into an error in order to actually use errclass and stacktrace.

fmt.Errorf("panic: %v", r)

Note that the full signatures are panic(v any) and recover() any, so r here has type any and could hold a value of any type. Using the %v verb is likely to get us the best results.

In practice, I’ve only seen anyone actually use a string for the panic value (although panics raised by the Go runtime itself, such as an out-of-range index, carry a runtime.Error, which does implement error). That said, it might be a better practice to pass an actual error, which as we have seen can carry much more interesting data. There’s a small problem with our %v verb here though: it only formats r as text, so if r is an error, the original value and everything attached to it are lost. So let’s check whether r is actually an error and, if it is, wrap it with %w instead:

if e, ok := r.(error); ok {
    err = fmt.Errorf("panic: %w", e)
} else {
    err = fmt.Errorf("panic: %v", r)
}
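The difference between the two branches is invisible in the message but shows up as soon as you inspect the error chain. The following is a sketch using only the standard library; toError, matchesSentinel, and errSentinel are hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errSentinel plays the role of an error value passed to panic.
var errSentinel = errors.New("sentinel")

// toError mirrors the branch above: wrap real errors with %w so the
// chain survives, and fall back to %v for everything else.
func toError(r any) error {
	if e, ok := r.(error); ok {
		return fmt.Errorf("panic: %w", e)
	}
	return fmt.Errorf("panic: %v", r)
}

// matchesSentinel reports whether errSentinel is still reachable
// through the error chain.
func matchesSentinel(err error) bool {
	return errors.Is(err, errSentinel)
}

func main() {
	wrapped := toError(errSentinel)           // error branch, uses %w
	flattened := toError(errSentinel.Error()) // string branch, uses %v

	fmt.Println(wrapped)                    // panic: sentinel
	fmt.Println(flattened)                  // panic: sentinel
	fmt.Println(matchesSentinel(wrapped))   // true
	fmt.Println(matchesSentinel(flattened)) // false
}
```

The same mechanism is what should let extensions attached to the original error remain reachable after the panic is recovered.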

Now adding a class and a stack trace is easy:

err = stacktrace.Wrap(err)
err = errclass.WrapAs(err, errclass.Panic)

Try It Out

Putting it all together, we should have:

func Unpanic(f func() error) (err error) {
    defer func() {
        if r := recover(); r != nil {
            if e, ok := r.(error); ok {
                err = fmt.Errorf("panic: %w", e)
            } else {
                err = fmt.Errorf("panic: %v", r)
            }
            err = stacktrace.Wrap(err)
            err = errclass.WrapAs(err, errclass.Panic)
        }
    }()
    return f()
}

And let’s try a complex example to properly show it in action:

var errTest = errors.New("something went wrong")

func c() error {
    err := errcontext.Add(errTest, slog.String("hello", "world"))
    panic(err)
}

func b() error {
    return c()
}

func a() error {
    return b()
}

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    if err := calm.Unpanic(a); err != nil {
        logger.Error("request failed", xerrors.Log(err))
    }
}

It works — including preserving the errcontext that was added. However, let’s look more closely at the funcs in the stacktrace:

{
    "func": "Unpanic.func1"
}
{
    "func": "main.c"
}
{
    "func": "main.b"
}
{
    "func": "main.a"
}
{
    "func": "Unpanic"
}
{
    "func": "main.main"
}

That top frame on our stack (Unpanic.func1) is from the deferred func inside Unpanic. That’s not where the panic happened (that was the next frame, in main.c); it is where the panic was recovered, at the exact line on which stacktrace.Wrap was called. You might recall a similar discussion about the runtime frames in the stacktrace article.

In stacktrace.Wrap we set skipFrames=3 in order to filter out the call to Wrap itself. In our case, we also want to filter out the call to Unpanic.func1. Ironically, in order to do this we need to call stacktrace.GetStack ourselves, but since we won’t be calling Wrap, we also need skipFrames=3. If we were still calling Wrap, we would have needed skipFrames=4.
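If the skip arithmetic feels slippery, it may help to see it in isolation. The sketch below assumes the skip count is handed straight to runtime.Callers, which may not match how stacktrace.GetStack counts internally; stackNames and wrapper are hypothetical helpers written only for this illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// stackNames is a hypothetical stand-in for a GetStack-style helper. It
// passes skip straight to runtime.Callers, where 0 means runtime.Callers
// itself, 1 means stackNames, 2 means the caller of stackNames, and so on.
//
//go:noinline
func stackNames(skip int) []string {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(skip, pcs)
	if n == 0 {
		return nil
	}
	frames := runtime.CallersFrames(pcs[:n])
	var names []string
	for {
		frame, more := frames.Next()
		names = append(names, frame.Function)
		if !more {
			return names
		}
	}
}

// wrapper adds one extra frame between the caller and stackNames, much
// like a Wrap function sitting between user code and GetStack.
//
//go:noinline
func wrapper(skip int) []string {
	return stackNames(skip)
}

func main() {
	fmt.Println(stackNames(2)[0]) // main.main
	fmt.Println(wrapper(2)[0])    // main.wrapper
	fmt.Println(wrapper(3)[0])    // main.main
}
```

Every extra function between the interesting code and the stack capture costs one more skipped frame, and removing a layer gives one back.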

With that in mind, here’s the fix for our issue:

const panicStackDepth = 3

func Unpanic(f func() error) (err error) {
    defer func() {
        if r := recover(); r != nil {
            if e, ok := r.(error); ok {
                err = fmt.Errorf("panic: %w", e)
            } else {
                err = fmt.Errorf("panic: %v", r)
            }
            err = xerrors.Extend(stacktrace.GetStack(panicStackDepth, true), err)
            err = errclass.WrapAs(err, errclass.Panic)
        }
    }()
    return f()
}

Which does give us the stacktrace we want to see:

{
    "func": "main.c"
}
{
    "func": "main.b"
}
{
    "func": "main.a"
}
{
    "func": "Unpanic"
}
{
    "func": "main.main"
}

We actually just introduced a very subtle bug. I didn’t even notice until I was writing this article. See if you can figure it out for yourself before looking.

Hint

It’s purely a bug in performance, not in correctness.

Answer

Since we call stacktrace.GetStack directly, if we did something like:

func f() error {
    err := stacktrace.Wrap(errTest)
    panic(err)
}

Then the stacktrace would be generated twice: once inside f() and once again inside the defer of Unpanic. However, the second one would then be tossed out in the call to xerrors.Extend since the error already has a stacktrace.

To fix this, we should follow the example in Wrap and check for an existing stacktrace first:

if _, ok := xerrors.Extract[stacktrace.StackTrace](err); !ok {
    err = xerrors.Extend(stacktrace.GetStack(panicStackDepth, true), err)
}

In reality this almost certainly won’t matter, but we are writing a library for anyone to use as they see fit. You never know what your users might actually need.
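The check-before-capture idea can be sketched with nothing but the standard library. Everything here is an assumption made for illustration: stackError, withStack, errBase, and the captures counter are hypothetical and only imitate what xerrors.Extract and stacktrace.GetStack do in the real packages.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// stackError is a hypothetical error type that carries program counters.
type stackError struct {
	err error
	pcs []uintptr
}

func (e *stackError) Error() string { return e.err.Error() }
func (e *stackError) Unwrap() error { return e.err }

// errBase is the error used in the demonstration below.
var errBase = errors.New("something went wrong")

// captures counts how many times a stack was actually collected.
var captures int

// withStack attaches a stack only when no stackError exists anywhere in
// the chain, so the relatively expensive capture happens at most once.
func withStack(err error) error {
	var existing *stackError
	if errors.As(err, &existing) {
		return err // already has a stack; skip the capture entirely
	}
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs) // start at the caller of withStack
	captures++
	return &stackError{err: err, pcs: pcs[:n]}
}

func main() {
	once := withStack(errBase)
	twice := withStack(fmt.Errorf("panic: %w", once))

	fmt.Println(captures) // 1
	fmt.Println(twice)    // panic: something went wrong
}
```

The second call finds the existing stack through the %w chain and returns immediately, which is the behavior the fix above is after.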

Gotcha: panics inside goroutines
#

The funny thing about goroutines is they have their own stack. That’s kinda the point. This means that a recover in one goroutine can’t catch a panic that happens in another. For example:

func g() error {
    panic("uh-oh")
}

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
    f := func() error {
        go g()
        time.Sleep(time.Second) // Hack to ensure timing. Don't do this.
        return nil
    }
    if err := calm.Unpanic(f); err != nil {
        logger.Error("problem", xerrors.Log(err))
    }
}

You might hope that because we call f inside calm.Unpanic, it would catch the panic from g. It doesn’t; the program crashes instead.

It doesn’t because it can’t. When we call go g() the function g gets its own stack, and has no way to return to f let alone the defer statement we wrote inside Unpanic. This is a crucial point to understand that I cannot stress enough.

The proper way to take care of this requires you to actually get an error value back from a goroutine. There are several ways you might do this, but given how common the use case is, there exists an off-the-shelf solution that should be your go-to: golang.org/x/sync/errgroup.
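For comparison, one of the do-it-yourself ways is a plain channel: recover inside the new goroutine and send the resulting error back. This is only a sketch; safely is a simplified, hypothetical stand-in for calm.Unpanic (no stack trace, no class), and runAsync is a made-up helper.

```go
package main

import "fmt"

// safely runs f and converts a panic into an error. It is a simplified,
// hypothetical stand-in for calm.Unpanic.
func safely(f func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	return f()
}

// runAsync starts f in a new goroutine. The recover happens inside that
// goroutine, on its own stack, and the result travels back on a channel.
func runAsync(f func() error) <-chan error {
	done := make(chan error, 1) // buffered so the goroutine never blocks
	go func() {
		done <- safely(f)
	}()
	return done
}

func main() {
	err := <-runAsync(func() error {
		panic("uh-oh")
	})
	fmt.Println(err) // panic: uh-oh
}
```

The important detail is where the deferred recover lives: in the goroutine that panics, not in the one that started it.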

Introducing The `errgroup` Package

We can easily wrap golang.org/x/sync/errgroup so that calls to Go and TryGo wrap f in calm.Unpanic:

func (g *Group) Go(f func() error) {
    g.group.Go(func() error {
        return calm.Unpanic(f)
    })
}

and

func (g *Group) TryGo(f func() error) bool {
    return g.group.TryGo(func() error {
        return calm.Unpanic(f)
    })
}

This way we can leverage errgroup when we wish to recover from a panic inside a goroutine. This isn’t a cure-all of course, because as we’ve seen, if f itself spawns one or more goroutines, there’s nothing we can do about a panic inside of those.

Moreover, if you actually read the source code for golang.org/x/sync/errgroup, the authors call out issues #53757, #74275, #74304, and #74306 as reasons why they don’t do this themselves. It’s clearly a long and storied history that we should not dismiss out of hand.

Here’s a summary of those issues:

#53757 — The original proposal, initially accepted. The argument for recovery was consistency: a panic in a goroutine crashes the program immediately, but the same panic in sequential code is recoverable. Bryan Mills argued recovering and re-raising at Wait() makes “concurrency an internal detail.”

#74275 — The implementation was reverted. The core problem: recovering panics delays their propagation “arbitrarily far into the future.” A bug that used to cause an immediate, obvious crash would instead let the program continue running in a broken state until Wait() is called — making mistakes latent rather than promptly discovered.

#74304 — Confirmed the revert. The approach of catching panics in goroutines and re-raising at Wait() caused cascading problems across dependent projects.

#74306 — A follow-up proposed making recovery opt-in via a Propagate field. Rejected. Alan Donovan’s reasoning: users who want this behavior can implement it themselves on top of errgroup. The standard library shouldn’t add niche toggles.

This final issue is actually great news for us:

users who want this behavior can implement it themselves on top of errgroup

That’s exactly what we have just done. Anyone who wishes to use our version of errgroup must consider the trade-off in exactly when a panic is going to be seen, but that’s precisely the point of the calm package in the first place.
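To make that trade-off concrete without pulling in the external dependency, here is a rough standard-library approximation of the idea. It is not errgroup and not our wrapper: group, protect, and their behavior (keep the first error, no context cancellation, no limits) are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// group is a hypothetical, stdlib-only approximation of an errgroup-style
// type: Wait blocks until every function has finished and returns the
// first non-nil error that was recorded.
type group struct {
	wg   sync.WaitGroup
	once sync.Once
	err  error
}

// protect converts a panic in f into an error, standing in for calm.Unpanic.
func protect(f func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	return f()
}

// Go runs f in a new goroutine and recovers any panic inside that goroutine.
func (g *group) Go(f func() error) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		if err := protect(f); err != nil {
			g.once.Do(func() { g.err = err })
		}
	}()
}

// Wait returns only after all goroutines are done, so a recovered panic
// is not visible to the caller any earlier than this.
func (g *group) Wait() error {
	g.wg.Wait()
	return g.err
}

func main() {
	var g group
	g.Go(func() error { return nil })
	g.Go(func() error { panic("uh-oh") })
	fmt.Println(g.Wait()) // panic: uh-oh
}
```

Note how the panic surfaces only at Wait, after the other goroutines have run to completion. That delay is the cost the upstream discussion was worried about.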

Next Steps

The complete code for our calm and errgroup packages can be found on GitHub. Note that the code will likely differ from what was presented here, as it is expanded and improved over time.
