In the last two weeks, Placemark's uptime graph has been unhappy, the result of several bugs in its code and the dependencies it relies on. So I've spent time improving the way we track & fix bugs. This post is a bit about the setup, plus some postmortem - some of what has been going wrong, and some of what I've been trying to improve.
The issues, in short, have been:
All of these were either already reported, or I spent some time narrowing down the cause as best I could and filing detailed bug reports. Being willing to do that - to spend a day hunting down a bug that you didn't write - seems essential. It's a lot of work, but it's far easier for you to isolate and diagnose a bug you're seeing than it is for someone without access to your setup.
All software has bugs. All software development adds bugs. It doesn't matter if you have a fancy type system or robust unit tests. You can reduce the number of bugs, but you can never really get to zero.
What matters is your ability to notice, diagnose, and fix those bugs: to create a feedback loop. In a perfect world, every bug would have a detailed and accurate report that let you track down its origin. Every bug that happened in production could be reproduced locally. And once you had a fix, you could deploy it immediately and confirm that the bug was gone.
That's the goal. We don't live in a perfect world, and usually none of these things are absolutely true. For example, the issues with Blitz I encountered only manifested in production. The issues with Sentry produced a bug report that didn't point to any particular system or line of code. Interpreting bugs shouldn't be an art, but it is.
But to even have a fighting chance to achieve these goals, you need a few systems.
First, error tracking. I use Sentry, because I have a lot of experience using it at Mapbox and Observable, and they have an actively-maintained integration for Next.js, which works okay with Blitz. If you only had one system, I think it'd be error tracking. Sentry aims to hook into all of the ways your application can crash or produce an error, and send all of those errors to its service. Then, you can look at them all on a dashboard and try and fix everything on that list, one by one.
Ironically, one of my issues was caused by Sentry's integration - its low-level hooks into my server were changing the behavior of Node's streams, which caused an internal error in Node, and Node crashes hard when that happens.
Then, logging. Render has rudimentary logging built in, but I recently added Logtail for longer-retention logs that I can search through. Logging & error tracking can really be two sides of the same coin - some errors produce logs, and some logs are captured by Sentry as the context to errors. But there are important details that are captured only by logs - errors from system-level software, logs of deployments, debugging information, timing data. Logs are an unstructured mess, but if you spend some time searching through them, they can often answer hard questions about out-of-memory errors and the like.
Patterns. Here's the part where it's my fault. There are ways to write code that give good tracebacks, and ways to screw it up. For example, say your code relies on some Promise value, and instead of using await syntax you call its "then" method. If you don't also call the catch method, you probably have a silent failure in your application.
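A minimal sketch of the two patterns - the save function here is hypothetical, standing in for any async operation:

```typescript
// Bad: if save() rejects, nothing handles the failure. Node reports an
// unhandled rejection - and recent Node versions crash the process.
function saveQuietly(save: () => Promise<void>): void {
  save().then(() => console.log("saved"));
}

// Better: pair then with catch, so the failure is logged and re-thrown
// where an error tracker can still see it.
function saveLoudly(save: () => Promise<void>): Promise<void> {
  return save()
    .then(() => console.log("saved"))
    .catch((err) => {
      console.error("save failed", err);
      throw err;
    });
}
```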
Brief rant: silent failure is the worst kind of failure. I've noticed a habit among early-career software engineers: they tend to fix the error, not the bug. They'll see a crash and wrap it in try…catch so that it stops complaining. But the complaining isn't the issue: the bug is. Prefer loud failure. Heck, the Unix philosophy has said this for decades:
Rule of Repair: When you must fail, fail noisily and as soon as possible.
To fix bad error patterns, I've dialed up some static analysis: the no-floating-promises rule in typescript-eslint prevents that uncaught Promise rejection error, and the require-await rule catches me when I use an async function for no reason - a practice that makes my error reports worse.
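For reference, the relevant typescript-eslint configuration looks roughly like this - both rules need type information, which is why the project setting is required:

```json
{
  "parser": "@typescript-eslint/parser",
  "parserOptions": { "project": "./tsconfig.json" },
  "plugins": ["@typescript-eslint"],
  "rules": {
    "@typescript-eslint/no-floating-promises": "error",
    "@typescript-eslint/require-await": "error"
  }
}
```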
And, of course, tests. I'm using Code Climate, which is remarkably 'free forever' for teams of under 4 people. My test runs produce code coverage statistics that I send to Code Climate, which turns them into nice charts of coverage over time. Code coverage is only one limited way to measure test quality, and ironically the easiest things to test tend to be the least buggy parts of an application. But it's good to have a metric nudging me to write more tests. The next frontier of testing for me is true integration testing, plus concurrency testing for the complex backend pieces.
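Generating those coverage statistics is a small config change. A sketch, assuming a Jest test runner (the post doesn't name one) - lcov is a format that Code Climate's test reporter can ingest:

```js
// jest.config.js - a hypothetical sketch, not the project's actual config.
module.exports = {
  collectCoverage: true,
  coverageReporters: ["lcov", "text-summary"],
};
```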
There are a few ingredients that have helped tremendously in fixing Placemark & confirming those fixes:
Using a healthcheck endpoint with Render, I can do zero-downtime deploys and also prevent broken servers from being deployed. Render simply waits until an endpoint in Placemark returns the right response before replacing the old server with a new one. Deploys take about 4 minutes right now, which I'd love to reduce but feels basically acceptable.
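The endpoint itself can be trivial. Here's a framework-agnostic sketch in plain Node - checkDependencies is a hypothetical stand-in for whatever "is this server actually ready?" means for the app, like the database being reachable and migrations applied:

```typescript
import * as http from "http";

// Hypothetical readiness check; replace with real dependency probes.
async function checkDependencies(): Promise<boolean> {
  return true; // placeholder
}

const server = http.createServer(async (req, res) => {
  if (req.url === "/healthcheck") {
    const healthy = await checkDependencies();
    res.statusCode = healthy ? 200 : 503;
    res.end(healthy ? "ok" : "unhealthy");
  } else {
    res.statusCode = 404;
    res.end();
  }
});
```

The key behavior is returning a non-200 status when the server isn't ready, so the deploy platform refuses to swap it in.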
By integrating heavily with Sentry's releases support, my error tracking knows when new versions of Placemark are released. This means that I can mark a bug as "fixed" in one release, and if it crops up in another, I can cross-reference my development history on GitHub against the behavior of deployed code to bisect the regression.
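The wiring for this is mostly one field: passing a stable version string to Sentry.init as the release. A sketch - I'm assuming Render's RENDER_GIT_COMMIT environment variable (the deployed commit SHA) as the identifier, but any per-deploy string works:

```typescript
// A sketch of server-side Sentry initialization, not the actual config.
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // Sentry groups errors by this string, so each deploy gets its own bucket.
  release: process.env.RENDER_GIT_COMMIT,
});
```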
Thanks to committed maintainers and fast release cycles, this week should see the end of these crasher bugs. But in the meantime, I'm happy to have built some systems that keep the quality up.