I broke production and could have noticed it was going to happen one week earlier

This story recounts how I caused disruptions in several production services. I detest causing problems in production, so I strive to prevent them. However, sometimes my efforts are insufficient, and errors occur.

I firmly believe failing is acceptable, provided we learn from our mistakes. By sharing our failures, however uncomfortable that might be, we give others an opportunity to learn from them as well.

About the incident

A feature/fix intended to simplify the process of moving services to new hosts resulted in over five hours of downtime for requests targeting the publicly visible hostnames of three services. Operations’ efforts to resolve these issues inadvertently caused additional downtime across more services.

Blue/green deploy

The service in question facilitates blue/green deployments, supporting zero-downtime deployment of applications. It uses a proxy to route all requests to the current live version while deploying a new version to a staging site. Once the staging site is ready, all traffic is redirected to it, and the old site is decommissioned, ensuring no requests are lost during application updates.
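As a rough illustration of the switch described above, here is a minimal Python sketch. The class, the slot names "blue" and "green", and the `is_healthy` callback are all hypothetical and not taken from the actual service; the sketch only models which slot receives traffic.

```python
class BlueGreenProxy:
    """Toy model of a blue/green switch; all names are illustrative only."""

    def __init__(self):
        self.slots = {"blue": None, "green": None}  # slot -> deployed version
        self.live = None                            # slot currently receiving traffic

    def staging_slot(self):
        # The slot that is not live is where the next version gets deployed.
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version, is_healthy):
        target = self.staging_slot()
        self.slots[target] = version
        if not is_healthy(version):
            # Staging never became ready: live traffic is left untouched.
            self.slots[target] = None
            return False
        old = self.live
        self.live = target          # redirect all traffic to the new version
        if old is not None:
            self.slots[old] = None  # decommission the old site
        return True

    def route(self):
        # Version that currently answers requests, or None before first deploy.
        return self.slots[self.live] if self.live else None
```

The point of the design is that a failed deployment never touches the live slot, so requests keep being served by the old version.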

This adds complexity to the infrastructure surrounding an application, and is thus only used for applications where downtime causes serious disruptions.

The feature/fix for moving servers

The last time we moved a zero-downtime site to a new server, we encountered an issue. The deployment to the new server tried to contact the old server, because the DNS entry still pointed there.

Our setup uses the .local top-level domain internally, while some services are also accessible externally via .com.

For instance, we had server-old and server-new. The DNS entry for app.local pointed to server-old. Deploying to server-new caused it to contact the proxy at https://app.local/_proxy, which in turn redirected to https://server-old/_proxy instead of https://server-new/_proxy.

Fortunately, our proxy is designed to accept configuration requests only from localhost. This prevented the live application at server-old from being corrupted by the new deploy to server-new, instead triggering an error and halting the deployment to server-new.
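A loopback-only guard of this kind can be sketched in a few lines of Python. This is an assumption about the general shape of such a check, not the proxy's actual code; the function name is made up for the example.

```python
import ipaddress


def is_config_request_allowed(remote_addr: str) -> bool:
    """Accept configuration requests only when the caller is on loopback."""
    try:
        return ipaddress.ip_address(remote_addr).is_loopback
    except ValueError:
        # Anything that is not a valid IP address is rejected outright.
        return False
```

With a check like this, a request that arrives from server-new at server-old carries a non-loopback source address and is refused, which is what stopped the deployment.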

However, this situation made zero-downtime server migration challenging. Operations modified the hosts file so app.local resolved to server-new, preventing requests from reaching server-old. After verifying the application worked, the DNS could be pointed to server-new.

To circumvent these DNS issues, we revised the scripts and proxy configuration to exclusively utilize localhost.

The following table shows the old and new behavior.

Old/New  URL               Host       Site binding
Old      http://app.local  app.local  app.local
Old      http://app.local  app.com    app.local
New      http://localhost  app.local  app.local
New      http://localhost  app.com    app.local

This modification introduced a critical bug. The underlying site only had a binding for app.local and none for app.com. In the old setup the proxy addressed the site as app.local, so the web server always found a matching binding. In the new setup the proxy addresses localhost, so for requests arriving via the external hostname app.local is never sent to the web server, and the web server cannot route them to the site (the last row of the table).
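The four rows can be replayed with a small Python model. It rests on one assumption about the mechanics: in the old setup the web server matched on the name in the URL (always app.local), while in the new setup it can only match on the host the client asked for. The function names and status codes are illustrative.

```python
SITE_BINDINGS = {"app.local"}  # the site was only bound to the internal name


def host_seen_by_web_server(requested_host, use_localhost):
    """Hostname the web server gets to match against its bindings."""
    if use_localhost:
        # New behavior: the proxy connects to localhost, so only the
        # client's requested host is left to match on.
        return requested_host
    # Old behavior: the proxy addressed the site as app.local.
    return "app.local"


def simulate(requested_host, use_localhost, bindings=SITE_BINDINGS):
    """Return 200 when a binding matches, 400 (invalid hostname) otherwise."""
    host = host_seen_by_web_server(requested_host, use_localhost)
    return 200 if host in bindings else 400
```

In this model `simulate("app.com", use_localhost=True)` is the only combination that fails, and passing `bindings={"app.local", "app.com"}` makes it succeed.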

Timeline of the events

March 6th, 09:11 - Introducing the bug

After completing development and testing, I merged the code, which had undergone several iterations in Development and Staging without revealing any issues.

… but our tests did not include hostnames other than app.local.

March 6th, 09:44 - Pushing the bug to Acceptance Testing (AT)

I confidently deployed to AT, where sites are also available via external hostnames, which is exactly what triggers the problem. However, I did not test with a site that used external hostnames, so I did not trigger the bug.

March 11th, 10:57 - Pushing the bug to Production

Eventually, I deployed to Production. However, because the service is designed to avoid downtime, the updated version is only activated when a site's proxy (there is one instance per site) is manually restarted.

Patching of servers scheduled for March 14th, 01:00

Regular patching of our VMs necessitated scheduled downtime on our production servers a few days post-deployment. A reboot post-security patching would transition all sites under the zero-downtime proxies to utilize the new proxy version, routing requests over localhost.

I had notified Operations that the scheduled server reboot would bring up the new proxies, which use the loopback interface, so at least they knew to contact me when things exploded.

March 12th, 22:00 - A single proxy restarted in AT

Nearly six days after the AT deployment, and just one day before the production update, a proxy for a single externally visible service was restarted (I have yet to diagnose why this happened). This action activated the buggy version in AT, which should have led to noticeable problems.

Yet, the issue went unnoticed…

March 13th, 13:12 - A user notices a failure in AT

15 hours after the errors started coming in, a developer reported encountering an error for a site in AT:

Bad Request - Invalid Hostname

HTTP Error 400. The request hostname is invalid.

Several individuals investigated the issue. Initially, I suspected alterations to the public-facing proxy, but no changes were identified. Subsequently, we considered a modification to the local DNS potentially rerouting app.com requests to app.local, a practice occasionally observed in the past for unknown reasons.

However, the day concluded without an explanation of the host issues. My changes had been operational in AT for seven days, so it never occurred to me that my code might be the culprit.

… and we were merely 11 hours from this issue affecting Production.

March 14th, 01:06 - Production servers restarted

Server patching commenced, and reboots followed. Monitoring was disabled during patching to prevent alert floods. Once patching and reboots concluded, I presume monitoring was reactivated, revealing that synthetic monitors for three services were non-functional.

01:06-05:15 - Operations trying to figure out what’s happening

I have not delved into all the actions taken, but I observed additional reboots, redeployments, and manual deletions of sites on the web server.

04:41-05:15 - Operations tries to reach me

I received calls at 04:41, 05:14, and 05:15, awakening only for the last one. My phone was set to Bedtime mode, which only allows for vibration, and since I do not keep my phone within arm’s reach, I missed the earlier calls.

05:18 - Started looking into it

Recognizing the caller, I immediately knew there was a problem with my code. I logged in promptly to assess the situation.

05:22 - Deleting the partial sites

I discovered a corrupted state on the web servers where only the proxies were installed, without the actual sites. I removed the proxies as well, because the deployment script lacks a complete desired-state configuration and instead relies on the presence of the proxies to determine whether the infrastructure is correctly set up.

05:24 - Redeploy done

Minutes later, the sites were redeployed, and my tests confirmed that normal operations had been restored.

05:42 - Understanding it’s still down for the public address

Despite resolving the internal issues, monitors continued to indicate failures for public requests. Recalling the issue in AT reported the previous day, I figured we could use AT for some testing to expose the underlying cause.

06:09 - Pushing out a debug version with some logging to AT

Lacking sufficient diagnostic information, I created a branch with additional debug logging and deployed it to AT. This let us see the URL and header information at the right time, and made it clear that the web server could not route the requests because it never received the information it needed to do so.

We contemplated various solutions, including rolling back to a previous version, attempting a hack to route the requests correctly, eliminating zero-downtime deployments, and manual interventions.

Ultimately, we manually added bindings for app.com to the sites. This was concluded to be safe as the proxy did not alter the blue/green site bindings during deployment.

06:26 - All good

Shortly thereafter, we manually added bindings for the affected services in Production and AT, finally restoring normal operations also for the external hostnames.

07:10-08:00 - Breakfast and commute to the office

08:00-09:25 - Fix for adding additional hosts to listen to as Pull Request

A more sustainable solution was necessary. I considered several options:

  1. Reverting the “route using localhost” approach, accepting the migration challenges.
  2. Introducing a custom header and middleware, diverging from the standard “forwarded headers” middleware, necessitating updates to all projects for compatibility with our blue/green proxy.
  3. Enabling listening on public hostnames, which introduces maintenance challenges due to the potential for hostname changes.

I opted for the third alternative as a quick fix, avoiding project modifications, though the second option might offer a more permanent solution by eliminating hostname duplication.

What went wrong?

Overlooking Public Address Testing

It never occurred to me to test the sites with public hostnames. This test was crucial to verify that additional hostnames, as well as multiple proxy hops, work.

Incomplete Proxy Restart in Test Environments

Our service, designed to support zero downtime, requires manual intervention to induce downtime. However, my testing was incomplete; I only restarted and tested a subset of sites and failed to apply this process across all services in ST and AT. A comprehensive restart across all environments was necessary to validate the update thoroughly.

Absence of Preliminary Manual Testing in Production

We proceeded with a comprehensive upgrade of all production services simultaneously, overnight, while I was sleeping. Coupled with concurrent changes, this approach significantly increased the risk of complications, deviating from best practices of incremental updates.

Lack of Synthetic Monitoring in AT

We have 24/7/365 on-site personnel monitoring our software. We want to avoid false positive alarms, and alerts in AT probably do not warrant calling a developer at night, nor spamming alerts so that actual production alerts go unnoticed.

Consequently, the errors that emerged 50 hours before affecting production remained undetected.

Inadequate Monitoring in AT

The AT environment suffers from insufficient automated monitoring, leading to a delay in issue detection. This resulted in a narrow window of approximately 15 hours (but only about 3 working hours) to identify and address the issues before they impacted production.

Corrective actions

Calls during “Bedtime mode” should make sound

I modified “Bedtime mode” to allow sound for calls and not just vibration. This change ensures that I can be reached during critical situations, even if my phone is not within arm’s reach.

Add bindings for public hostnames

To address the immediate issue, we implemented a short-term solution to also listen on the public names and not just the internal names. This ensured that requests coming from external hostnames could be routed correctly to their intended destinations.

More monitoring in AT

Recognizing that AT is our last chance to detect problems before they affect Production, we need to enhance monitoring in this environment. Even with my faults during this update, better monitoring would have given us a window of 50 hours, rather than just 15 (or effectively 3 working hours), to investigate, fix, and/or postpone the upgrade.
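A synthetic check for AT does not need to be elaborate. The sketch below is one possible shape, using only the Python standard library; the probed path "/" and the rule that anything other than HTTP 200 counts as a failure are assumptions, and the hostnames in the usage note are the placeholder names from this post.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code            # e.g. 400 for "Invalid Hostname"
    except (urllib.error.URLError, OSError):
        return None                # DNS failure, refused connection, timeout


def failing_hostnames(hostnames, fetch=http_status):
    """Probe every hostname and return {hostname: status} for the failures."""
    failures = {}
    for host in hostnames:
        status = fetch(f"https://{host}/")
        if status != 200:
            failures[host] = status
    return failures
```

Calling `failing_hostnames(["app.local", "app.com"])` on a schedule, and reporting a non-empty result to a channel that does not page anyone at night, would cover both the internal and the public name of a site without adding to the production alert noise.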

Better documentation and process for testing

It became clear that the main issue was insufficient testing. Moving forward, testing should include restarting all services in ST and all in AT. Documentation and testing processes will be updated to clearly state this requirement to ensure thorough testing for future upgrades.

Update Production manually

Reflecting on the deployment process, a “big bang” release, especially overnight, is not advisable. I should have updated each site in Production manually and verified they were working before moving on to the next. This approach minimizes risk and ensures that any potential issues can be identified and addressed in a controlled manner.
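The one-site-at-a-time approach can be expressed as a small loop. This is only a sketch: `restart_proxy` and `verify` are hypothetical callbacks standing in for whatever restarts a site's proxy and checks that the site still answers.

```python
def rolling_update(sites, restart_proxy, verify):
    """Restart one site's proxy at a time and stop at the first failed check.

    Returns (updated_sites, failed_site); failed_site is None on full success.
    """
    updated = []
    for site in sites:
        restart_proxy(site)
        if not verify(site):
            # Stop here: the remaining sites keep running the old proxy.
            return updated, site
        updated.append(site)
    return updated, None
```

Stopping at the first failure limits the damage to a single site, instead of every site picking up the new proxy in the same reboot.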

Use custom header to avoid duplicating hostnames

Finally, I began contemplating a more sustainable solution to the problem of handling requests from both internal and public hostnames without requiring duplicate hostname bindings. One potential approach is to use our own custom header to carry the intended app.com name while the Host header remains app.local to ensure correct routing. This would necessitate modifying the middleware to copy this custom header to X-Forwarded-Host before handing execution over to a general “forwarded headers” middleware.
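To make the idea concrete, here is a minimal sketch written as Python WSGI middleware. It is illustrative only: the header name X-Original-Host is invented for the example, and real projects would need the equivalent in whatever framework they use.

```python
class CustomHostHeaderMiddleware:
    """Copy a custom header into X-Forwarded-Host before later middleware runs."""

    def __init__(self, app, header="X-Original-Host"):  # hypothetical header name
        self.app = app
        # WSGI exposes request headers as HTTP_<UPPERCASE_WITH_UNDERSCORES>.
        self.key = "HTTP_" + header.upper().replace("-", "_")

    def __call__(self, environ, start_response):
        public_host = environ.get(self.key)
        if public_host:
            # The Host header is left alone (app.local) so routing still works;
            # only the forwarded host is set for downstream middleware to read.
            environ["HTTP_X_FORWARDED_HOST"] = public_host
        return self.app(environ, start_response)
```

The middleware would be wrapped around the application before any generic forwarded-headers handling, so that handling sees app.com as the forwarded host without the site needing a binding for it.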

Conclusion

  • Lack of automated testing leads to fragile systems.
  • Manual testing needs to follow a diligent documented process.
  • Testing environments which differ too much from Production are unable to identify issues that arise in Production.
  • Lack of monitoring in testing environments makes Production a testing environment.

NOTE: I’ve used ChatGPT 4 as a proofreader and language assistant.

Date: 2024-03-14 Thu 00:00

Author: Simen Endsjø

Created: 2024-08-27 Tue 22:24