Post Mortem New Years Eve

February 09, 2023 • Tiger Team X

[Image: corruption.jpg]

What Happened

On December 31st, 2022, near midnight, our production environment suffered an outage. It was caused by corruption in our PostgreSQL database, but why the corruption occurred in the first place was initially unclear.

All attempts to start the Docker Compose instance failed. We had made backups of the production environment, or so we thought, but when we tried to restore them, we found they contained only empty databases. The backups had appeared successful but lacked any data.

The issue was traced to multiple Postgres instances started by Docker Compose, competing with the actual database. The backup script ran without triggering an error, but it had connected to one of the empty instances, so it produced empty dumps.
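In hindsight, a simple sanity check would have caught this before any dump was written. Below is a minimal sketch of the kind of guard that was missing, assuming psql and pg_dump are on the PATH; the database name and the list of key tables are hypothetical:

```python
# Sketch of a pre-backup sanity check: refuse to dump a database whose key
# tables are empty, which is what connecting to a freshly initialized (wrong)
# Postgres instance looks like. All names here are hypothetical.
import subprocess
import sys

DB = "app"                                   # hypothetical database name
KEY_TABLES = ("users", "orders", "events")   # hypothetical must-have tables

def table_count(table: str) -> int:
    out = subprocess.run(
        ["psql", DB, "-tAc", f"SELECT count(*) FROM {table}"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

for table in KEY_TABLES:
    if table_count(table) == 0:
        sys.exit(f"Backup aborted: '{table}' is empty. Wrong instance?")

subprocess.run(["pg_dump", "-Fc", "-f", "backup.dump", DB], check=True)
```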

How We Partially Recovered

We were able to partially recover using a 1.5-month-old backup that still contained some of the relevant data.

Making good use of our logs, we recovered additional information and got the database running again, though the result wasn't ideal, as some of the data was still corrupted. Scripts and manual data insertion then brought the database back to its full state.
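To give a flavor of the log-based recovery, here is a minimal sketch of replaying events from application logs back into Postgres. The log format, the regex, and the orders table are hypothetical stand-ins; the real scripts were specific to our application:

```python
# Sketch of replaying application logs into Postgres. The line format and
# the orders table are hypothetical illustrations.
import re
import subprocess

LINE = re.compile(r"(?P<ts>\S+) ORDER_CREATED id=(?P<id>\d+) user=(?P<user>\d+)")

with open("/var/log/app/app.log") as f:      # hypothetical log path
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        # ON CONFLICT keeps the replay idempotent: rows that survived the
        # corruption are left untouched.
        sql = (
            "INSERT INTO orders (id, user_id, created_at) "
            f"VALUES ({m['id']}, {m['user']}, '{m['ts']}') "
            "ON CONFLICT (id) DO NOTHING"
        )
        subprocess.run(["psql", "app", "-c", sql], check=True)
```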

Process Changes

To mitigate the risk of such an incident recurring, we've taken precautions and put procedures in place.

The production environment and code were in a pretty cluttered state due to over 6 years of development and infrequent refactoring.

In January we made a collective effort to simplify and streamline the code and production environment.

The system now runs on bare metal instead of Docker Compose, and the number of dependencies has been slashed to a minimum.

We're not the first to take such steps, and we surely won't be the last on the quest to reduce complexity:

https://world.hey.com/dhh/they-re-rebuilding-the-death-star-of-complexity-4fb5d08d

This should prevent the incident described above from recurring.

To ensure that our backups actually work and that the data they contain is intact, we had to put real verification in place, not just trust that the backup script exits cleanly.

A Python Fabric script has been created that can be run with one command to reset the staging environment from the latest backup. In a nutshell, this is what it does (a sketch follows the list):

  • The script retrieves the backup from a backup server (located on a different continent)
  • Removes sensitive information
  • Updates developer and tester accounts to admin status
  • Overrides the staging environment database
  • Prints out the most recent events
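Here is a minimal sketch of such a script, assuming Fabric 2.x and pg_restore on the hosts; the host names, paths, and table/column names are hypothetical stand-ins for our actual setup:

```python
# Sketch of the staging-reset task. Hosts, paths, and schema are hypothetical.
from fabric import Connection

BACKUP_HOST = "backup.example.com"    # backup server on another continent
STAGING_HOST = "staging.example.com"
BACKUP_PATH = "/backups/latest.dump"

def psql(c: Connection, sql: str) -> None:
    c.run(f'psql app_staging -c "{sql}"')

def reset_staging() -> None:
    # 1. Retrieve the latest backup and ship it to the staging host.
    Connection(BACKUP_HOST).get(BACKUP_PATH, local="latest.dump")
    staging = Connection(STAGING_HOST)
    staging.put("latest.dump", remote="/tmp/latest.dump")

    # 2. Override the staging database with the backup.
    staging.run("dropdb --if-exists app_staging && createdb app_staging")
    staging.run("pg_restore -d app_staging /tmp/latest.dump")

    # 3. Promote developer and tester accounts to admin.
    psql(staging, "UPDATE users SET is_admin = true "
                  "WHERE email LIKE '%@tigerteamx.com'")

    # 4. Remove sensitive information from the remaining accounts.
    psql(staging, "UPDATE users SET email = 'user' || id || '@example.com' "
                  "WHERE NOT is_admin")

    # 5. Print the most recent events; stale timestamps are a red flag.
    psql(staging, "SELECT * FROM events ORDER BY created_at DESC LIMIT 20")

if __name__ == "__main__":
    reset_staging()
```

Because Fabric raises on any non-zero exit by default, a broken or empty backup makes the whole run fail loudly instead of silently succeeding.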

Positive outcomes of this measure:

  • Ensures that the staging environment stays up to date
  • Verifies that the backups are working properly; if an error is triggered, we know our database backup might be incomplete
  • Checking the most recent events when an error is detected can reveal the problem and show whether the data is being restored from an old backup

Refactoring has also been made an ongoing priority. In the past 5-6 years it was not given the necessary importance; from now on, at least one week of refactoring per half-year is planned and scheduled in the team lead's calendar.

Going Forward

We plan to improve our image backup process. We currently have hundreds of gigabytes of images that are backed up but never verified by a routine check.

Over the next few weeks, we will create a script that compares the images in the production and backup systems to ensure they are up to date and that there are no discrepancies. This is expected to take a significant amount of time, as it involves a large amount of data.
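A minimal sketch of such a comparison, assuming both image trees are reachable as local directories (for example via a mount or a sync); the paths are hypothetical:

```python
# Sketch of the production-vs-backup image check: hash every file in both
# trees and report anything missing or differing. Paths are hypothetical.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(chunk)
    return h.hexdigest()

def checksums(root: Path) -> dict:
    return {str(p.relative_to(root)): sha256(p)
            for p in root.rglob("*") if p.is_file()}

prod = checksums(Path("/srv/images"))            # hypothetical production tree
backup = checksums(Path("/mnt/backup/images"))   # hypothetical backup mount

missing = prod.keys() - backup.keys()
changed = [p for p in prod.keys() & backup.keys() if prod[p] != backup[p]]

print(f"{len(missing)} images missing from backup, {len(changed)} differ")
```

Hashing in 1 MiB chunks keeps memory usage flat even across hundreds of gigabytes, which is why the run will take a while but won't fall over.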

If you found this interesting, you will likely find How We Do Software interesting as well.

Released

This post has been released with our client’s gracious approval.