uFincs Update #5
DevOps Detour - Part 1
Today, in the inconsistently scheduled uFincs Update series, we're taking a little detour.
DevOps is the matter concerning me today. And as much as the 'dev' and 'ops' teams at uFincs are highly integrated, it's more about Devin + Ops than anything :)
Last time was all about being positive, talking about business prospects and marketing strategies. You know, the things I should be focusing on.
Well, this isn't a 'detour' for nothing.
Deciding What to Do
As last month came to a close, I was put into an awkward position in terms of my scheduling.
See, on the first of each month, I usually put together a plan for what I want to accomplish that month and how I'm going to do it. I also do similar planning sessions at the start of each week.
Well, April 1st awkwardly landed towards the end of the week. So, instead of planning April early (which, as I laid out in Update #4, was gonna be all about marketing), I decided to deal with some DevOps tasks that I had been putting off.
I figured they'd be a good break from the usual frontend/backend work of feature development and be relatively quick to do.
tl;dr one thing just led to another and this detour took a bit longer than I expected.
For the longest time, uFincs' database backup solution was a daily Kubernetes cronjob that dumped the whole database and saved it to a Google Cloud Storage (GCS) bucket. There were a couple of improvements I had been meaning to make to this system:
Encrypt the backups.
Why? Cause other people say to encrypt your backups.
I mean, I guess this protects against an attacker breaking into my GCP account and stealing the plaintext backups, but there are worse things to worry about in the case of an account breach.
I suppose this is more relevant to organizations where employee count > 1, to prevent unauthorized data access, but nonetheless, I figured it'd be a good layer of protection to have.
Auto delete old backups.
I mean yeah, it's good to have, especially considering I've got some 1+ year-old backups at this point.
Not to mention I've got a bunch of generally pointless backups lingering around, thanks to our deployment setup (per-branch deployments courtesy of Kubails).
Copy backups somewhere else.
Remember kids, 3-2-1 backups, so having a separate + remote place to keep a copy of backups is always good.
I figured an S3 bucket in AWS is 'good enough' for our purposes.
Restore the latest backup to new namespaces.
Like I said above, we make use of per-branch deployments. This means that every feature branch in development gets its own namespace on our Kubernetes cluster so that we can fully deploy and test every change we make before it hits production. But because production runs using the same mechanism (in the same cluster!), that means that each 'feature namespace' is, in itself, effectively a copy of production.
However, it isn't a true copy of production, since the database only gets bootstrapped with the basic seed data for a test account. Or at least, it did until we decided to make this change.
By restoring the latest backup to each feature branch namespace, we'd accomplish 2 things:
We'd be able to test, using production data, everything new that we develop. This also means that getting other users to test new features would be very easy since their data would be available.
It'd satisfy the 'backup system requirement' to always be testing your backups (you know, "an untested backup isn't a backup"). And having backups tested on the frequent basis that is our development work? Even better!
And since all user data is encrypted, even if something did go wrong during development, the chances of something going seriously wrong are greatly reduced.
So yeah, these 4 things would make the uFincs backup system damn near bulletproof.
The only weakness is that, as I must sadly admit, I haven't yet delved into dealing with WAL backups (FYI, we use Postgres) for point-in-time recovery. As of yet, I've been content with just daily backups, considering the meagre size of our database.
And that will probably suffice for quite a while. If things start to pick up, I'll probably switch to twice-daily or even more frequent backups, but for now, this works.
Anyways, getting these 4 improvements in place only took about 2 days.
Encrypting Backups
Encrypting backups was done with a simple GPG key and AES-256. Encrypt the backup files with GPG, then store the GPG key in the repo itself encrypted with a GCP KMS key.
This is a common scheme that we use for secrets. We keep keys in GCP KMS that encrypt the secret values so that we can store the encrypted versions in the repo and then decrypt them during deploy time.
You might wonder then, "why not just encrypt the database files with a KMS key?" A totally valid question. As stated in the docs, KMS has a 64 KiB limit on the data it can encrypt directly. That's enough to encrypt a GPG key (creating a scheme known as 'envelope encryption'), but certainly not enough to encrypt a production database dump.
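To make the envelope idea concrete, here's a rough Python sketch that only builds the two commands involved. It assumes GPG is used in symmetric (passphrase) mode, which the post doesn't actually specify, and every file name, key ring name, and key name is a made-up placeholder. It's an illustration of the scheme, not the actual uFincs backup script.

```python
import subprocess
from typing import List

# Cloud KMS limit on directly encrypted data, as quoted in the post.
KMS_MAX_PLAINTEXT_BYTES = 64 * 1024


def needs_envelope(size_bytes: int) -> bool:
    """True when the data is too large to hand to KMS directly."""
    return size_bytes > KMS_MAX_PLAINTEXT_BYTES


def gpg_encrypt_cmd(dump_path: str, passphrase_file: str) -> List[str]:
    """Encrypt the (large) dump with GPG using AES-256.

    Assumption: symmetric mode with the secret read from a file.
    """
    return [
        "gpg", "--batch", "--yes",
        "--pinentry-mode", "loopback",
        "--symmetric", "--cipher-algo", "AES256",
        "--passphrase-file", passphrase_file,
        "--output", dump_path + ".gpg",
        dump_path,
    ]


def kms_wrap_cmd(passphrase_file: str, keyring: str, key: str,
                 location: str = "global") -> List[str]:
    """Encrypt the (small) GPG secret with a KMS key so it can live in the repo."""
    return [
        "gcloud", "kms", "encrypt",
        "--location", location,
        "--keyring", keyring,
        "--key", key,
        "--plaintext-file", passphrase_file,
        "--ciphertext-file", passphrase_file + ".enc",
    ]


def run(cmd: List[str]) -> None:
    """Execute one command and raise if it fails."""
    subprocess.run(cmd, check=True)
```

The point of the split is that KMS only ever sees the tiny secret, while GPG handles the bulk data. `needs_envelope` is just the 64 KiB check spelled out: a multi-megabyte dump fails it, a key file passes.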
Auto Deleting Old Backups
Auto deleting old backups was as easy as slapping a lifecycle rule on the storage bucket. Or at least, I thought it'd be. It turned out to be slightly trickier.
See, I use Terraform (as part of Kubails) to manage all my infrastructure. I figured adding a lifecycle rule to the storage bucket was just this:
condition {
  age = 60 # Days
}
action { type = "Delete" }
This, in my mind, would enable deleting bucket objects that were older than 60 days.
Well, it turns out that using this rule wasn't enough. I don't know if this is GCP's fault or Terraform's, but a second condition ends up being set on the bucket: "Live State = Non-current".
This basically means that an object older than 60 days would only be deleted if it wasn't the 'current version' of the object. Except 'versions' only apply when you have object versioning turned on.
I didn't have object versioning turned on.
So this lifecycle rule actually did nothing.
I only realized this when, after a couple of days, the old backups in the bucket still hadn't been deleted.
So the 'correct' rule to apply is as follows:
condition {
  age        = 60 # Days
  with_state = "LIVE"
}
action { type = "Delete" }
Now 'live' objects (aka my normal objects) will be deleted.
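Why the first rule did nothing is easier to see with a toy model. The Python below is not how GCS evaluates lifecycle rules internally; it just assumes that every condition on a rule must hold before the action fires, and that an unversioned bucket only ever holds live objects. The object names and dates are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass
class StoredObject:
    name: str
    created: datetime
    is_live: bool = True  # without object versioning, every object is live


def matches_delete_rule(obj: StoredObject, now: datetime,
                        age_days: int, with_state: str) -> bool:
    """Toy model: all conditions must hold for the Delete action to apply."""
    old_enough = now - obj.created >= timedelta(days=age_days)
    if with_state == "LIVE":
        state_ok = obj.is_live
    elif with_state == "ARCHIVED":  # non-current versions only
        state_ok = not obj.is_live
    else:  # "ANY"
        state_ok = True
    return old_enough and state_ok


def to_delete(objs: Iterable[StoredObject], now: datetime,
              age_days: int, with_state: str) -> List[str]:
    """Names of the objects the rule would delete."""
    return [o.name for o in objs
            if matches_delete_rule(o, now, age_days, with_state)]


now = datetime(2000, 6, 1, tzinfo=timezone.utc)
backups = [
    StoredObject("old-backup.sql.gpg", now - timedelta(days=400)),
    StoredObject("new-backup.sql.gpg", now - timedelta(days=1)),
]

print(to_delete(backups, now, 60, "ARCHIVED"))  # [] : nothing is non-current
print(to_delete(backups, now, 60, "LIVE"))      # ['old-backup.sql.gpg']
```

With the "non-current" condition attached, even a 400-day-old backup survives, because it is still the live version of itself.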
This also turned out to be relevant for one of my other storage buckets (used for archiving logs) that had the same problem. It had been keeping around 1+ years' worth of logs even though they were supposed to be deleted after 6 months.
Goes to show how often I check the archives.
AWS Backup Replication
Copying backups to AWS was relatively simple. After I re-familiarized myself with how AWS authentication works (really, it's just access keys?), it came down to three things: adding the AWS CLI to the Docker container used for the backup job (as in, the Kubernetes cronjob), injecting the access key at run time, and copying the encrypted backup to an S3 bucket in addition to the usual GCS bucket.
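As a sketch of that final upload step, the Python below builds one copy command per destination and checks that the AWS credentials were injected before anything runs. It assumes `gsutil` is the tool used for the GCS copy (the post doesn't say) and that the access key arrives through the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables. Bucket names and paths are placeholders.

```python
import os
import subprocess
from typing import List, Mapping

# Environment variables the AWS CLI reads its credentials from.
AWS_CREDENTIAL_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def missing_aws_credentials(env: Mapping[str, str]) -> List[str]:
    """Names of the credential variables that were not injected."""
    return [name for name in AWS_CREDENTIAL_VARS if not env.get(name)]


def upload_cmds(encrypted_path: str, gcs_bucket: str,
                s3_bucket: str) -> List[List[str]]:
    """One copy command per destination: the usual GCS bucket, then S3."""
    name = os.path.basename(encrypted_path)
    return [
        ["gsutil", "cp", encrypted_path, f"gs://{gcs_bucket}/{name}"],
        ["aws", "s3", "cp", encrypted_path, f"s3://{s3_bucket}/{name}"],
    ]


def replicate(encrypted_path: str, gcs_bucket: str, s3_bucket: str) -> None:
    """Fail fast on missing credentials, then run both uploads."""
    missing = missing_aws_credentials(os.environ)
    if missing:
        raise RuntimeError("missing AWS credentials: " + ", ".join(missing))
    for cmd in upload_cmds(encrypted_path, gcs_bucket, s3_bucket):
        subprocess.run(cmd, check=True)
```

Only the already-encrypted file is uploaded, so the second cloud never sees plaintext.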
The only other thing that was notable was my decision to use a separate Terraform config/state to handle AWS infrastructure (aka the single S3 bucket and some IAM stuff). Since all of the GCP Terraform infra was integrated into Kubails, I didn't want to 'taint' it with AWS. Plus, this was just simpler.
Restoring Backups to New Namespaces
As utterly cool and useful as this feature is, it wasn't actually that hard to set up. Essentially, all I had to do was modify a step in our build pipeline to fetch the latest (encrypted!) production backup, decrypt it, and then restore it to the newly created database in the newly created namespace.
That said, it was only so simple because I had previously done the work to make deploying the database a separate step of the build pipeline (so that we could also perform migrations separately from deploying the main services). Modifying that step was fairly straightforward.
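The fetch, decrypt, restore sequence can be sketched roughly like this in Python. Several things here are assumptions rather than details from the post: backup names that sort chronologically (something like `backup-YYYY-MM-DD.sql.gpg`), a plain-SQL dump that `psql` can replay, `gsutil` for the download, and the temp file paths. Treat it as an outline of the pipeline step, not the real one.

```python
from typing import Iterable, List


def latest_backup(object_names: Iterable[str]) -> str:
    """Pick the newest backup, assuming names sort chronologically."""
    names = sorted(object_names)
    if not names:
        raise ValueError("no backups found")
    return names[-1]


def restore_cmds(bucket: str, object_name: str, passphrase_file: str,
                 database_url: str) -> List[List[str]]:
    """Commands to download, decrypt, and load a backup into a fresh database."""
    encrypted = "/tmp/restore.sql.gpg"  # hypothetical scratch paths
    plain = "/tmp/restore.sql"
    return [
        # 1. Fetch the encrypted production backup.
        ["gsutil", "cp", f"gs://{bucket}/{object_name}", encrypted],
        # 2. Decrypt it with the GPG secret (itself decrypted via KMS earlier).
        ["gpg", "--batch", "--yes", "--pinentry-mode", "loopback",
         "--passphrase-file", passphrase_file,
         "--output", plain, "--decrypt", encrypted],
        # 3. Replay the dump into the new namespace's database,
        #    stopping at the first error.
        ["psql", "-v", "ON_ERROR_STOP=1", "--file", plain, database_url],
    ]
```

Because every feature branch runs these steps, a backup that can't be decrypted or replayed shows up as a failed build rather than as a surprise during a real recovery.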
And now that I've gotten to make use of this improvement, I must admit, per-branch database restores are really cool. And useful. Usefully cool.
I thought this would be the extent of my DevOps improvements, but it just kept snowballing from there...
Stay tuned for Part 2 of the DevOps Detour!