OK, time for round 2 of the uFincs DevOps Detour that so thoroughly consumed my existence several weeks ago.
Last Time
Last time was all about backups. Encrypting backups, copying backups, deleting backups, but, most importantly of all, restoring backups.
Now it's cluster time.
Cluster Time
There were a handful of different cluster-related tasks that I wanted to take care of, but the more I did, the more tasks I found to do. Particularly security-related ones. Once I found the GKE security hardening guide, it was just a rabbit hole of tweaking and configuring, all in the hopes of 'making things more secure' (in theory).
So... I'll just go through them one by one.
Upgrading Kubernetes Version
For the longest time, I always specified the Kubernetes version directly in Terraform.
Well, turns out there's a better way.
Introducing: Google Kubernetes Engine (GKE) release channels! Just opt into a release channel (Rapid, Regular, or Stable) and GCP will handle upgrading the cluster for you!
Yeah, I'm pretty sure this feature didn't exist when I first wrote the Terraform configs; otherwise, I would have just opted into a release channel to start with.
Better late than never!
I decided to go with the Stable channel. I don't really make use of bleeding-edge Kubernetes features, and I'd rather things didn't break with my production cluster, so it only made sense. That left us on the latest patch of v1.17, one whole minor version up from our previous v1.16.
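In Terraform terms, opting in is roughly a one-block change on the cluster resource; here's a minimal sketch (the resource name, cluster name, and location are just illustrative):

```hcl
resource "google_container_cluster" "primary" {
  name     = "ufincs-cluster"  # illustrative name
  location = "us-central1-a"   # illustrative location

  # Let GCP pick and roll out Kubernetes versions from the Stable channel,
  # instead of pinning min_master_version by hand.
  release_channel {
    channel = "STABLE"
  }

  # ...rest of the cluster config...
}
```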
But at least now I don't really need to worry about version upgrades. Just let GCP upgrade the masters and nodes and that's good enough for me.
Enabling Shielded VMs
Shielded VMs are, in Google's words, "virtual machines (VMs) on Google Cloud hardened by a set of security controls that help defend against rootkits and bootkits".
In GKE terms, shielded nodes "limit the ability of an attacker to impersonate a node in your cluster".
This is nice and all but what really matters to me is that:
They are 'more secure' (for some definition of secure).
They are easily turned on.
They don't cost anything extra.
Yep, just flip a switch in the Terraform config for the cluster and everything magically becomes more secure, for free!
Hard to argue with that, so I figured it'd be a good thing to turn on.
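For the curious, the switch in question is roughly this, on the same google_container_cluster resource (trimmed down to the relevant bit):

```hcl
resource "google_container_cluster" "primary" {
  # ...

  # Shielded GKE nodes: nodes get verifiable identities, so a compromised
  # workload can't simply impersonate a node to the control plane.
  enable_shielded_nodes = true
}
```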
Enabling Secure Boot
Good old UEFI Secure Boot. The bane of every developer wanting to install Ubuntu on their new Windows laptop (or maybe that's just me).
Anyways, enabling Secure Boot for GKE nodes "helps ensure that the system only runs authentic software by verifying the digital signature of all boot components".
But again, it's just a flag to turn on, and it's free, so on it goes in the name of 'security'!
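In Terraform, it's one more nested block, this time under the node pool's node_config (sketched here assuming a separately managed google_container_node_pool; the pool name is made up):

```hcl
resource "google_container_node_pool" "primary_nodes" {
  name    = "primary-pool"  # illustrative name
  cluster = google_container_cluster.primary.name

  node_config {
    # ...

    shielded_instance_config {
      # Only boot components with verified digital signatures get to run.
      enable_secure_boot = true
    }
  }
}
```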
Switching from Docker to containerd
This is a slightly more interesting decision than just "free + more secure = turn on".
See, Kubernetes announced that they were deprecating Docker as a supported runtime (more precisely, deprecating the dockershim that let Docker act as one), in favour of containerd (basically, just a different but compatible container runtime).
Now, I'm not actually sure if there are any security benefits to making this change, but since it's a change we'd have to make eventually (and it's just a config option to change), I figured we might as well get ahead of the curve, validate that our workloads work with the change, and just commit to it.
Saves future me some trouble.
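On GKE, the config option in question is just the node image type; on that same node_config block, something like:

```hcl
node_config {
  # Container-Optimized OS with containerd as the runtime,
  # instead of the Docker-based COS image.
  image_type = "COS_CONTAINERD"

  # ...
}
```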
Switching from NodePort to ClusterIP Services
In order to understand this change, you first need to understand a bit of the internal architecture of our cluster.
See, instead of using GKE's built-in ingress, we actually use ingress-nginx (the community project, not to be confused with NGINX Inc.'s similarly named NGINX Ingress Controller). This is so that we only have to provision a single load balancer for the entire cluster, rather than having to provision one for each individual service (which would get extremely expensive very fast).
In effect, ingress-nginx acts as an internal load balancer for the cluster. Network requests first hit the GCP load balancer, get routed from there to ingress-nginx, and then get routed to the correct service based on the request's subdomain.
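Concretely, that subdomain routing is just host-based Ingress rules that ingress-nginx picks up; a trimmed sketch (the hostname, names, and ports are made up, and the v1beta1 API matches the v1.17-era cluster):

```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: frontend-ingress                 # illustrative name
  annotations:
    # Route through ingress-nginx rather than the GKE built-in ingress.
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: app.example.com              # hypothetical subdomain
      http:
        paths:
          - path: /
            backend:
              serviceName: frontend      # hypothetical service
              servicePort: 80
```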
This is especially effective for our use case since ingress-nginx can even route between namespaces. This means it can handle our production namespace as well as all our per-branch namespaces. At least when I first came up with this cluster architecture, the GKE ingress couldn't do that (and I haven't researched it since, so I still don't know). But considering how expensive load balancers are on GCP, only needing one is a boon for keeping costs low.
Anyways, this is all to say that, when I first set up the cluster, I had configured all of the main services to be of type NodePort. Why? Cause... it worked.
But then recently, after looking over some other random cluster architecture diagrams, I had the realization that "wait a minute, if ingress-nginx performs all of the routing in-cluster, then couldn't the services just be ClusterIP?".
As it turns out, yes, yes they can.
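So a service definition now looks something like this (names and ports are illustrative); the only real change is the type:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend       # illustrative name
spec:
  type: ClusterIP      # previously NodePort
  selector:
    app: frontend
  ports:
    - port: 80         # the port ingress-nginx (and other in-cluster traffic) hits
      targetPort: 3000 # hypothetical container port
```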
So now the services don't even need to expose a port on every node, once again reducing the attack surface — if only by a little bit.
Switching to Two Nodes (from One)
Yes, for the longest time I only ran the cluster on a single node — first an n1-standard-1 node before 'upgrading' to an e2-medium. I also use preemptible nodes, so the one node that ran everything would disappear at least once a day (before being replaced by another node). All in the name of cost savings.
Needless to say, I finally got annoyed by the daily uptime alerts that the site was down (although even that was good enough for something like 99.5% uptime).
So I finally relented and decided to add a mere second node to the cluster. Although, by itself, this wasn't really enough to prevent the site from going down; it merely reduced the chances that the node that was running everything would get preempted.
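In Terraform terms, this is just the node count on that same (preemptible) node pool; something like:

```hcl
resource "google_container_node_pool" "primary_nodes" {
  # ...

  node_count = 2               # was 1

  node_config {
    machine_type = "e2-medium"
    preemptible  = true        # cheap, but can vanish at any time (24h max lifetime)
    # ...
  }
}
```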
In order to get 'true' high availability, we'd need to make some other improvements...
Anti-Affinity Deployments
And here comes a really helpful feature of Kubernetes deployments: pod anti-affinity.
Basically, 'pod affinity' is a Kubernetes feature that enables pods to be scheduled onto nodes given a certain set of rules. For example, if a pod needs access to a node with a GPU attached, then an affinity rule could specify as much.
But for our purposes, anti-affinity is more useful. These kinds of rules allow us to basically say "don't schedule pods of the same deployment on the same node". In effect, the replicas of a deployment will get evenly spread across all our nodes, as long as we have enough nodes.
So, to make sure uFincs is 'highly available' (or HA), all we need to do is up the number of replicas for each service to 2 (to match the new number of nodes), add an anti-affinity rule, and voila!
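A rough sketch of the rule on each deployment (the labels and image are illustrative); I'm showing the preferred flavour of the rule since, as described below, replicas can still end up together on one node:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                          # illustrative name
spec:
  replicas: 2                             # ideally one replica per node
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAntiAffinity:
          # "Prefer not to put two 'frontend' pods on the same node."
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: frontend
                topologyKey: kubernetes.io/hostname
      containers:
        - name: frontend
          image: gcr.io/example/frontend:latest   # illustrative image
```

(Note the IgnoredDuringExecution half of that mouthful of a field name; it's about to matter.)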
Or so I thought.
See, these anti-affinity rules currently only apply at scheduling time, not after the replicas are already running. This distinction is key to understanding why this setup doesn't actually get us to HA yet.
Let's take the example of a fresh deployment. We have 2 nodes and each service has 2 replicas, one on each node.
But what happens when a node gets preempted? Well, the node will disappear from the cluster and the replicas on it will be (forcefully) terminated. Kubernetes will then see that the replica count no longer matches the desired state and will schedule replacement replicas onto the one remaining node. After some undetermined amount of time, GCP will provision a new (second) node to fill in for the node that was just preempted.
The key thing here is that, although not strictly deterministic AFAIK, the new node gets added to the cluster after Kubernetes has already re-scheduled the replicas. And since pod anti-affinity rules only (currently) work during scheduling (as stated above), those replicas just stay there. On that one node. Forever.
And then I get an uptime alert when that one node inevitably gets preempted.
So how do we fix this? How do we get Kubernetes to 'rebalance' the replicas after scheduling?
Well, as far as I'm aware, Kubernetes doesn't have a built-in way to handle this. Supposedly, there are plans to change this so that the affinity rules can work after scheduling, but this isn't a thing yet.
So we hack around the problem. Introducing: the Descheduler.
Adding the Descheduler
The Descheduler is ultimately a ridiculously simple solution to our 'rebalancing' problem: just run a cronjob every so often that kills 'duplicate' pods (that is, pods from the same deployment running on the same node).
Yes, that's right, just kill them.
Then Kubernetes will have to reschedule the replicas and, assuming we have both nodes available, the new replicas will get put onto the second node.
Tada! Problem solved. Clean, simple, efficient.
Really, the only decision to be made here was how often to run the cronjob. Since nodes are preempted at least once every 24 hours, I figured running the job hourly should be enough to keep the cluster balanced. That is, I wouldn't expect both nodes to be preempted within one hour of each other. Is it possible? Of course, but it's not like the cost of being wrong about that is particularly high.
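Concretely, this amounts to running the descheduler's stock CronJob manifest on an hourly schedule (schedule: "0 * * * *") with a policy along these lines; the RemoveDuplicates strategy is the 'kill the duplicates' behaviour described above:

```yaml
# Policy ConfigMap that the descheduler CronJob reads its strategies from.
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      # Evict pods so that no node runs more than one pod from the same
      # ReplicaSet; Kubernetes then reschedules them, ideally onto the
      # other (emptier) node.
      "RemoveDuplicates":
        enabled: true
```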
Anyways, the cronjob seems rather lightweight to run, so increasing its run frequency isn't exactly a terrible thing to do.
Point is, with two nodes, pod anti-affinity rules, and this magical descheduler, we now have a minimal setup for high availability.
Could we go further? Of course. More nodes, more replicas, etc. But seeing as how uFincs is just starting out (and has little load on it), there's no reason to do so yet. I mean, you could (rightfully) argue that there's no reason to even be using Kubernetes at all for a service like uFincs.
But if we did want to scale to the moon, the engineer in me enjoys the fact that it wouldn't be much effort.
Well, I think that's been more than enough for one post. Stay tuned for Part 3 of the DevOps Detour!
How many parts will there be, you ask? No idea.