uFincs Update #8

DevOps Detour - Part 3

May 09, 2021

It's time for round 3 of the DevOps Detour! The only matter that concerned my existence several weeks ago and that you're only learning about through these chopped-up blog posts!

Last Time

The last part was all about cluster upgrades and improvements. And guess what? We got more of the same today!

Only this time, we're gonna tackle upgrading some in-cluster services that have been out-of-date for far too long. Get ready for some hot, steamy, debugging action!

Upgrading cert-manager

Oh cert-manager, how I loathe thee.

Or at least, I did before realizing how easy the upgrade would be.

See, just like everything else in the cluster up to this point, it was kind of a "set it and forget it" affair. Couple of years back, cert-manager was still fairly new on the block. But since it was replacing a similar but deprecated project (kube-lego, I believe), I figured I just had to bet that it would be 'the' solution for.. cert management in Kubernetes.

Thank god I was right, cause I really did not want to deal with setting up automated TLS certs more than I had to.

In the beginning, most of the pain of cert-manager came from its infuriatingly vague error/log messages. This, combined with having to wait X hours for DNS changes to 'propagate', made debugging and getting the first set of TLS certs issued a royal pain in the ass.

And then there was that time that cert-manager had to be upgraded because of a bug that was causing it to ping Let's Encrypt too often...

And then there was the time they introduced the webhook service and it broke cert-manager in GKE unless you opened a very specific port to the control plane...

And then there was that time where deleting the cert-manager namespace would just never finish and be stuck in limbo...

Yeah, it wasn't fun in the early days.

Eventually, things (mostly) stabilized when it hit v0.12. There was still the odd hiccup where cert-manager would just fail to renew a cert (one time it was because the GCP service account expired... another time it was seemingly because of a cosmic bit flip...), but it (mostly) worked.

So ever since I read the news last year that they had finally released a v1.0, I have dreaded going through with the upgrade process.

Well, with all these other upgrades and improvements I was making, I figured I might as well bite the bullet and upgrade to what (I could only hope) was an actually stable version of cert-manager.

…

My god. It went splendidly. Damn near perfectly, really (and for those who know me, no, that was not sarcasm).

First of all, cert-manager has excellently documented all of the upgrade steps needed to move between versions. For example, here are the instructions for v0.12 to v0.13. Rather empty, which is a good thing!

In fact, moving all the way from v0.12 to the latest v1.3 (a jump that included 8 separate releases!), there was only one upgrade step I needed to do: from v0.13 to v0.14, you just needed to delete the deployments before applying the new ones.

So I literally just deleted the cert-manager deployments and applied the latest v1.3 manifests.

Worked without any other changes.

Honestly, I was flabbergasted. I literally set aside like 6 hours to do the upgrade (expecting everything to go wrong), but nothing did. Certs still worked, got renewed, no errors. Just, magic.

So... huge props to cert-manager for finally getting things sorted out!

Well, except for one thing...

CA Injector Errors

cert-manager runs three different deployments by default: the main cert-manager process that handles issuing and renewing certs, the webhook process that ensures the manifests we deploy are configured correctly, and the cainjector process that... presumably handles injecting a.. CA.

It doesn't really matter what the cainjector was doing; the point is that it was throwing a whole heck of a lot of errors. Like, hundreds of errors an hour.

Specifically, errors with the message "unable to fetch certificate that owns the secret" and the detail "Certificate.cert-manager.io \"[redacted]\" not found" (GitHub issue over here).

Basically, what seemed to be happening is that the CA injector would try to.. inject the CA into secrets that it couldn't find the corresponding Certificate resource for. Now, why wouldn't it be able to find the Certificate? Well, I'm not quite sure, but I have a good hunch...

See, as totally awesome as cert-manager is, it had (and still has!) this weird idiosyncrasy: when it issues a cert, it stores that cert in a Kubernetes Secret resource. In order to make use of this cert, we need to give it to our Ingress resources so that ingress-nginx knows which cert to use for TLS.

However, because of our per-branch deployment scheme, our Ingress resources are spread across many different namespaces. But the cert Secret is only in the cert-manager namespace... and the Ingress can only read cert secrets from its own namespace...

So how can we use the cert? Well, based on my understanding at the time (which seems to still hold today), you're supposed to just copy the secret between namespaces.

A little brute force, but hey it works. Especially when you throw in the ingress-cert-reflector to handle automatically copying the secret to every namespace.

This worked fine for quite a while. Until I finally decided to check the logs and noticed the cainjector complaining about the copied secret in every namespace it had been copied to. Presumably because none of those namespaces had a corresponding Certificate resource.

But of course, since my HTTPS was working just fine, I didn't really care all that much, so I just let it be.

But when I finally upgraded cert-manager and found that the cainjector was still throwing these errors, I figured I might as well try and fix it properly.

And that's when I finally learned that ingress-nginx has a --default-ssl-certificate option.

I have no idea if this option existed back when I first set everything up, but if it did, I regret not stumbling onto it sooner.

As the name suggests, it allows you to specify a default SSL cert by pointing at a secret in any namespace. WOW. How crazy is that!?!

So bam, just like that, I no longer need the ingress-cert-reflector to copy my cert secrets around. Just specify a default and we're all good.

There's just one problem... you know how everything up till now has been "one thing leads to another"? Well, this also led to another thing.

I figured, heck, while I'm over here re-configuring ingress-nginx and upgrading cert-manager, why don't I also upgrade ingress-nginx?!?

Upgrading ingress-nginx

You remember those 6 hours I had allocated for the cert-manager upgrade? This is where they went.

At the time, I was running ingress-nginx v0.17.1…

The latest version was v0.45.0. Only, like, a couple dozen versions to jump up. No biggie, right? If cert-manager went smoothly, surely so would ingress-nginx?

Nope, no it would not. I mean, it could have been worse (what can't be?), but this upgrade wasn't fun.

First of all, the upgrade docs were much worse than cert-manager's:

To upgrade your ingress-nginx installation, it should be enough to change the version of the image in the controller Deployment.

"Should be enough", yeah right!

For starters, at some point in the past, ingress-nginx moved their Docker images from Quay to GCR. So... I literally can't just change the version, because I'm on an old enough version that I used the Quay image, and they don't publish the latest images to Quay anymore! Strike #1.

Secondly, since I use the static manifests, I could see for myself that there are fairly significant changes between my (old) version's manifests and the latest version's. So... changing 'just' the image? Not quite. Strike #2.

Then again, it's hard to blame them for not having upgrade docs for such an old version. It's more commendable that cert-manager does.

Anyways, I figured the easiest way to upgrade is to just grab the latest manifests and YOLO it. Obviously, I ported over whatever config changes I had made, but other than that, I literally just YOLO'd it.

And.... it worked! Or at least, it seemed to work. Containers weren't throwing any errors. Site was coming up just fine.

Except then I tested this one thing... I navigated directly to app.ufincs.com (the subdomain for the app itself, vs ufincs.com which is for the marketing site) and instead of being redirected to the login page, I found something strange: I was presented with the home page of ufincs.com instead.

Uh oh.

Then I tried navigating to the login page via the Login link on this strange home page. Worked.

But then I tried navigating directly to the login page via ufincs.com/login: infinite request loop.

Oh shit.

First thing I did was roll back the ingress-nginx upgrade to make sure this wasn't somehow already how it worked. It shouldn't have been, but since the marketing site does use redirects to get the user to the app, I figured it's possible that I had somehow broken it earlier.

But it was fine with the old ingress-nginx. So it's definitely the new version's fault.

But what could it be? What could have changed between versions of Nginx to cause this strange behaviour?

My first thought? Caching.

See, we make use of Nginx to cache the static assets that are served from the Express servers of the marketing/app services. Nginx is way faster at serving that content than forcing the Express servers to do it all (and I didn't want to bother setting up a proper CDN).

So somehow, the root route for the marketing site was being cached and served for the app service. And then... something was being cached incorrectly to cause the infinite login request loop.

After some curl debugging to check the cache hit/miss status of various pages, flushing the Nginx cache, trying to inspect the Nginx cache, and finally just setting up separate caches for the marketing and app services, I deduced that it was indeed a caching issue. The only question was why.

My assumption was that, somehow, the key used to index cached requests had changed between Nginx versions. Digging into the proxy_cache settings, I found the default value for the proxy_cache_key:

$scheme$proxy_host$request_uri

Well, the $scheme should be the same between requests; it's just HTTPS. The $request_uri would also be the same since it's just the root route /.

So, by logical deduction, $proxy_host had to have changed so that it was now somehow the same for every service, causing root requests to different services to overlap.

Well, according to the Nginx docs, $proxy_host is defined as the "name and port of a proxied server as specified in the proxy_pass directive".

And how is proxy_pass defined? Well, the only way to find out would be to check the Ingress controller's generated nginx.conf. A quick kubectl exec -it <nginx pod> -- cat /etc/nginx/nginx.conf later, and look what we've got here! I'm not going to post the whole config (cause it's huge), but guess what? Every instance of proxy_pass is the same!

They're all defined as:

proxy_pass http://upstream_balancer;

Well, what the heck is upstream_balancer? Searching through the rest of the config file, we find this:

upstream upstream_balancer {
    ### Attention!!!
    #
    # We no longer create "upstream" section for every backend.
    # Backends are handled dynamically using Lua. If you would like to debug
    # and see what backends ingress-nginx has in its memory you can
    # install our kubectl plugin https://kubernetes.github.io/ingress-nginx/kubectl-plugin.
    # Once you have the plugin you can use "kubectl ingress-nginx backends" command to
    # inspect current backends.
    #
    ###
    
    server 0.0.0.1; # placeholder
    
    balancer_by_lua_block {
        balancer.balance()
    }
    
    keepalive 320;
    
    keepalive_timeout  60s;
    keepalive_requests 10000;  
}

Because all of the proxy_pass values are the same, the $proxy_host value is the same for all requests. Which means our cache keys are only unique per route, rather than per host + route.
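To make the collision concrete, here's a rough Python sketch of how that default key gets built. This is not ingress-nginx code, just string concatenation following the $scheme$proxy_host$request_uri template. The helper names are made up for illustration, and the per-backend upstream names in the second half are hypothetical stand-ins for whatever the old version generated. The only other Nginx detail it leans on is that cache file names are the MD5 of the cache key.

```python
import hashlib


def default_cache_key(scheme: str, proxy_host: str, request_uri: str) -> str:
    """Mimic the default proxy_cache_key template: $scheme$proxy_host$request_uri."""
    return f"{scheme}{proxy_host}{request_uri}"


def cache_file_name(key: str) -> str:
    """Nginx derives the cache file name from the MD5 hash of the cache key."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# New ingress-nginx: every proxy_pass points at http://upstream_balancer,
# so $proxy_host is identical no matter which site the request was for.
marketing_root = default_cache_key("https", "upstream_balancer", "/")
app_root = default_cache_key("https", "upstream_balancer", "/")

print(marketing_root)              # httpsupstream_balancer/
print(marketing_root == app_root)  # True: both sites share one cache entry
print(cache_file_name(marketing_root) == cache_file_name(app_root))  # True

# Assumption: with one upstream per backend (hypothetical names below),
# $proxy_host differs per service and the keys stay apart.
old_marketing_root = default_cache_key("https", "prod-marketing-80", "/")
old_app_root = default_cache_key("https", "prod-app-80", "/")

print(old_marketing_root == old_app_root)  # False
```

So whichever service answers the first request for / ends up owning that cache entry, which lines up with app.ufincs.com getting the marketing home page.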

Looks like we found our culprit!

But now, how do we fix it? Presumably, we need to replace $proxy_host with something else, since the $scheme and $request_uri should be fine.

Thankfully, the docs for the proxy_cache_key directive list the following as an example:

proxy_cache_key "$host$request_uri $cookie_user";

This would be for adding cookies to the cache key. But what's more important is the use of the $host variable. According to the docs, it should take the hostname from the request itself. Which should suit our needs just fine!

And yep, changing the proxy_cache_key to:

proxy_cache_key "$scheme$host$request_uri";

Seems to fix the issues. No more infinite request loops, no more wrong home pages. Success!
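For anyone who wants to see why the swap works, here's a toy Python model of the cache (a sketch only, nothing like how Nginx actually stores things). The serve function, the dict-based cache, and the "page from ..." bodies are all invented for illustration; the two key functions just mirror the old and new proxy_cache_key templates.

```python
from typing import Callable, Dict

Request = Dict[str, str]


def old_key(req: Request) -> str:
    # $scheme$proxy_host$request_uri, with $proxy_host stuck at one value
    return f"{req['scheme']}upstream_balancer{req['request_uri']}"


def new_key(req: Request) -> str:
    # $scheme$host$request_uri, where $host comes from the request itself
    return f"{req['scheme']}{req['host']}{req['request_uri']}"


def serve(cache: Dict[str, str], key_fn: Callable[[Request], str], req: Request) -> str:
    """Return the cached body for this key, 'proxying' to the backend on a miss."""
    key = key_fn(req)
    if key not in cache:
        # Stand-in for the real backend response
        cache[key] = f"page from {req['host']}"
    return cache[key]


marketing = {"scheme": "https", "host": "ufincs.com", "request_uri": "/"}
app = {"scheme": "https", "host": "app.ufincs.com", "request_uri": "/"}

# Old key: the marketing site warms the cache, then the app gets its page.
old_cache: Dict[str, str] = {}
serve(old_cache, old_key, marketing)
print(serve(old_cache, old_key, app))  # page from ufincs.com

# New key: each host gets its own entry for the same route.
new_cache: Dict[str, str] = {}
serve(new_cache, new_key, marketing)
print(serve(new_cache, new_key, app))  # page from app.ufincs.com
```

Same requests, same order; the only thing that changed is what goes into the key.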

And that's how my 6-hour cert-manager upgrade turned into a 6-hour ingress-nginx upgrade.

What, did you think this was the last of the DevOps Detour? Oh no, we've still got a lot more to go. From Docker image caching problems to improving our monitoring suite, there's probably another 2 or 3 parts to go.

Now whether or not I get the ambition to write them all is a different matter... So, stay tuned for potentially more rounds of the DevOps Detour!

On Matters Concerning my Existence
