
Tuesday, 14 June 2022

Ditching Redhat

As I've written elsewhere, when I started my current gig in 2019, the infrastructure and IT operations turned out to be very different from what I had been told when I handed in my notice to my previous employer. The description I keep going back to is that it was a fractal horror story.

The majority of the systems were Centos, along with some Redhat boxes (pre-dating the "Enterprise" moniker) and a few Whitebox Linux machines. I'd never heard of Whitebox before I arrived here. The Whitebox Linux distribution was a free version of Redhat Linux (again, without the "Enterprise"). These dated from the last millennium.

Upgrading the systems in place was going to be a lot more work than replacing them. In the majority of cases they were so old that the repositories were no longer online. Further, the key server components were all built from tarballs, and every host had a different version of the base operating system and different builds/file layouts. As a result there was still some benefit to sticking with Redhat/Centos: by reverting to repo-based software installations (wherever possible) we would be deploying updates from a single, trusted channel.

So would it be better to replace this with something else? As I was deliberating this, IBM bought Redhat. And while there was nothing at the time to suggest that the future for Centos might be any different, this was not a good time to be committing to Centos as the strategic platform going forward. Further, I've seen the impact of trying to run Linux with the SELinux targeted policy in an Enterprise environment.

I did consider the possibility of migrating to a docker (or similar) infrastructure. But the disadvantages and risks from this massively outweighed any benefits.

My initial thought was to go for a purely rolling release model. The systems in place had not been upgraded because management
- was terrified of breaking stuff
- did not have enough people/right skills to implement upgrade cycles
While the risk of breaking things with a rolling release was probably much the same as with a staged release, the rolling release model would mean that the pain was more spread out. However there are no large scale rolling release distros geared towards enterprise environments. Is rolling-release in the enterprise simply an oxymoron?

While I had previously run a datacentre primarily on Suse, that was a long time ago. Yast is still a fantastic toolkit, but Suse seems to have become less relevant in an enterprise server role. Debian stable would have been a good fit - with the advantage of a huge range of software available from official repositories - however I didn't want to move to a platform with even less frequent upgrade cycles than RHEL.

This left Ubuntu as the next obvious choice.

In addition to upgrading the hosts, I also wanted to move the infrastructure from a 1990s dial-up ISP model to something more akin to a modern, integrated environment: grouping functionality by its technical role, with structured dependencies, instead of building the same wheels over and over again. A priority was building a proper DMZ to sit between the applications and the internet. This meant I could:

  • avoid exposing the ancient machines directly on the internet
  • centralize/automate certificate management
  • upgrade all the sites to HTTP/2 without having to replace the platforms
  • implement WAF-like security controls
  • implement useful analytics


So after some initial testing, I built out a cluster of reverse proxies using Ubuntu and nginx.

The impact was huge.

Changing to HTTP/2 resulted in page load speeds roughly doubling on every service.

Letting the lightweight proxy handle the long haul communications freed up processes (and therefore memory and CPU) on the origin servers, resulting in a massive increase in capacity. Previously the monitoring would light up like a Christmas tree as load ramped up every morning, swap files filled up and response times ballooned. The proxy layer almost completely eliminated these issues.
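
To give a sense of what the proxy layer is doing, here is a minimal sketch of a per-site nginx configuration along these lines - the hostnames, addresses and certificate paths are purely illustrative, not our real setup:

# Illustrative only - upstream address, server name and certificate paths are placeholders.
upstream app_backend {
    server 10.0.1.10:8080;
    keepalive 32;                        # reuse connections to the origin server
}

server {
    listen 443 ssl http2;                # terminate TLS and speak HTTP/2 to clients
    server_name app.example.com;

    ssl_certificate     /etc/ssl/certs/app.example.com.pem;
    ssl_certificate_key /etc/ssl/private/app.example.com.key;

    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;          # needed for upstream keepalive
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        proxy_buffering on;              # the proxy absorbs slow clients so the
                                         # origin can release its worker quickly
    }
}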

I expect I would have seen the same benefits regardless of which operating system/distribution I had chosen, but Ubuntu has proved to be fast, reliable and very low effort.

As the modernization program has progressed, every investment in re-platforming/upgrading has paid back many times over. I've only run into two issues which couldn't be solved on the path I'd planned.

The first was Solr. The version in the Ubuntu repos (inherited from Debian) is old and very badly organized. For this the tarball package proved to be a much better choice.
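
For what it's worth, the tarball route is roughly what the upstream Solr documentation describes - something along these lines (the version number here is just an example, not what we actually run):

# Hedged sketch of a tarball-based Solr install; the tarball ships its own service installer.
tar xzf solr-8.11.2.tgz solr-8.11.2/bin/install_solr_service.sh --strip-components=2
sudo bash ./install_solr_service.sh solr-8.11.2.tgz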

The second was FreeIPA. As part of the modernization we needed to replace the old OpenLDAP installation. Moving to FreeIPA provided an integrated solution which could easily support additional features (notably sudo). However the version available from the repo at that time was not very current or stable. After trying various options I went with Alma Linux on the hosts for the FreeIPA service.
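
The install itself is fairly painless once the right platform is underneath it. A hedged sketch of the sort of commands involved on Alma, going by the upstream IdM documentation - the realm and domain are placeholders, and a real deployment needs its own DNS and CA decisions:

# Illustrative only - module/package names as per the upstream IdM documentation.
dnf module enable -y idm:DL1
dnf install -y ipa-server ipa-server-dns
ipa-server-install --realm EXAMPLE.INTERNAL --domain example.internal --setup-dns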

I guess I'm digressing from where I started into talking about architecture.

While these changes have demonstrated their value, the choices here were, in many cases, the exact opposite of the obvious solution:

- Solved a work overload by taking on more work
- Solved an inability to patch by maximizing the frequency at which patches were available/applied
- Simplified the management of the infrastructure by introducing more components/complexity



Friday, 10 December 2021

CVE-2021-44228 log4j RCE mitigation

 "This seems to be generating some buzz" - a passing comment in $WORK's chat app - promoted me to go look at this in a bit more detail. As a systems admin, I generally let the devs guy worry about the health of the applications while I deal with the infrastructure, but this one is bad. Real bad. Like Corona virus for Java application servers. It even came from China (but kudos to the AliBaba guys for letting everyone know - this could have gone very differently).

I've never been able to work on Java developer timescales - and I didn't think this vulnerability would let me. So...

Fail2ban

I've got a small cluster of proxies fronting the web and application servers. These have fail2ban running, which does a good job of keeping the script-kiddies out (really - I needed to put in a bypass for the company we subcontract the pen-testing to). So first off was a fail2ban rule:

# filter definition, e.g. /etc/fail2ban/filter.d/log4j.conf
[Definition]

# ban any client whose request line contains the start of a JNDI LDAP lookup
failregex = ^<HOST>.*\"\${jndi:ldap://
ignoreregex =
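
The filter on its own does nothing until a jail references it - something along these lines in jail.local (the jail name, log path and timings here are just illustrative):

[log4j]
enabled  = true
port     = http,https
filter   = log4j
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400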


But fail2ban reads the log files to get its input, and the log files don't get written until the request has been processed - so it won't catch the first hit from a given source.

Containment

The exploit works by retrieving a malware payload from an LDAP server. So the next step I took was to add firewall rules preventing our application servers from connecting to ports 389 and 636 anywhere other than our whitelisted internal LDAP servers.
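
If done on the hosts themselves, that translates to something like the following (a sketch in iptables syntax - the addresses are placeholders for the internal directory servers):

# Allow LDAP/LDAPS out to the whitelisted internal servers only...
iptables -A OUTPUT -p tcp -d 10.0.0.20 -m multiport --dports 389,636 -j ACCEPT
iptables -A OUTPUT -p tcp -d 10.0.0.21 -m multiport --dports 389,636 -j ACCEPT
# ...and reject any other outbound LDAP/LDAPS connection attempt.
iptables -A OUTPUT -p tcp -m multiport --dports 389,636 -j REJECT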

Of course that's only going to help when the attacker is using an LDAP server running on the default ports. But it was worth doing. We were already getting attempts to exploit our servers, but they were crude / badly targeted - until 14 minutes after I rolled out the firewall change, when we got hit by a request which would have triggered a successful exploit.

Prevention

The best mitigation (apart from applying the patch) is to set the "log4j2.formatMsgNoLookups=true" option (hint for the non-Java people out there - add this on the Java command line prefixed with "-D"). However, according to the documentation I could find, this only works on some versions of log4j; it is far from clear just now whether those versions are a subset or a superset of the versions which are vulnerable to the exploit, and I did not have time to go and find out.
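
For reference, on the JVM command line that ends up looking something like this (the jar name is obviously just a placeholder):

java -Dlog4j2.formatMsgNoLookups=true -jar yourapp.jar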
 
It seems obvious now, but there is a better way of protecting the systems. The proxy cluster uses nginx, so I added this to the config:

# Reject requests carrying a JNDI lookup string in headers we have seen abused.
# (This only catches the unobfuscated form of the exploit string.)
if ($http_user_agent ~* "\{jndi:") {
        return 400;
}
if ($http_x_api_version ~* "\{jndi:") {
        return 400;
}

(note that the second statement may have a functional impact).

I don't know if I've covered the entire attack surface with this, but now I get to go to bed and our servers live for another day.