Sunday 25 September 2022

Why I'm (mostly) not using docker

I'm somewhat cautious of docker. Rather than reposting the same stuff on Reddit, I thought it would be quicker to list the reasons here and then just post the URL when it comes up.

I'm running a few hundred LXCs at $WORK. It's a really cheap way to provide a computing environment. And it works. But I'm more cautious about docker. Docker is not supported as a native container provider on Proxmox - which is where most of my VMs and LXCs now live - but that really has very little bearing on my concerns. I do have VMs running docker - more on that later.

The first problem is that it's designed for running appliances. Some software fits very well into this model - but such software is usually an edge case. For databases I do not want lots of layers of abstraction between the run time and the storage. For routers/firewalls I want the interfaces to be under direct control of the host. For application and webservers I want to be able to interrogate memory and cpu usage on a per-process basis. Working on docker containers feels like key-hole surgery. It might be very hi-tech but it's awkward and limiting. Conversely, I can have a (nearly) fully functional lxc host with very little overhead.

For a lot of people out there, the idea that you can just click a couple of links and have a service available for use sounds great. And it is. I've downloaded stuff from docker hub to try out myself. But I wouldn't run it in production. The stuff I do run in production has a well defined provenance - it has either come from the official debian/ubuntu repos or from the people who wrote the software. In the case of the latter, there are processes in place to check if the software needs updated. Conversely a docker container is built up of multiple layers, sourced from different teams/developers, most of whom are repackaging software written by someone else. In addition to the issue of sourcing software securely, the layers of packagers may also add capabilities to the container. It really might not be as isolated from the host as you think.

This lack of accountability is a growing concern - indeed Chainguard have released a Linux distribution specifically to address the problem. Will it solve these problems? It's too early to tell.

So really the only sensible way to use docker in an enterprise environment is to build the images yourself. That demands additional work and a high level of skill in another technology just to get the same result.

BTW - the docker images I've used to trial software and decided to take into production have been implemented as conventional installs on LXCs or VMs.

Tuesday 14 June 2022

Ditching Redhat

As I've written elsewhere, when I started my current gig in 2019, the infrastructure and IT Operations turned out to be very different than what I had been told when I handed in my notice to my previous employer. The description I keep going back to is that it was a fractal horror story.

The majority of the systems were Centos, along with some Redhat boxes (pre-dating the "Enterprise" moniker) and a few Whitebox Linux machines. I'd never heard of Whitebox before I arrived here. The Whitebox Linux distribution was a free version of Redhat Linux (again without the "Enterprise"). These dated from the last millennium.

Upgrading the systems in place was going to be a lot more work than replacing them. In the majority of cases they were so old that the repositories were no longer online. Further, the key server components were all built from tarballs. And every host had a different version of the base operating system, and different builds/file layouts. As a result there was no particular benefit to sticking with Redhat/Centos. By reverting to repo based software installations (wherever possible) we would be deploying updates from a single, trusted channel.

So would it be better to replace this with something else? As I was deliberating this, IBM bought over Redhat. And while there was nothing to suggest that the future for Centos might be any different, this was not a good time to be committing to Centos as the strategic platform going forward. Further, I've seen the impact of trying to run Linux with the SELinux targeted policy in an Enterprise environment.

I did consider the possibility of migrating to a docker (or similar) infrastructure. But the disadvantages and risks from this massively outweighed any benefits.

My initial thought was to go for a purely rolling release model. The systems in place had not been upgraded because management
- was terrified of breaking stuff
- did not have enough people/right skills to implement upgrade cycles
While the level of risk of breaking things with a rolling release compared to a staged release was probably the same, the rolling release model would mean that pain would be more spread out. However there are no large scale rolling release distros geared towards enterprise environments. Is rolling-release in the enterprise simply an oxymoron?

While I had previously run a datacentre primarily on Suse, that was a long time ago. Yast is still a fantastic toolkit but Suse seems to have become less relevant in an enterprise server role. Debian stable would have been a good fit, with the advantage of a huge range of software available from official repositories. However I didn't want to move to a platform with even less frequent upgrade cycles than RHEL.

This left Ubuntu as the next obvious choice.

In addition to upgrading the hosts, I also wanted to move the infrastructure from a 1990's dial up ISP to something more akin to a modern, integrated environment; grouping functionality by its technical role with structured dependencies instead of building the same wheels over and over again. A priority was building a proper DMZ to sit between the applications and the internet. This meant I could: 

  • avoid exposing the ancient machines directly on the internet
  • centralize/automate certificate management
  • upgrade all the sites to HTTP/2 without having to replace the platforms
  • implement WAF-like security controls
  • implement useful analytics


So after some initial testing, I built out a cluster of reverse proxies using Ubuntu and nginx.

The impact was huge.

Changing to HTTP/2 resulted in page load speeds roughly doubling on every service.

Letting the lightweight proxy handle the long haul communications freed up processes (and therefore memory and CPU) on the origin servers resulting in a massive increase in capacity. Previously the monitoring would light up like a Christmas tree as load ramped up every morning, swap files filled up and response times soared. This almost completely eliminated the issues.
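To put rough numbers on why that frees up origin capacity, here is a back-of-envelope sketch using Little's law (average workers busy = request rate x time each worker is held). The request rate and hold times are invented for illustration, not measurements from these systems, and `workers_needed` is a hypothetical helper.

```python
import math

def workers_needed(requests_per_second: int, held_ms: int) -> int:
    """Little's law: average busy workers = arrival rate x time each is held."""
    return math.ceil(requests_per_second * held_ms / 1000)

# Hypothetical figures, for illustration only.
# Without a proxy, an origin process is tied up until a slow client has
# received the whole response over the long haul link.
direct = workers_needed(requests_per_second=50, held_ms=2000)

# Behind a buffering proxy, the origin hands the response over on the
# local network and is free again almost immediately.
proxied = workers_needed(requests_per_second=50, held_ms=100)

print(direct, proxied)
```

With these made-up inputs the origin needs 100 busy processes when serving clients directly and 5 behind the proxy, which is the kind of gap that stops swap from filling up.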

I expect I would have seen these same benefits regardless of which Operating System/distribution I had chosen, but Ubuntu has proved to be fast, reliable and very low effort.

As the modernization program has progressed, every investment in re-platforming/upgrading has paid back multi-fold. I've only run into two issues which couldn't be solved on the path I'd planned.

The first was Solr. The version in the Ubuntu repos (inherited from Debian) is old and very badly organized. For this the tarball package proved to be a much better choice.

The second was FreeIPA. As part of the modernization we needed to replace the old OpenLDAP installation. Moving to FreeIPA provided an integrated solution which could easily support additional features (notably sudo). However the version available from repo at that time was not very current / stable. After trying various options I went with Alma Linux on the hosts for the FreeIPA service.

I guess I'm digressing from where I started to talking about architecture.

While these changes have demonstrated their value, the choices here were, in many cases, the exact opposite of the obvious solution:

- Solved a work overload by taking on more work
- Solved an inability to patch by maximizing the frequency at which patches were available/applied
- Simplified the management of the infrastructure by introducing more components/complexity



Tuesday 12 April 2022

Password Manager 2

Having previously decided to try out Syspass, I must say I'm disappointed.

In terms of the broad design it gets a lot of things right. But the implementation is particularly poor and buggy. It is built as a single page application and if you accidentally hit the back button or close your window then it's rather painful to get back to your session (at least as something you can interact with). Operations will randomly fail then succeed when re-invoked. The permissions/access model around the API makes it unsuitable for integration with clients in most cases. And the browser plugin would not work at all for me.

I'm still using it just now - it's better than the spreadsheet it replaced. And I've gone to the trouble of writing scripts to verify the passwords and export the data to Keepass.

I was excited to learn of VaultWarden - an open source implementation of BitWarden. The current version would not compile on Ubuntu 20.04LTS (it required a newer version of Rust) so I tried out the docker version. The software has no support for user groups which would make policy management an enormous job.

Why is this so hard, people?


Tuesday 25 January 2022

Proxmox Backup Server Evaluation

I'm already running Proxmox Virtualization Environments (3 x PVE clusters) at $WORK. Currently these just dump backups to NFS, but I am evaluating more sophisticated options in advance of replacing the existing Simplivity clusters with Proxmox. Proxmox Backup Server has (not surprisingly) very good integration with PVE and is hence an obvious choice.

The first (actually second - the first was a prototype and got deleted) PVE cluster provides our Dev and Test environments, so for the evaluation I just created a VM there. When I get around to implementing the service properly it will be on separate hardware.

Performance

Currently the daily backups on the dev/test cluster are getting close to overrunning their out-of-hours time slot - so a faster backup is critical for the next environment. My test cases were a couple of recent LXC containers and a very old Centos 5 VM (on the basis that if it works with that it should work for anything!).
There was little difference in backup times between NFS with LZO compression, NFS with ZSTD compression and PBS (although more than the graph suggests - that's a log scale) until I ran a second backup of the VM using PBS. Backup time for the 55GB VM dropped from 15 minutes to 3 seconds! For that reason alone, PBS looks the winner here. Sadly, the LXCs did not exhibit the same speed up - apparently this is an architecture thing due to a VM's ability to track dirty pages.

Data size

In terms of the data footprint there was, again, not much difference between the methods until I got to the second round of backups via PBS. The de-duplication offers a huge gain for VMs and for LXCs. I expect this reduction in data volumes to carry over to the replication when I look at offsite replication.
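As a rough illustration of why a second backup adds so little data, here is a toy content-addressed chunk store. It is only a sketch of the general de-duplication idea, not how PBS is implemented: the tiny chunk size and the `backup`/`restore` helpers are made up for the example.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; absurdly small so the example is easy to follow

def backup(data: bytes, store: dict):
    """Split data into fixed-size chunks and store each unseen chunk once.

    Returns (index, new_chunks): index is the ordered list of chunk digests
    needed to rebuild the data, new_chunks counts chunks actually written.
    """
    index = []
    new_chunks = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:      # only previously unseen content is stored
            store[digest] = chunk
            new_chunks += 1
        index.append(digest)
    return index, new_chunks

def restore(index, store) -> bytes:
    """Rebuild the original data from its list of chunk digests."""
    return b"".join(store[digest] for digest in index)

store = {}
first = b"AAAABBBBCCCCDDDD"
second = b"AAAABBBBXXXXDDDD"  # one chunk changed since the first backup

index1, new1 = backup(first, store)
index2, new2 = backup(second, store)
print(new1, new2)
```

The first run writes four chunks and the second writes only the one that changed, yet both backups can still be restored in full from the shared store.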

I am somewhat wary of de-duplication having fallen foul of its implementation in Simplivity a couple of times. However on Simplivity, the virtual disk is included in the same de-duplication dataset as the backups - so one bad block not only takes out your VM but all your backups too. With the PVE + PBS model, the virtual disk is separate from the backups.

Utility

The ability to mount a PBS backup and extract a single file or directory tree is really useful.

What's not to like...

I've not fully grokked the backup pruning but the facility is an elegant solution to the problem of maintaining older backups. 
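For anyone else trying to get their head around pruning, here is a simplified model of how "keep the last N" and "keep one per day for N days" style options can combine. It is based on my reading of the documented behaviour rather than on the PBS code: `select_kept` is a hypothetical function, and the handling of days already covered by a newer kept snapshot is an assumption.

```python
from datetime import datetime

def select_kept(snapshots, keep_last=0, keep_daily=0):
    """Return the snapshot times to keep, oldest first (simplified model)."""
    ordered = sorted(snapshots, reverse=True)  # newest first
    kept = ordered[:keep_last]                 # keep-last: the N newest snapshots
    covered_days = {ts.date() for ts in kept}  # days already represented (assumption)
    daily_days = set()
    for ts in ordered[keep_last:]:
        day = ts.date()
        if day in covered_days or day in daily_days:
            continue  # a newer snapshot from this day is already kept
        if len(daily_days) >= keep_daily:
            break     # daily quota used up; everything older is pruned
        daily_days.add(day)
        kept.append(ts)
    return sorted(kept)

# Invented example: two backups a day for part of a week.
snaps = [
    datetime(2022, 1, 1, 2),
    datetime(2022, 1, 2, 2), datetime(2022, 1, 2, 14),
    datetime(2022, 1, 3, 2), datetime(2022, 1, 3, 14),
]
kept = select_kept(snaps, keep_last=1, keep_daily=2)
for ts in kept:
    print(ts.isoformat())
```

In this model the newest snapshot is kept outright, and then the latest snapshot from each of the next two uncovered days, so five snapshots are reduced to three.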

Although I've done 2 rounds of backups on PBS, I can currently only see one version in the user interfaces of both PVE and PBS. 

Like local PVE backups, the naming is based on the VM/CT ID rather than its given name. If you run an environment like mine that means you need to maintain a lookup table (I've already got this scripted against the API).
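A lookup table of that kind can be sketched in a few lines. This is not the script mentioned above, just a minimal illustration: it assumes the resource list has already been fetched and parsed, that each record carries `type`, `vmid` and `name` fields (my understanding of what the PVE cluster resources endpoint returns), and the sample records are invented.

```python
def build_lookup(resources):
    """Map VM/CT ID -> (name, type) from a list of resource records."""
    table = {}
    for item in resources:
        if item.get("type") not in ("qemu", "lxc"):
            continue  # ignore storage, nodes and anything else in the list
        table[int(item["vmid"])] = (item.get("name", "?"), item["type"])
    return table

# Invented sample data in the assumed shape.
sample = [
    {"type": "qemu", "vmid": 101, "name": "legacy-centos5", "node": "pve1"},
    {"type": "lxc", "vmid": 205, "name": "web-proxy", "node": "pve2"},
    {"type": "storage", "storage": "local", "node": "pve1"},
]

lookup = build_lookup(sample)
for vmid, (name, kind) in sorted(lookup.items()):
    print(vmid, kind, name)
```

Keeping the fetch separate from the table building means the same function works whether the records come from the API or from a saved JSON dump.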

The lack of a speed up for LXCs is doubly disappointing - unlike VMs, the LXCs need to be offline for a large part of the backup, but I'm happy with the trade-off of using mostly VMs in the new environment. In my research, I also came across a post by Stefano Marinelli who is using Borg backup for his LXCs with some success. However I'd prefer not to start down the road of providing different solutions to what should be the same problem - particularly when it's about provisioning something I hope not to depend upon!
 

Update

Although the first backup was visible in the PBS after it was run, on subsequent runs it was not: after the third run, two versions were visible, after the fourth, three....

Bit of a strange one, but effectively resolved.