Undertaking GOV.UK’s largest software infrastructure project

By admin
GOV.UK is an essential part of living, studying and working in the UK. For its millions of users, GOV.UK appears as a single website, where people can move seamlessly from one page or service to another.

However, behind the scenes, GOV.UK is made up of many applications that work together to produce the public-facing website. Between these applications and the computer hardware that runs them, there is system software that acts as a go-between. We sometimes call this system software the "platform" on which GOV.UK runs.

With a system software end-of-life deadline approaching, we took the opportunity to evaluate what platform was best for GOV.UK and its needs.

This was GOV.UK’s largest software infrastructure project since its launch. This blog post describes how we decided what to do, how we did it and ultimately how we made GOV.UK more secure, cheaper to run and easier to scale. By doing this, we directly contributed to GDS’s mission “to make digital government simpler, clearer and faster for everyone”.

Why we decided to modernise

The software that runs the GOV.UK website makes use of other software "underneath" it, such as an operating system (OS) and some configuration management software to manage and keep track of changes. From 2014 to 2023, our OS was Ubuntu, and we managed our configuration with Puppet.

By 2021, this infrastructure software was approaching the end of its supportable lifespan. Though we still have commercial support in place, running an old operating system meant we were spending ever more engineering effort working around compatibility issues whenever we needed to update other software that GOV.UK depends on, such as Ruby-on-Rails.

We considered whether to upgrade to recent versions of Ubuntu and Puppet (running on virtual machines), or to take the opportunity to modernise by running GOV.UK in containers.

For GOV.UK, the main advantages of containers are scalability and lower maintenance. We were using complex automation to deploy many different applications onto the same set of virtual machines, which made it difficult to add more machines ("scaling up") in response to increased website traffic. We were also spending a lot of engineering time updating the software on these virtual machines.

Moving to containers means that when we need to respond to surges of traffic, we can add capacity easily. During our work on COVID-19, we saw how important it was to be able to withstand these traffic spikes.

Upgrading the existing infrastructure would have solved our short-term problem, but would have represented a much smaller return on investment and taken at least as much effort as containerisation.

How the team worked

The team of engineers had varying degrees of experience working on projects like this, and had to get up to speed with a new technology to progress with the project. To establish healthy ways of working, we, as team leads:

assigned workstream leads that would be responsible for prioritising their workstream backlog and having kick-off sessions with the entire team to help spread knowledge – this allowed people to ask questions about the approach and reduce any assumptions

implemented a pairing by default way of working to make sure engineers could learn from each other

protected the team from any time-intensive activities that weren’t directly related to the delivery of the work

adapted our regular updates as needed, by holding them less frequently and making better use of "asynchronous" communication such as Slack, which meant there was more flexibility in managing workloads

Testing ahead of go-live

Giving engineers responsibility for managing and prioritising their own stream of work, and protecting them from other distractions, allowed them the time and space to try things out. It also helped build their confidence to work autonomously.


One of our engineers suggested a bold idea which would prove a catalyst for the project’s success: what if we were to perform a trial-run on the real, live website? At first this sounds like an unnecessary risk: why not wait until the whole system is fully working in a test environment?

By limiting the scope of our trial-run to just those components of GOV.UK that serve web pages to the public (as distinct from the other parts that help people update the content of the site), we were able to:

focus on a subset of the problem and a nearer-term goal providing an important milestone ahead of go-live

build confidence amongst our management stakeholders

discover some minor issues with the website that may have otherwise interfered with our eventual launch

During our trial runs, we solved problems as a team by mobbing on them. Mob programming further helped build trust and confidence for individuals. Working through minor issues together in a safe environment increased people’s ability to experiment with different solutions to problems.

Going live

Once the team was fully confident everything was in place, we selected a week to switch over when there weren’t any big planned government announcements. We reviewed our run book and roll back plans to ensure they were accurate and up to date. We made a list of useful contacts such as technical support and escalation contacts. We informed publishers and other stakeholders, allowing them time to ask questions.

On the day, we mobbed on the go-live activities and kept a log of status updates. We kept our stakeholders updated regularly.

The launch went smoothly and the public-facing website continued working without any visible change, as we were hoping.

Impact

We’ve seen a positive impact on the organisation and our users, for example:

we can now scale GOV.UK more quickly in response to traffic surges

the new platform is cheaper to run

it’s much quicker and easier to roll out security patches

it’s easier for us to make changes to applications and configuration

we’ve eliminated a lot of repetitive maintenance work for GOV.UK’s developers, so that they have more time to spend on valuable features and enhancements

running modern software means we can recruit engineers from a larger pool of talent

What’s next

In the GOV.UK Platform Engineering team, we’re working hard to get rid of a few behind-the-scenes odds and ends that are still running on the old Ubuntu/Puppet infrastructure so that we can finally switch it off. These are things like:

process automation that helps developers perform occasional maintenance tasks

the daily processes that automatically test that we can successfully restore from backups

Once we’ve done that, we’ll realise the rest of the value and savings by switching off the old infrastructure and no longer having to maintain it.

Now that GOV.UK is hosted on a modern, container-based infrastructure, there are a lot of exciting improvements that we can make to further increase developer productivity, reduce running costs and add even more resilience. Many of these things would have been difficult or expensive to achieve with the previous system. For example, we could:

enable developers to run a complete and accurate replica of GOV.UK "locally" on their workstations so that they can find and fix bugs more quickly and easily

improve GOV.UK’s suite of automated tests so that developers receive near-instant feedback if, for example, their change would break an important end-to-end user journey such as publishing a change to an article and seeing the change appear on the website

further enhance GOV.UK’s resilience by hosting in more geographic regions or cloud providers