Building a resilient database for Fastly’s Certificate Authority

By admin

This is the latest in a series of posts exploring how we built Certainly, Fastly’s new publicly-trusted Certification Authority. Previously we’ve described some of the architectural decisions behind Certainly. Today we’ll explore one of the major challenges stemming from the decision to create a cloud-like ephemeral environment in which systems are regularly and automatically destroyed and rebuilt.

Like every other CA, Certainly has a robust, reliable and resilient database at the heart of its operations. Certainly’s database architecture allows for efficient storage and retrieval of certificate data and can scale to meet growing business needs. We currently use MariaDB with the InnoDB storage engine, with replication enabled across our data centers. Replication gives us data redundancy and the confidence that we can recover easily in case of disaster. But this wasn’t always the case. Before adopting MariaDB replication, we relied on a MariaDB Galera cluster setup, which proved to be painful, operationally burdensome, and unstable. Every day we would find ourselves facing new problems, which eventually necessitated the transition to MariaDB replication. Here are some of the benefits of this decision:


1. Reduced Complexity


Galera cluster’s multi-primary architecture required all the database nodes across all of our facilities to be in constant communication to ensure data consistency and synchronization. MariaDB replication, by contrast, uses a single primary with multiple replicas. Having a single database node act as the write primary significantly simplifies setup and maintenance, and the configuration is easier, with less operational overhead. With the decreased complexity, we have seen a significant drop in the number of alerts and issues, which in turn has led to a substantial improvement in the effectiveness and morale of the Certainly SRE team.


2. Dynamic Scalability

As we were getting ready to launch, we had to ensure that scalability would not become an issue as we grew our user base. Although the Galera cluster provided us with synchronous multi-primary replication, it came with limitations. Adding new nodes to the existing cluster was a complex and time-consuming process. In contrast, MariaDB replication gives us enough flexibility to add new read replicas as needed without impacting the existing primary’s operations. This lets us scale our read operations horizontally without burdening the team.
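The single-primary, many-replica layout described above can be illustrated with a small routing sketch. This is only a toy model under stated assumptions, not Certainly’s actual tooling: the `ReplicationTopology` class, its method names, and the host names are all hypothetical, and real deployments usually delegate this to a proxy or connection pooler.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ReplicationTopology:
    """Toy model (hypothetical): one write primary plus any number of read replicas."""

    primary: str
    replicas: list[str] = field(default_factory=list)
    _counter: count = field(default_factory=count, repr=False)

    def add_replica(self, host: str) -> None:
        # Registering a new reader does not change the primary's entry.
        if host not in self.replicas:
            self.replicas.append(host)

    def remove_replica(self, host: str) -> None:
        # For example, just before that node is torn down for a rebuild.
        if host in self.replicas:
            self.replicas.remove(host)

    def route(self, is_write: bool) -> str:
        # Writes always go to the single primary. Reads are spread
        # round-robin over the replicas, falling back to the primary
        # when no replica is registered.
        if is_write or not self.replicas:
            return self.primary
        return self.replicas[next(self._counter) % len(self.replicas)]


topo = ReplicationTopology(primary="db-primary")  # host names are made up
topo.add_replica("db-replica-1")
topo.add_replica("db-replica-2")
print(topo.route(is_write=True))   # db-primary
print(topo.route(is_write=False))  # db-replica-1
```

The point of the sketch is that growing or shrinking the replica list only changes where reads land; the write path stays fixed on one node.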


3. Use Case-based Replicas

Thanks to the new database design, we can create additional replicas for specific use cases such as read operations, backups, reporting and analytics, and more. In contrast to Galera, adding or removing nodes no longer affects operations, so the primary can continue operating as efficiently as before.


4. Ephemerality

Certainly embraces the concept of ephemerality wholeheartedly. What that means for our database nodes is that they are rebuilt from scratch on a regular cadence. The conventional view is that ephemerality applies only to stateless components, whereas databases are designed to store and persist data. We set out to challenge that view with an infrastructure design that gives us more agility, scalability and resilience. As stated earlier, Galera needs its nodes to be in constant communication, which did not sit well with the ephemeral nature of our infrastructure. During rebuilds, nodes were frequently exiting and rejoining the cluster, triggering Galera to recalculate quorum. This, coupled with known factors such as backups and network connectivity, along with some causes we never identified, resulted in Galera losing quorum and shutting down more often than expected. If the database is down, Certainly is down. With the new design, we can take down replicas for rebuilds without worrying about an adverse effect on the primary node, which makes the system far more robust.
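To make the quorum problem concrete, here is a deliberately simplified sketch of a strict-majority check. It is an assumption-laden illustration, not Galera’s implementation: Galera’s real calculation is weighted and treats nodes that leave gracefully differently, whereas this model assumes equal weights and nodes that simply disappear. The function names are hypothetical.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """Strict majority: more than half of the known members must be reachable."""
    if cluster_size <= 0 or not 0 <= reachable <= cluster_size:
        raise ValueError("need 0 <= reachable <= cluster_size and cluster_size > 0")
    return 2 * reachable > cluster_size


def survives_rebuild(cluster_size: int, rebuilding: int, unreachable: int = 0) -> bool:
    """Does the cluster keep quorum while some nodes are being rebuilt
    and others are unreachable for unrelated reasons (simplified model)?"""
    reachable = max(cluster_size - rebuilding - unreachable, 0)
    return has_quorum(reachable, cluster_size)


# Three-node cluster, one node being rebuilt: two of three is a majority.
print(survives_rebuild(cluster_size=3, rebuilding=1))                 # True
# Same rebuild plus one network blip: one of three is not a majority.
print(survives_rebuild(cluster_size=3, rebuilding=1, unreachable=1))  # False
```

Under this model, every routine rebuild uses up part of the failure margin, so one additional fault during a rebuild window is enough to lose quorum.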


5. Effortless Failovers

To manage failovers, we employ MariaDB Orchestrator, an open-source solution. Around Orchestrator, we have built specialized tools that handle health checks, detection of failures and performance degradation, and failovers as necessary. Orchestrator maintains an up-to-date and accurate view of the database cluster’s topology, including the role of each node (primary or replica) and the relationships between nodes. This information helps us manage failovers and keep the database in good shape. It also helps reduce the possibility of data loss and of divergence during failover.
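One way a topology view can reduce data loss during failover is by promoting the healthy replica that has applied the most changes from the old primary. The sketch below illustrates that idea only; it is not Orchestrator’s actual algorithm or API, and the `Replica` type, the `applied_seq` field, and the host names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    host: str
    healthy: bool
    applied_seq: int  # hypothetical: last change applied from the primary


def pick_promotion_candidate(replicas: list[Replica]) -> Optional[Replica]:
    """Return the healthy replica that is furthest ahead, or None if none is healthy."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None
    # The most advanced replica has the least to lose. Ties are broken
    # by host name (greatest wins) so the choice is deterministic.
    return max(healthy, key=lambda r: (r.applied_seq, r.host))


topology = [
    Replica("db-replica-1", healthy=True, applied_seq=1040),
    Replica("db-replica-2", healthy=True, applied_seq=1042),
    Replica("db-replica-3", healthy=False, applied_seq=1050),
]
candidate = pick_promotion_candidate(topology)
print(candidate.host if candidate else "no healthy replica")  # db-replica-2
```

Note that the unhealthy replica is skipped even though it is furthest ahead; a production tool would weigh many more factors, such as data center placement and replication lag.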


6. Future Flexibility

With our current database design, we have the flexibility to expand and change along with the PKI industry’s constantly evolving landscape.

To conclude, the transition from Galera to MariaDB replication was a pivotal decision that has given us enhanced scalability, improved performance and simplified operations. It was an informed choice that has positioned us well to face new challenges. We feel confident that we are equipped with the tools necessary to evolve and expand our reach to global customers.