Preventing outages with resilient architectures

By admin

Fastly’s resilient architecture principles prevent outages, mitigate severity, and deliver on our availability promises without compromising performance. We systematically eliminate “single point of failure” risks and always, ALWAYS, prioritize distributed and resilient solutions that are built to scale.

We did not expect the reliability and distributed architecture of our control plane to be the key differentiating feature that everyone wanted to hear about this week, but here we are. No cloud vendor is immune to outages, but Fastly continuously and proactively acts to address this risk by building extra resilience and redundancy into our network, control plane, and data plane. This work is designed to protect us against catastrophic failures by preventing them from occurring, and by mitigating the severity of their impact if they do occur.

In the spirit of #hugops, we have a lot of empathy for our colleagues in the industry who have experienced outages recently. It’s all of our worst nightmares. We decided to write about our approach to resilience today because it’s a question we’ve been asked numerous times by customers in the last week or so.

Now, let’s get into the decisions we’ve made, and the work we started several years ago to make Fastly more resilient. We’re talking about resilience against everything, from black swan events (like the complete loss of a datacenter), to more common scenarios like internet outages or sophisticated DDoS attacks. This means building resilience into the control and data planes, but also into Fastly’s overall network, traffic handling, and more. By the end of this post you’ll understand why resilience is Fastly’s middle name. And why redundancy is our first name… and why redundancy is also our last name, too.

Distributed solutions reduce single points of failure


Two of the most important principles we apply throughout the Fastly platform are to build systems that are distributed and to remove single points of failure.


About three years ago, Fastly began a formal process to continuously assess and strengthen our architecture, led by an internal cross-functional team called the Fastly Reliability Task Force (RTF). The RTF meets on a regular basis to triage, evaluate, discuss, and prioritize the mitigation of existential risks to the platform. This forum has been hugely instrumental in driving major improvements, while constantly planning what to tackle next. Part of the impetus for launching the task force was recognizing that, in the past, Fastly might not have been prepared to handle a data center power outage. To address the risks we identified, we started a (truly) massive company-wide initiative that we lovingly codenamed “Cloudvolution” for two reasons.

First, we wanted to make an evolutionary, transformative leap in the way we run our entire platform and achieve a more resilient, multi-cloud, multi-region architecture. And second, because we did not think we would ever be saying that name publicly, and sometimes we like silly names. Not everything needs to be an acronym.

Control and data plane resilience

“Cloudvolution” was intended to strengthen both our control plane and our data plane, and bring high availability to core platform services for additional platform resilience. A key goal was to iteratively evolve system design abstractions to improve resilience, have clearer service boundaries, and set ourselves up for easy scaling to the next stage of customer growth (and beyond). Basically, we wanted fewer dependencies, less risk, fewer points of failure, and more resilience.

We knew that this architecture upgrade would be key to maintaining reliable service in the event of a catastrophic failure.

Today, our control and data planes are multi-cloud and multi-region. We worked hard to make it so that Fastly does not depend on a single datacenter, a single availability zone, or a single cloud provider. Our control plane runs in two independent and geographically dispersed regions, with a warm failover to the secondary region if needed. Similarly, our data plane (powering our observability & analytics) also has two independent regions and a warm failover, but resides in a separate cloud provider from the control plane. This effectively limits the ‘blast radius’ of a catastrophic event, making it significantly less likely that something could take out both our ability to observe (data plane) and our ability to take action (control plane).
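To make the warm-failover idea concrete, here is a minimal sketch of the kind of health-check-and-promote loop a multi-region setup relies on. The endpoints, thresholds, and timings are hypothetical illustrations, not Fastly’s actual implementation.

```go
// A minimal warm-failover sketch: prefer the primary region, promote the warm
// standby after several consecutive failed health checks. Endpoints and
// thresholds are invented for illustration.
package main

import (
	"fmt"
	"net/http"
	"time"
)

var regions = []string{
	"https://control-primary.example.com/healthz",   // hypothetical primary
	"https://control-secondary.example.com/healthz", // hypothetical warm standby
}

// healthy reports whether a region answers its health check in time.
func healthy(url string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	const failThreshold = 3 // consecutive misses before failing over
	misses, active := 0, 0

	for i := 0; i < 10; i++ { // bounded loop to keep the sketch finite
		if healthy(regions[active]) {
			misses = 0
		} else {
			misses++
			if misses >= failThreshold && active == 0 {
				active, misses = 1, 0 // promote the warm standby
				fmt.Println("failing over to secondary region")
			}
		}
		time.Sleep(1 * time.Second)
	}
	fmt.Println("active region:", regions[active])
}
```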

We also ensure that our data centers have failovers in place from grid power to uninterruptible power supply (UPS) batteries with generator backups, but it’s not enough just to have these in place on a checklist. As part of our Business Continuity Planning (BCP), we also execute failovers between active and standby regions to continuously test our ability to handle a regional cloud provider outage.

Once we shifted our control and data planes to this better architecture, we knew it was important to make them easy for our engineering teams to build on, so that EVERYTHING Fastly built would be as resilient as possible. We made our control and data planes available as an integration platform so that engineers could spin up new products and features that are safe, scalable, and resilient out-of-the-box – even the smallest tests and alpha projects. This helps us further reduce risk, as smaller engineering teams are not forced to recreate their own version of core systems to launch something new. It’s easy for everyone to build in the safest and most resilient way, but it doesn’t stop there.

The RTF continues to identify areas to improve even as we have tackled our original priorities, and we are already working on the next iteration of our control and data planes. These improvements can take years to fully implement, so it’s important to start working before you need it. We are already making progress on plans to further decouple sticky system abstractions in the control plane over the coming year, which will not only make it more resilient, but also help us to accelerate product development.

Network resilience (beyond the control plane)

Data center failures that impact the control and data planes of a platform are not the only challenges to face when running a global edge network that promises low latency and high availability. Now that we’ve talked through the ways in which our control and data planes are built to be distributed and resilient, here’s a look at the ways in which the Fastly edge network is built to avoid single points of failure, and to mitigate problems automatically, immediately, and intelligently. Most of the time when we talk about resilience at Fastly, we’re talking about our edge network and content delivery services, so let’s dig in.

Resilient handling of traffic and network outages

Traffic anomalies, latency problems, and internet outages are daily realities for a network the size of Fastly. On any given day, internet transit providers collectively experience anywhere from just a few issues all the way up to hundreds of these temporary, short-lived connectivity or performance degradations. In aggregate these are referred to as “internet weather.” Some internet outages are large enough to capture global attention. Most are (relatively) smaller events that pass quickly, but even the “smaller” weather events cause latency and performance degradation along the way that can have serious impacts.

Current industry best practices often employ techniques like Border Gateway Protocol (BGP) routing changes, but because BGP doesn’t have any application-level failure detection capabilities built into it, it can take a long time for a problem to get resolved – sometimes hours. A monitoring or observability system outside of BGP has to detect the issue, infer the problematic routes, plot a solution, and then use BGP to issue instructions to change the network topology and eliminate faulty routes. Once those instructions are issued, BGP is fast to fix the issue, but all the stuff that comes before it can take minutes or hours to get to that point. So BGP isn’t very effective for fine-tuning changes around smaller interruptions or outages in the network. Most of the time the issue has resolved itself by the time the BGP change would have an impact, and it does nothing to help the sites and applications that suffered real consequences for every second of the outage and just had to wait for things to work again.

At Fastly, the fact that we don’t control the entire global network is no excuse: we still find ways to provide better and more resilient service for our customers. Here are some of our innovations for providing our customers, and their end users, with the fastest, most reliable experience available. These advances in edge network operations are only possible due to our modern network architecture and the fact that we have a truly software-defined network that allows us to apply logic at any layer and programmatically scale and adjust networking flows as desired to circumvent internet problems and ensure uptime and reliability.

Keep an eye out for these common themes: 1) our systems are automated, self-healing, and can respond immediately without waiting for human intervention, and 2) they are provisioned across our entire fleet of POPs.

Removing dependencies for resolving “internet weather”

We love problem solving, so the worst thing about internet weather is that it’s not within our control to fix the actual source of the problem! It’s something happening out there on a part of the global internet infrastructure that someone else owns or manages, and whatever event is occurring is out of our control. But certain things ARE in our control, and we’ve developed ways to improve our service and route around bad weather. The first is a technology we call “fast path failover.”

Fast path failover automatically detects and re-routes underperforming edge connections at the transport layer, allowing us to mitigate the impact of internet weather issues that are occurring outside of our own POPs. A lot of internet weather isn’t a full break – often there’s just a lot of latency or other issues, but the link in the network is still technically connected, just heavily degraded, and this causes problems. The standard approach to remediation uses BGP to route traffic away from broken internet paths, and it does an OK job for complete breakages, but it’s a terrible solution for degraded connections.

When a link along the path becomes unavailable, BGP can withdraw the routes involving that link and signal alternative paths, if available. This triggers ISPs to reroute traffic and bypass the issue. But in situations where a path is heavily degraded, but not entirely failed, a BGP route withdrawal might not be triggered at all. Sometimes the service provider has to detect and manually reroute traffic to mitigate the issue, and this process can take anywhere from several minutes to a few hours depending on the provider.

Fast path failover doesn’t wait for BGP to fix things for Fastly customers – if something is failing, we want it to fail fast, reroute fast, and start working again – fast! Fast path failover automatically detects and re-routes underperforming edge connections without waiting for a network-wide resolution issued via BGP. We don’t need to wait for peers or transit providers to tell us that a path is broken; we can see it for ourselves. And we don’t need to wait for their routing solution to try a different route from our end. Our edge cloud servers can determine if connections are making forward progress, infer the health of the internet path, and select an alternate path quickly whenever necessary.

In another win for distributed architecture, we get even faster routing because we don’t rely on centralized hardware routers that bottleneck routing decisions for an entire POP. Our edge servers can act faster in a distributed fashion to make routing decisions on a per-connection basis. This enables precise failover decisions that only reroute degraded connections without impacting healthy ones. This mechanism is remarkably effective at identifying and mitigating internet weather conditions that are typically too small and too short to be accurately detected and mitigated using standard network monitoring techniques. Read more about fast path failover.
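As a rough illustration of the per-connection decision described above, here is a toy sketch of the failover logic: watch whether a connection is making forward progress, and move it to an alternate egress path when it stalls. The stall timeout, path selection, and connection bookkeeping are all invented for the example; this is not Fastly’s implementation.

```go
// A toy model of per-connection failover: a connection that stops making
// forward progress (no new acknowledged bytes) for too long is moved to an
// alternate egress path. All values are invented for illustration.
package main

import (
	"fmt"
	"time"
)

type conn struct {
	id        int
	path      int       // index of the egress path currently in use
	lastAcked int64     // cumulative bytes acknowledged by the peer
	lastMove  time.Time // last time lastAcked advanced
}

const stallTimeout = 3 * time.Second

// checkProgress compares the latest acked byte count with what we saw before.
// A connection with no new acknowledgements for longer than stallTimeout is
// treated as stuck behind "internet weather" and rerouted.
func checkProgress(c *conn, acked int64, numPaths int, now time.Time) {
	if acked > c.lastAcked {
		c.lastAcked = acked
		c.lastMove = now
		return
	}
	if now.Sub(c.lastMove) > stallTimeout {
		old := c.path
		c.path = (c.path + 1) % numPaths // pick an alternate egress path
		c.lastMove = now
		fmt.Printf("conn %d: no forward progress, rerouting path %d -> %d\n",
			c.id, old, c.path)
	}
}

func main() {
	now := time.Now()
	c := &conn{id: 1, path: 0, lastAcked: 1000, lastMove: now}

	// Simulated samples: acks advance, then stall long enough to trigger failover.
	checkProgress(c, 2000, 3, now.Add(1*time.Second)) // progress
	checkProgress(c, 2000, 3, now.Add(3*time.Second)) // stalled, under timeout
	checkProgress(c, 2000, 3, now.Add(6*time.Second)) // stalled past timeout -> reroute
}
```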

To go even further, in cases where the internet weather is sufficiently central in the network topology, there may be no alternate path that exists to move the traffic away from the failed route. Other providers can get stuck behind these issues with no viable alternatives, but we simply don’t take “no” for an answer. Fastly has massive transit and peering diversity that significantly reduces the risk of getting caught behind network bottlenecks when trying to reach our end users.


Smart, automated traffic routing

While fast path failover improves connectivity for content requests from Fastly’s edge cloud platform moving across parts of the internet we don’t control, we have also added Precision Path and Autopilot to improve performance across parts of the network that Fastly can control.


Precision Path is used to improve performance across internet paths between customers’ origin servers and our network, and Autopilot is our automated egress traffic engineering solution. They do amazing things when used in combination, and they let us react immediately without needing to wait for a human to analyze the situation and determine a plan. This is critical because reacting faster prevents issues from cascading and affecting more of the network.

Precision Path

Precision Path continuously monitors all origin connections in every Fastly POP worldwide. When it detects an underperforming origin connection (due to internet weather, for example), it automatically identifies all possible alternative paths to that impacted origin and re-routes the connection to the best alternative in real time. We can often re-establish a healthy origin connection before 5xx errors get served to end users, effectively fixing network issues so fast that it’s like they never existed. Our real-time log streaming feature can also be used to monitor for origin connection rerouting events that may occur on Fastly services.


Precision Path also focuses on reliably delivering content to end users from our edge cloud platform. When delivering this content, we track the health of every TCP connection. If we observe connection-impacting degradation (e.g., congestion), we use fast path failover to automatically switch delivery to a new network path and route around the issue. This automatic mitigation is enabled by default on all of our POPs and applies to all Fastly traffic. No additional configuration is required.
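To show the shape of the “pick the best alternative path to the origin” step, here is a small sketch. The path names, loss threshold, and RTT/loss metrics are hypothetical stand-ins for the live per-connection telemetry the real system works from.

```go
// A toy "best alternative origin path" decision: once the current path is
// judged degraded, choose the healthiest remaining path, preferring low loss
// and then low RTT. Names and metrics are invented for illustration.
package main

import (
	"fmt"
	"sort"
	"time"
)

type originPath struct {
	name    string
	rtt     time.Duration // recent smoothed round-trip time
	lossPct float64       // recent packet loss, percent
}

// healthy is a simple health predicate for the sketch.
func (p originPath) healthy() bool { return p.lossPct < 1.0 }

// bestAlternative returns the best candidate other than the current path.
func bestAlternative(current string, paths []originPath) (originPath, bool) {
	var candidates []originPath
	for _, p := range paths {
		if p.name != current && p.healthy() {
			candidates = append(candidates, p)
		}
	}
	if len(candidates) == 0 {
		return originPath{}, false
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].rtt < candidates[j].rtt
	})
	return candidates[0], true
}

func main() {
	paths := []originPath{
		{"transit-a", 80 * time.Millisecond, 4.5}, // currently degraded
		{"transit-b", 95 * time.Millisecond, 0.1},
		{"peer-c", 60 * time.Millisecond, 0.2},
	}
	if next, ok := bestAlternative("transit-a", paths); ok {
		fmt.Println("rerouting origin connection via", next.name)
	}
}
```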

Autopilot

Autopilot is what enabled us to deliver a record 81.9 Tbps of traffic during the last Super Bowl with zero human intervention over the course of this high-traffic, high-stakes event. Since February 2023 we’ve had many other days where traffic has exceeded Super Bowl levels to set new records, so any day has the potential to be as big as a Super Bowl day. This ability to scale is not just useful once per year. It’s in use every day, all year round, optimizing our traffic and maximizing Fastly’s efficiency.

Similar to fast path failover, Autopilot was built to address shortcomings in BGP. BGP has a “capacity awareness gap” – it can only be used to communicate whether an internet destination can be reached or not. It cannot tell whether there is enough capacity to deliver the desired amount of traffic, or what the throughput or latency would be for that delivery. It’s like a courier saying they can deliver a package and taking it from you, only for you to find out later that it didn’t fit into their car.

Autopilot addresses this issue by continuously estimating the residual capacity of our links and the performance of network paths. This information is collected every minute via network measurements and used to optimize traffic allocation so that we can prevent links from becoming congested.
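A toy version of that capacity-aware allocation step is sketched below: given per-link capacity and current load (invented numbers here), shift traffic off links that are above a target utilization and onto links with headroom. The real system re-measures residual capacity every minute and makes far more nuanced decisions; this only illustrates the idea.

```go
// A toy capacity-aware rebalancing pass: move traffic off links above a target
// utilization onto links with spare headroom. Link names, capacities, and the
// 80% target are invented for illustration.
package main

import (
	"fmt"
	"math"
)

type link struct {
	name     string
	capacity float64 // Gbps
	load     float64 // Gbps currently assigned to this link
}

const targetUtilization = 0.8

// rebalance is a greedy, single-pass reallocation, purely for illustration.
func rebalance(links []link) {
	for i := range links {
		excess := links[i].load - targetUtilization*links[i].capacity
		if excess <= 0 {
			continue // link is under its target utilization
		}
		for j := range links {
			if j == i || excess <= 0 {
				continue
			}
			headroom := targetUtilization*links[j].capacity - links[j].load
			if headroom <= 0 {
				continue
			}
			moved := math.Min(excess, headroom)
			links[i].load -= moved
			links[j].load += moved
			excess -= moved
			fmt.Printf("moved %.1f Gbps from %s to %s\n", moved, links[i].name, links[j].name)
		}
	}
}

func main() {
	links := []link{
		{"transit-a", 100, 95}, // over the 80% target
		{"peer-b", 100, 40},
		{"peer-c", 100, 60},
	}
	rebalance(links)
	for _, l := range links {
		fmt.Printf("%s: %.1f of %.0f Gbps\n", l.name, l.load, l.capacity)
	}
}
```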

Precision Path is lightning fast, but it’s mostly about moving away from bad connections – it doesn’t “know” a lot about the new connection when it makes those decisions. Autopilot has a slightly slower reaction time than Precision Path, but it makes a more informed decision based on several minutes of high-resolution network telemetry data. Rather than just moving traffic away from a failed path (like Precision Path does), it moves larger amounts of traffic toward better parts of the network.

Working together, Precision Path and Autopilot make it possible to rapidly reroute struggling flows onto working paths and periodically adjust our overall routing configuration with enough data to make safe decisions. These systems operate 24/7, but looking at the most recent Super Bowl we can see one example of the efficiency they provide. They rerouted 300 Gbps and 9 Tbps of traffic (respectively), which would have otherwise been delivered over faulty, congested, or underperforming paths and clogged up more of Fastly’s network capacity. These self-managing capabilities enable us to react faster and with higher frequency to potential failures, congestion, and performance degradation issues on our network.

Lastly, while Autopilot brings many benefits to Fastly, it is even better for our customers, who can now be even more confident in our ability to manage events like network provider failures, DDoS attacks, and unexpected traffic spikes – all while maintaining a seamless and unimpacted experience for their end users. Read more about Autopilot and Precision Path.

Automated protection against massive DDoS attacks

Not all network issues are unintentional. A critical part of network resilience is being able to withstand massive Distributed Denial of Service (DDoS) events like the recent Rapid Reset attack. This attack continues to create problems around the internet, but Fastly wasn’t vulnerable because we had an automated system in place that was able to begin identifying and mitigating it immediately using a technique we call “Attribute Unmasking.”

Attribute Unmasking

DDoS attacks have gotten more powerful over time, as well as increasingly fast to scale. They often scale from zero requests per second (RPS) to millions or hundreds of millions of RPS after just a few seconds, and then they might end just as quickly – sometimes terminating in less than a minute. DDoS attacks are also becoming more sophisticated, like the recent Rapid Reset attack, which relied on a characteristic of the HTTP/2 protocol that had not been previously exploited.

For most of the large platforms affected by Rapid Reset this was a novel attack that wreaked havoc, but Attribute Unmasking allowed us to rapidly and automatically extract accurate fingerprints out of the network traffic while we were being hit with Rapid Reset, and it works the same way for other complicated attacks. Every request coming through a network has a huge number of characteristics that can be used to describe the traffic, including Layer 3 and Layer 4 headers, TLS info, Layer 7 details, and more. Our system ingests the metadata from inbound requests on our network and uses it to tell the malicious traffic apart from the good traffic. This allows us to block attack traffic while letting legitimate traffic through.

For faster response times, DDoS protection is handled at the edge of our network, with detection and defense capabilities built into our kernel and network application layer processing stack. This is another instance of a distributed solution (just like fast path failover) that is only possible because our network is completely software defined, allowing us to run functions in a more distributed fashion across our servers in parallel. Our system is also modular, so we can rapidly enhance our detection and mitigation capabilities as new classes of attacks are discovered, without needing to develop an entirely new mechanism to respond. When an attack like Rapid Reset comes along, we simply add a few new functions to our detection and response modules, keeping our response times incredibly short, even for novel attacks. Read more about Attribute Unmasking and the Rapid Reset attack.
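To make the fingerprinting idea concrete, here is a toy sketch of attribute-based traffic classification: group requests by a tuple of attributes and flag combinations whose volume is wildly out of proportion. The attribute fields, thresholds, and sample traffic are all invented; the production system works from far richer metadata and rolling baselines.

```go
// A toy attribute-fingerprinting pass: bucket requests by a tuple of request
// attributes and flag tuples that dominate the traffic mix, which is the rough
// shape of fingerprint-based DDoS detection. Everything here is invented.
package main

import "fmt"

type requestMeta struct {
	srcASN    string // attribute derived from Layer 3/4 information
	tlsJA3    string // TLS handshake fingerprint
	userAgent string // Layer 7 attribute
}

// fingerprint collapses the attributes we key on into one comparable value.
func fingerprint(r requestMeta) string {
	return r.srcASN + "|" + r.tlsJA3 + "|" + r.userAgent
}

func main() {
	// A synthetic traffic sample: a little diverse traffic plus a burst that
	// shares one exact attribute combination.
	var sample []requestMeta
	for i := 0; i < 5; i++ {
		sample = append(sample, requestMeta{"AS64500", "ja3-aaa", "Mozilla/5.0"})
	}
	for i := 0; i < 500; i++ {
		sample = append(sample, requestMeta{"AS64512", "ja3-bbb", "attack-client/1.0"})
	}

	counts := map[string]int{}
	for _, r := range sample {
		counts[fingerprint(r)]++
	}

	// Flag any fingerprint responsible for more than half the sample; a real
	// system would compare against a rolling baseline instead of a fixed ratio.
	for fp, n := range counts {
		if float64(n) > 0.5*float64(len(sample)) {
			fmt.Printf("blocking fingerprint %q (%d of %d requests)\n", fp, n, len(sample))
		}
	}
}
```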

Resilience is a process

There’s a lot of detail in this post, but the main takeaway is that at Fastly:

- We prioritize efforts to think about what could go wrong BEFORE it goes wrong
- We allocate significant resources to improve our architecture
- We continually identify and eliminate single point of failure risks
- We find innovative ways to prepare for, and solve, problems that occur outside of our control

We consider this work to be continuous. We are always working to be prepared for tomorrow, and we are always asking ourselves what else could be done.