ThousandEyes has announced last year's most disruptive outages and what the industry can learn from them.
When Internet outages happen, they can be extremely disruptive to a business.
By preventing users from reaching applications and services, outages can cause major revenue and reputation damage.
While application delivery is dependent on many Internet Service Providers (ISPs), it also increasingly relies on a large and complex ecosystem of Internet-facing services, such as CDN, DNS, DDoS mitigation and public cloud.
These services work together to provide exceptional digital experiences to users, and even brief disruptions can have a significant impact.
“At the same time, enterprises are increasingly relying on Internet transport to connect their sites and reach business-critical applications and services. Gone are the days in which applications are solely hosted in private data centers and office locations are connected primarily by MPLS circuits,” says ThousandEyes product marketing director Angelique Medina.
“The Internet is replacing or supplementing services like MPLS as enterprises embrace SD-WAN technologies. As a result, the Internet is now effectively the enterprise backbone, which, as a ‘best effort’ transport, can have significant yet unforeseen consequences for businesses.”
Over the last year, ThousandEyes has reported on a number of large-scale outages that had ripple effects across the global Internet, impacting enterprises and consumers alike.
The most significant of these outages took place over the northern hemisphere summer and disrupted nearly every top tech company in some fashion.
May 13, 2019: China Telecom outage reveals its global reach
While not the most disruptive outage of 2019, a global and fairly long-lasting outage in China Telecom's network proved to be a harbinger of incidents to come, while also showing how far China Telecom's reach extends beyond mainland China.
For nearly five hours on May 13, 2019, China Telecom experienced substantial packet loss across its backbone, primarily impacting network infrastructure in mainland China, but also affecting China Telecom's network in Singapore and multiple points in the US, including Los Angeles. Over one hundred services were disrupted. Though the outage did not exclusively impact western sites and services, many users of major US brands such as Apple, Amazon, Microsoft, Slack, Workday and SAP would have experienced disruptions over the course of the outage window.
This incident illustrated some important realities about China and its impact on the global Internet that many folks aren't aware of. Specifically, it highlighted that many of the censorship policies that apply to Chinese Internet users may actually be implemented far beyond China's borders and in countries that have very different attitudes and policies related to Internet use.
June 2, 2019: Summer of outages begins with Google Cloud
On June 2, 2019, Google Cloud Platform experienced a significant network outage that impacted services hosted in parts of the us-west, us-east and us-central regions. This outage also impacted Google's own applications, including G Suite and YouTube. The outage lasted more than four hours, which is notable given how critical the service is to business customers.
Google issued an official report of the incident several days later. ThousandEyes vantage points were able to see the outage as it unfolded in real time, revealing its characteristics and scale ahead of more detailed information becoming publicly available.
Beginning at approximately 9am ET in the US, ThousandEyes observed 100% packet loss from global monitors attempting to connect to a service hosted in GCP us-west2-a. Similar losses were seen for sites hosted in several portions of GCP US East, including us-east4-c.
The complete unavailability of parts of Google's network, as seen by ThousandEyes, turned out to be due to Google's network control plane inadvertently getting taken offline. Google later revealed that during the outage period, a set of automated policies determined which services were or were not reachable through the unaffected parts of its network.
One of the most important takeaways from cloud outages is that it's vitally important to ensure your cloud architecture has sufficient resiliency measures, whether on a multi-region or even a multi-cloud basis, to protect against recurring outages. It's reasonable to expect that IT infrastructure and services will sometimes have outages, even in the cloud.
June 6, 2019: An unfortunate series of events takes down WhatsApp for many users
On June 6, 2019, a large number of users around the globe attempting to access the WhatsApp service experienced connectivity issues. ThousandEyes was able to immediately see that 100% packet loss was preventing the service's reachability.
Upon further analysis, ThousandEyes determined the root cause of this packet loss was a massive route leak that steered traffic to China Telecom — a service provider that does not forward any Facebook-related traffic.
The incident was triggered when a Swiss colocation company called Safe Host announced to the Internet that the best way to reach WhatsApp and thousands of IP prefixes was through its network, AS 21217. When Safe Host advertised these routes, they were accepted by China Telecom and further propagated through other ISPs such as Cogent.
Users whose traffic was routed to Cogent and ultimately handed off to China Telecom would have been completely unable to reach the service.
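One way to reason about this kind of leak is to look at the AS paths seen from different vantage points and flag any path that traverses the leaking network. The following is only a minimal sketch of that idea, not a description of how ThousandEyes performs its analysis. AS 21217 comes from the account above; every other AS number is a placeholder from the range reserved for documentation, and the `paths_via` helper and the `observed` data are hypothetical.

```python
LEAKER_ASN = 21217  # Safe Host's AS number, as given in the article


def paths_via(as_paths, asn):
    """Return the observed AS paths that traverse the given AS number."""
    return [path for path in as_paths if asn in path]


# Hypothetical observations. Each path lists AS numbers from a vantage
# point to the origin. 64496-64511 are reserved for documentation and
# do not correspond to real networks.
observed = [
    [64500, 64501, 64510],                     # normal path
    [64502, 64503, LEAKER_ASN, 64504, 64510],  # detour through the leaker
    [64505, 64510],                            # normal path
]

suspicious = paths_via(observed, LEAKER_ASN)
for path in suspicious:
    print("Path traverses AS", LEAKER_ASN, ":", path)
```

In practice the paths would come from BGP route collectors or monitoring agents rather than a hard-coded list.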
It's unclear why China Telecom would accept routes to a service that it censors, but what is clear is the lesson of this outage. BGP route leaks are not uncommon on the Internet. When you rely on the Internet, an ecosystem that is deeply interconnected and vulnerable, you must understand how it works and expect that a glitch in one service provider can have cascading effects on another.
The unfortunate reality is that business risks associated with BGP route leaks and other Internet flaws are greater given the modern enterprise and service delivery landscape.
June 24, 2019: Cloudflare users fall victim to routing mishap
Just a couple of weeks after the massive route leak that impacted WhatsApp users, the Internet experienced yet another route-related incident, this one far more damaging.
On June 24, 2019, for nearly two hours, a significant BGP routing error impacted users trying to access services fronted by CDN provider Cloudflare, including gaming platforms Discord and Nintendo Life.
ThousandEyes analysis found that a significant BGP route leak affected a variety of prefixes from multiple providers. DQE, a transit provider, was the original source of the route leak, which was propagated through Allegheny Technologies, a customer of both DQE and Verizon. Unfortunately, Verizon further propagated the route leak, magnifying the impact.
Sites served through the Cloudflare CDN were impacted for nearly two hours.
This major Internet disruption affected about 15% of Cloudflare's global traffic and impacted services like Discord, Facebook and Reddit. The route leak also affected access to some AWS services.
The root cause of the incident was eventually traced to DQE's use of BGP optimiser software, which created routes to Cloudflare services that were only meant to be used within DQE's internal network. When these routes were accidentally leaked to one of its customers, mayhem ensued.
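The account above does not spell out why the leaked routes attracted so much traffic. BGP optimisers commonly work by generating more-specific prefixes, and routers prefer the longest matching prefix, so a leaked more-specific route wins over the legitimate, broader one. The sketch below illustrates longest-prefix matching under that assumption. The prefixes are taken from a documentation range, and the next-hop labels and the `best_route` helper are hypothetical.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (network, next_hop) pair with the longest matching prefix.

    Returns None when no route covers the destination.
    """
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda item: item[0].prefixlen)


# Legitimate announcement (198.51.100.0/24 is a documentation prefix).
legit = [(ipaddress.ip_network("198.51.100.0/24"), "origin-network")]

# Assumed optimiser output: the same space split into two more-specific /25s.
leaked = legit + [
    (ipaddress.ip_network("198.51.100.0/25"), "leaking-network"),
    (ipaddress.ip_network("198.51.100.128/25"), "leaking-network"),
]

print(best_route(legit, "198.51.100.10")[1])   # route chosen before the leak
print(best_route(leaked, "198.51.100.10")[1])  # route chosen after the leak
```

Once the more-specific routes are visible, every address in the original /24 resolves to the leaking network, even though the legitimate route is still present.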
This incident was yet another reminder of how incredibly easy it is to dramatically alter the Internet service delivery landscape. In a cloud-centric world, enterprises must have visibility into the Internet if they're going to be successful in delivering services to their users.
July 4, 2019: Apple services impacted on Fourth of July
On July 4, 2019, just before 9am PT, users connecting to Apple's website and some of its services, such as Apple Pay, began experiencing significant packet loss for a period of over 90 minutes. This issue prevented many users from successfully connecting to Apple. ThousandEyes route visibility demonstrated that the packet loss was caused by a BGP route flap. A BGP route flap is caused when a routing announcement is made and withdrawn in quick succession, often repeatedly.
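The definition of a route flap can be turned into a simple check: count how often a prefix switches between announced and withdrawn, and flag it when several switches fall inside a short time window. This is a minimal sketch under assumptions that are not in the article. The event format, the 60-second window and the threshold of four transitions are all illustrative, and the prefix is a documentation range.

```python
def transition_times(events, prefix):
    """Return timestamps at which the prefix changed announce/withdraw state."""
    times = []
    last_action = None
    for timestamp, event_prefix, action in sorted(events):
        if event_prefix != prefix:
            continue
        if last_action is not None and action != last_action:
            times.append(timestamp)
        last_action = action
    return times


def is_flapping(events, prefix, window=60, threshold=4):
    """True if at least `threshold` state changes occur within `window` seconds."""
    times = transition_times(events, prefix)
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False


# Hypothetical event log: (seconds since start, prefix, action).
flappy = [
    (0, "192.0.2.0/24", "announce"),
    (10, "192.0.2.0/24", "withdraw"),
    (20, "192.0.2.0/24", "announce"),
    (30, "192.0.2.0/24", "withdraw"),
    (40, "192.0.2.0/24", "announce"),
]
stable = [
    (0, "192.0.2.0/24", "announce"),
    (500, "192.0.2.0/24", "withdraw"),
]

print(is_flapping(flappy, "192.0.2.0/24"))
print(is_flapping(stable, "192.0.2.0/24"))
```

Real routers use more elaborate schemes to dampen flapping routes; the fixed window here is only meant to make the definition concrete.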
While Apple services are certainly important for many Internet users, the fact that the incident occurred early on a holiday seems to have prevented the incident from sparking more than a few user complaints. The lesson from this incident is that outages don't happen in a vacuum. Sometimes even significant outages may go unnoticed (or conversely create significant uproar) simply based on their timing and context.
September 6, 2019: DDoS attackers target the Internet's knowledge base
On September 6, 2019, access to Wikipedia sites from around the world was disrupted for close to nine hours, the result of a massive and sustained Distributed Denial of Service (DDoS) attack. DDoS attacks can overwhelm their target's web infrastructure and also create congestion within service provider networks that can lead to packet loss. These effects are exactly what ThousandEyes observed when Wikipedia came under attack.
During the course of the incident, ThousandEyes saw a significant drop in HTTP server availability from around the world, as well as a dramatic increase in HTTP response times. As a result, users across many regions were unable to establish connections to Wikipedia servers. ThousandEyes also measured packet loss of up to 60% from its global vantage points, a condition that would have further prevented access to Wikipedia sites.
While DDoS events are an unfortunate reality of operating on the Internet, organisations should have visibility into the scope, impact and behaviour of these events and be able to validate that DDoS mitigation steps are effective.