Learning from the October 20th AWS Outage: Early Detection with Eggplant Monitoring

A major Amazon Web Services (AWS) outage on October 20, 2025 affected internet services across the UK, the wider EMEA region and other parts of the world [1] [2]. The incident showed that cloud infrastructure underpins almost every kind of digital service, including banking apps, government portals, communication tools and smart devices, and that a single provider outage can trigger failures across many systems. In this blog, we’ll recap what happened during the outage, examine how third-party components turned a single cloud failure into broken user experiences (even on websites that weren’t hosted on AWS), and discuss how Eggplant Monitoring can help organizations detect such outages early, before customers are impacted. We’ll also dive into how Eggplant’s real-world synthetic monitoring works (real browser tests, client/server performance data, proactive alerting, etc.), and why it’s crucial for DevOps teams aiming to catch issues in key user journeys (including third-party services) quickly and reliably.

The AWS Outage Impact

In the early hours of October 20th 2025, AWS experienced a major operational issue in its US-EAST-1 region, initially related to DNS resolution for a core database service (Amazon DynamoDB) [1]. Although the fault originated in Virginia, its effects spread worldwide. Platforms across the internet began returning errors and slowing down from 7:30 AM BST. In Europe the outage was particularly disruptive because it hit at the start of the Monday working day.

Key services affected in the UK and EMEA

Banking & Finance

Customers of major UK banks (Lloyds Bank, Halifax, Bank of Scotland) found themselves unable to access mobile or online banking [2][3]. The outage tracking website Down Detector™ recorded more than 6,900 problem reports about Lloyds by 9:31am [4]. Users were locked out of banking apps by repeated login failures, and some saw payment transactions rejected [3]. The outage hit digital finance platforms as well as traditional banks, including Coinbase (crypto exchange), Xero (accounting SaaS) and Square (payment processing) [2]. This meant everything from personal banking to business payment tools was disrupted.

Government Services

The UK HM Revenue & Customs (HMRC) site experienced issues, with users unable to log in or complete tasks [1][4]. Similarly, Department for Work and Pensions (DWP) payment systems faced potential delays [2], a concerning prospect on a benefits payday. Public services that relied on AWS hosting or AWS-backed APIs experienced partial disruption.

Enterprise Applications

Enterprise tools including Slack experienced failures that blocked workplace communication for many users [1]. The messaging application Signal had connection problems that disrupted service for its users [4]. The AWS customer support portal also became unavailable, which prevented customers from submitting support requests [4].

Web Apps & IoT

A long list of popular consumer apps fell victim: Snapchat, Zoom, Amazon.com, Prime Video, Fortnite, Duolingo, Roblox, Ring doorbell cameras and more all reported outages or degraded service [1] [4]. The outage showed that entertainment services and IoT devices alike depend on cloud infrastructure to operate.

Telecom & Other Industries

Some mobile carriers and airline systems saw glitches. There were minor flight delays reported as airline reservation systems had problems [4], and UK telecom providers like Vodafone had service blips. Any business that relied on cloud-based applications was exposed to the disruption.

In those few hours of downtime, the financial and operational impact was huge – not least because so many companies had critical processes tied to AWS.

This outage underlined the fragility of concentrating critical services on a single cloud provider. Many of the UK outages were not caused by those organizations’ own systems failing, but by a third-party infrastructure issue outside of their control.

The Third-Party Domino Effect – When “Hidden” Dependencies Fail

One unexpected effect of the incident was that websites and applications hosted outside AWS also became unavailable or degraded. Many organizations might have thought their services were resilient, only to discover that a third-party component in their environment relied on AWS, creating a single point of failure. The main website stayed online, but failing third-party services caused essential functions to slow down or stop working.

Consider a few examples of this domino effect of third-party failures:

Banking Multi-Factor Authentication (MFA)

Many banking apps (including those of Halifax and Lloyds) remained online at a basic level: customers could launch the app or webpage, but login wasn’t possible. Why? It appears the two-factor authentication step (such as receiving an SMS code or push notification) was not working. If the MFA system or SMS gateway was hosted on AWS (or used AWS-dependent services), it would time out or return an error at this step. Many users could not access their accounts because they were shown error messages whenever they tried to log in [3]. In practice, the banks’ core systems might have been fine in their own data centres, but when cloud-based authentication components fail they effectively block users from accessing accounts.

Payment Gateways and Transaction Services

Consider an e-commerce website that runs in its own data centre or on another cloud platform: its home page keeps working normally during an AWS outage. However, at checkout, it relies on a third-party payment processor like Stripe or PayPal (which may in turn have services running on AWS) to complete the transaction. If that external payment API stops responding during the outage, checkout either hangs or fails outright. We saw analogous issues in finance: some UK banking customers found that even if they were already logged in, they couldn’t send payments or saw card transactions declined unexpectedly [3]. This hints that back-end payment processing systems (or fraud check services, etc.) failed to respond. In other words, transactions were blocked by a domino effect, even though the user interface was up.

Website Performance and User Experience

Websites use third-party Content Delivery Networks (CDNs) and cloud storage to distribute static assets such as images, scripts and styles, because this improves speed and scalability. A lot of these are on AWS (Amazon CloudFront, S3 buckets, etc.). During the outage, websites remained reachable, but essential elements such as product images, CSS and JavaScript files either failed to load or loaded very slowly. Users would experience partially broken pages or features. This slowdown happens because the browser keeps trying to contact the unreachable AWS content server; as one expert noted, when DNS fails to find the server, “devices… slow down as they try to locate it, and eventually just stop trying” [4]. Even a homepage made up mostly of text will load slowly or break if its design elements and interactive features depend on files stored in an AWS CDN. Key areas of a site (menus, forms, etc.) might become unresponsive due to one stuck third-party request.

Analytics, Ads, and Other Embedded Tools

Modern websites contain many third-party tools: analytics trackers, tag managers, maps, social media widgets, live chat pop-ups, A/B testing scripts and advertising networks. Many of these services run on AWS. During the outage, these components either stopped working without warning or made pages take excessively long to load. A travel website that uses a third-party maps API to show hotel locations would be unable to display its maps if that API was hosted on AWS. A support chat widget that depends on AWS infrastructure would become unavailable, leaving customers without assistance. Even when such failures do not stop core transactions, they make the site harder to use and reduce the chance of a successful transaction (a user who cannot view product images or check an address on a map will struggle to complete a purchase). Moreover, any synchronous third-party script can hold up other page elements. In essence, even if your primary site is up, your users experience the outage via these “hidden” dependencies.
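As a rough illustration of how such “hidden” dependencies can be surfaced, the sketch below scans a page's HTML for resources loaded from other hosts and flags scripts that load synchronously. It is a minimal example using only the Python standard library; the class name, the sample page and every `.example` hostname are hypothetical, and it says nothing about how Eggplant Monitoring itself inventories dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags that make the browser fetch a resource (plain <a> links are ignored).
RESOURCE_TAGS = {"script", "img", "link", "iframe"}


class ThirdPartyAudit(HTMLParser):
    """Collects external hosts a page loads from and flags blocking scripts."""

    def __init__(self, first_party_host: str):
        super().__init__()
        self.first_party_host = first_party_host
        self.third_party_hosts: set[str] = set()
        self.blocking_scripts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        attributes = dict(attrs)
        url = attributes.get("src") or attributes.get("href")
        if not url:
            return
        host = urlparse(url).hostname
        # Relative URLs have no hostname, so they are first-party.
        if host is None or host == self.first_party_host:
            return
        if host.endswith("." + self.first_party_host):
            return
        self.third_party_hosts.add(host)
        # A classic <script src> without async/defer blocks HTML parsing.
        is_deferred = (
            "async" in attributes
            or "defer" in attributes
            or attributes.get("type") == "module"
        )
        if tag == "script" and not is_deferred:
            self.blocking_scripts.append(url)


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><head>
  <link rel="stylesheet" href="/css/site.css">
  <script src="https://tags.analytics.example/tag.js"></script>
  <script async src="https://chat.widget.example/loader.js"></script>
</head><body>
  <img src="https://cdn.assets.example/hero.jpg">
  <a href="https://partner.example/">Partner</a>
</body></html>
"""

audit = ThirdPartyAudit("shop.example")
audit.feed(SAMPLE_PAGE)
print(sorted(audit.third_party_hosts))
print(audit.blocking_scripts)
```

A static scan like this only sees what is in the initial HTML. Dependencies injected later by JavaScript would need a real browser to observe, which is one reason full-browser monitoring matters.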

All of these examples illustrate that, even if your own systems stay operational, an outage at a cloud provider can pull the rug out from under critical pieces of your service. Users don’t care why they can’t log in or pay or see content, only that it’s broken. As far as they are concerned, the website has failed. Leaning on Gustave Flaubert and updating his quote slightly, “There is no truth, there is only customer perception.” This is why it’s vital for IT and DevOps teams to monitor the entire user journey, including third-party calls, rather than only checking whether the server is up.

The outage exposed a common vulnerability: organizations do not protect themselves against failures that originate with external providers. As one expert noted, “so many online services rely upon third parties for their infrastructure, and this shows problems can occur even in the largest providers… small errors can have widespread impact.” [4] Incidents like this unfold faster than manual checks can keep up with. Companies need automated eyes on all the moving parts of their user experience to catch issues immediately.

Proactive Outage Detection with Eggplant Monitoring

Eggplant Monitoring provides real-time performance data that helps organizations detect outages before their users are disrupted.

Eggplant Monitoring is designed exactly for the challenge described above: it provides independent, proactive monitoring of your website’s actual user experience across first-party and third-party components. During the AWS outage, Eggplant Monitoring would have detected breakdowns in essential customer journeys before customers filed complaints or social media lit up. Here’s how:

Eggplant Monitoring simulates real user behaviour across your website instead of performing basic uptime checks. It drives a full real browser (such as Chrome), so it experiences your application the same way a human user would. This lets it catch failures such as a broken login process, a checkout button that does nothing, or a page that freezes while a third-party script loads. For example, an Eggplant monitor could follow a bank login journey: open the application, enter credentials, request the MFA code and verify the login. During the AWS outage, the MFA step in that synthetic journey would have failed and raised an alert. This goes beyond standard uptime tracking because it shows whether pages and routes are actually usable, not merely reachable. By testing actual end-to-end transactions, including multi-step flows, Eggplant Monitoring provides an outside-in view of what customers experience.
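The idea of a multi-step synthetic journey that reports the first failing step can be sketched in a few lines of Python. This is a simplified illustration, not Eggplant's implementation: the step functions are hypothetical stand-ins for real browser actions, and the MFA timeout is simulated.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JourneyResult:
    passed: bool
    failed_step: Optional[str] = None
    error: Optional[str] = None


def run_journey(steps: list[tuple[str, Callable[[], None]]]) -> JourneyResult:
    """Run the steps in order and stop at the first one that raises."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return JourneyResult(passed=False, failed_step=name, error=str(exc))
    return JourneyResult(passed=True)


# Hypothetical step implementations; a real monitor would drive a browser here.
def step_succeeds() -> None:
    pass


def mfa_times_out() -> None:
    raise TimeoutError("MFA code not delivered within 30s")


login_journey = [
    ("open application", step_succeeds),
    ("enter credentials", step_succeeds),
    ("request MFA code", mfa_times_out),
    ("verify login", step_succeeds),
]

result = run_journey(login_journey)
print(result.passed, result.failed_step, result.error)
```

Because the runner stops at the first failure, the result names the exact step where the journey broke, which is far more actionable than a generic "site down" signal.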

Independent, Client-Side Perspective

Eggplant performs monitoring from outside your firewall, from multiple geographic locations and networks, so it sees the same performance and availability that real users do. Importantly, it independently monitors both the client-side and server-side aspects of the transaction. On the client side, it tracks indicators such as page load duration, browser rendering speed and client-side errors (client/UX metrics such as First Contentful Paint, DOM load and others). On the server side, it captures problems with API requests, HTTP errors and slow or failed server responses. A slow or failing API call shows up as a longer page load time, an error code, a console error or a missing page element. This combined view makes it possible to tell whether the cause is an unresponsive server or a delayed external dependency. In an outage scenario, such data is golden: teams can quickly tell if a hang is due to their own backend or a hung third-party request.
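To make the "own backend or third party?" question concrete, here is a small sketch that sorts failed or slow requests from a page load into first-party and third-party buckets by hostname. The record structure, the 5-second threshold and all hostnames are assumptions chosen for illustration; they do not reflect Eggplant's data model.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class RequestRecord:
    url: str
    status: Optional[int]  # None means no response (timeout or DNS failure)
    duration_ms: float


def is_failure(record: RequestRecord, slow_ms: float) -> bool:
    """Treat missing responses, 5xx errors and very slow calls as failures."""
    if record.status is None or record.status >= 500:
        return True
    return record.duration_ms > slow_ms


def attribute_failures(
    records: list[RequestRecord],
    first_party_host: str,
    slow_ms: float = 5000.0,
) -> dict[str, list[str]]:
    """Group the hosts of failing requests by who owns them."""
    report: dict[str, list[str]] = {"first_party": [], "third_party": []}
    for record in records:
        if not is_failure(record, slow_ms):
            continue
        host = urlparse(record.url).hostname or ""
        own = host == first_party_host or host.endswith("." + first_party_host)
        report["first_party" if own else "third_party"].append(host)
    return report


# Hypothetical network log from one checkout page load.
records = [
    RequestRecord("https://www.shop.example/checkout", 200, 420.0),
    RequestRecord("https://api.payments.example/v1/charge", None, 30000.0),
    RequestRecord("https://cdn.assets.example/app.js", 200, 180.0),
]

report = attribute_failures(records, "shop.example")
print(report)
```

In this sample the site's own server answered quickly while the payment API never responded, so the report points at the external dependency rather than the first-party backend.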

Proactive Alerts with No False Positives

Eggplant Monitoring sends instant alerts to your team through the notification methods you choose (email, SMS, webhook integrations with PagerDuty or Slack, and more) when issues occur. The system is tuned to be extremely reliable and noise-free: one of Eggplant’s core principles is “no false-positive” alerts. It performs "double testing": when a test fails, the error is confirmed by automatic retesting from two independent locations. An alert is only triggered when both tests fail, which minimizes false alarms. This means that when your ops team gets an Eggplant alert at 7:32am on Monday saying “Critical – Login journey failed at MFA step,” you can trust that it’s a real problem. Engineers can act on an Eggplant Monitoring alert straight away, knowing that it warrants investigation. In an event like the AWS outage, fast and accurate alerts are what allow teams to limit the disruption customers see.
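The confirmation logic behind this kind of double testing can be sketched as follows. This is one plausible reading of the description, assuming an alert fires only when the initial failure is reproduced from every retest location; the function name and location names are hypothetical.

```python
from typing import Callable, Iterable


def confirmed_failure(
    run_test: Callable[[str], bool],
    primary_location: str,
    retest_locations: Iterable[str],
) -> bool:
    """Return True only if the primary test and every retest fail.

    run_test(location) returns True when the journey passes from that location.
    """
    if run_test(primary_location):
        return False  # the primary test passed, so there is nothing to confirm
    return all(not run_test(location) for location in retest_locations)


# Simulated outcomes per location: True = journey passed, False = failed.
genuine_outage = {"london": False, "frankfurt": False, "dublin": False}
local_blip = {"london": False, "frankfurt": True, "dublin": True}

alert_on_outage = confirmed_failure(
    genuine_outage.__getitem__, "london", ["frankfurt", "dublin"]
)
alert_on_blip = confirmed_failure(
    local_blip.__getitem__, "london", ["frankfurt", "dublin"]
)
print(alert_on_outage, alert_on_blip)  # prints: True False
```

Requiring agreement between independent locations trades a short confirmation delay for far fewer false alarms, since a network glitch at one test agent is unlikely to be reproduced elsewhere.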

Eggplant Monitoring goes well beyond basic ping checks because it grew out of software testing expertise. Some technical highlights:

Full-Browser Monitoring for Realistic User Journeys

It runs real Chrome browsers (not headless) on distributed agents, executing the actual JavaScript and loading all assets in a page. This means it can monitor modern web applications, including those built on SPA frameworks and dynamically delivered content. Many simpler monitors or APM tools can’t do this.

Managed, Secure, and Integrated Monitoring-as-a-Service

Eggplant Monitoring is delivered as a managed service. This means the Eggplant team assists in creating and maintaining the monitoring scripts (synthetic user journeys) and adapting them if your application changes over time. The monitoring infrastructure itself is kept secure and resilient through measures such as browser patching and the use of multiple internet service providers. DevOps teams get active monitoring without taking on extensive maintenance work. Results are available through a dashboard and a native mobile app, while webhooks and APIs deliver alerts to systems such as Slack, PagerDuty and CI/CD pipelines for automated incident response. It therefore fits into modern DevOps toolchains built around automated, immediate response.

Multi-Channel Workflow Testing and End-to-End Validation

It can incorporate complex steps in user journeys, including things like waiting for an SMS code or an email. In fact, Eggplant Monitoring can integrate SMS messages into a user journey (for testing 2-factor auth flows). It’s capable of checking email delivery, file downloads, and other multi-channel aspects of a workflow. This breadth allows true end-to-end monitoring: for example, Eggplant can confirm that an email actually arrives in an inbox and measure how long the whole process takes.
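Waiting for an out-of-band message and timing its arrival is essentially a polling loop with a deadline. The sketch below shows that pattern in isolation; the inbox, the fake clock and all parameter values are hypothetical, and a real monitor would poll an actual SMS or email gateway instead.

```python
import time
from typing import Callable, Optional


def wait_for_message(
    fetch: Callable[[], Optional[str]],
    timeout_s: float = 60.0,
    poll_interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[Optional[str], float]:
    """Poll fetch() until it returns a message or the timeout expires.

    Returns (message or None, seconds elapsed).
    """
    start = clock()
    while True:
        message = fetch()
        elapsed = clock() - start
        if message is not None:
            return message, elapsed
        if elapsed >= timeout_s:
            return None, elapsed
        sleep(poll_interval_s)


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Simulated inbox: empty on the first two polls, then the code arrives.
deliveries = iter([None, None, "123456"])
fake = FakeClock()

code, delivery_time = wait_for_message(
    lambda: next(deliveries), clock=fake.time, sleep=fake.sleep
)
print(code, delivery_time)
```

Injecting the clock and sleep functions keeps the helper testable without real waiting. In production the defaults (`time.monotonic` and `time.sleep`) apply.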

Deep Diagnostics and Evidence-Based Troubleshooting

The solution offers rich data and diagnostics for each test run. When a step fails, Eggplant captures the HTTP status codes, error messages and a page screenshot for that step. For example, a report for a failed checkout might show a screenshot of a spinner reading "processing payment" alongside a network log showing a timeout from api.payment.com. This makes troubleshooting immensely faster: it provides evidence and insight when third parties fail.

The value of Eggplant Monitoring is that it acts as a first responder: it watches continuously, so you don’t have to discover problems the costly way. As one Eggplant customer put it, “your monitoring service saves us so much time and gives us the ability to eliminate the problem immediately”. Proactive detection means you can start triaging an AWS outage (or any outage) at 7:30, not 9:00 when the business day is in full swing and users are frustrated.

Strengthening Resilience with Proactive Monitoring

The October 20, 2025 AWS outage was a wake-up call for many organizations in the UK and EMEA. It demonstrated that cloud concentration risk and third-party dependencies are very real operational threats [4]. Even a best-in-class provider like AWS occasionally fails, and when it does the disruption is broad and unpredictable, hitting banks, governments and countless digital businesses at the same time.

The core lesson for IT and DevOps professionals is the value of visibility. You need to know immediately when users are unable to complete transactions on your site, and ideally why, even if the cause is outside your own systems. Eggplant Monitoring provides independent monitoring data from the user’s perspective that complements your existing APM and infrastructure monitoring tools. By tracking third-party components and simulating user journeys, it acts as an early warning system for incidents such as the AWS outage.

The AWS outage showed that every application is part of an interconnected web of digital services. To protect your user experience (and your brand), you must monitor not just your own applications, but the entire journey your customer takes, including the third-party services along the way. Eggplant Monitoring offers a robust way to do exactly that: continuous testing of your live services, with real browsers and real-world conditions, ensuring you’re the first to know when something breaks. It’s like having a synthetic “user” on duty at all times, clicking and scrolling through your site, and instantly reporting if they encounter a problem.

An outage like this will inevitably happen again, whether it stems from a cloud platform problem, a content delivery network breakdown or some other failure. Organizations need to invest in active monitoring and preparedness, because that is what turns unexpected incidents into manageable events. The goal is that next time “the internet breaks,” your team isn’t in the dark, but already troubleshooting and guiding your users through the storm. Find out more and get your free trial here.
