What the Facebook blackout should tell you about managing modern IT networks

Facebook’s blackout in October 2021 was an epiphany for many of the billions of people who use Facebook (FB) services across the world: they realized how much their daily life depended on a single application provider and how fragile their connection to the rest of the world was. The blackout lasted only 6 hours, but this was enough to paralyze many real-life activities that depend heavily on real-time communications through the various FB applications (Facebook, WhatsApp, Instagram…).

Most Facebook traffic relates to consumer activity and, to a lesser degree, business activity. Even so, it is highly likely that the FB outage affected IT teams, because users unable to connect to their usual services would have complained about the network. IT professionals know this is what end users tend to do: when something doesn’t work, they blame the network first.

What really happened? It turns out it was indeed a network issue, but on Facebook’s network, not on home workers’ or enterprise networks! The root cause seems to have been a network configuration change that wiped out all BGP routes announced by the Facebook network. Although routes were immediately regenerated, none of the routes to the DNS services for Facebook applications were re-announced to peers at that time. In turn, DNS services failed because the DNS servers’ IP addresses on Facebook’s network were no longer routable from the rest of the world, which resulted in global services being unavailable.
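The failure chain described above (name resolution breaks first, so nothing else can work) suggests a simple first triage step from the end-user side: check whether the name resolves, and only then whether the resolved address accepts a connection. The following is a minimal sketch using only the Python standard library. The function names, the result labels, and the injectable `resolve` and `connect` parameters are assumptions made for illustration, not part of any product. Note that the system resolver may answer from a cache, so a successful lookup does not prove the provider's own DNS servers are reachable.

```python
import socket
from typing import Callable, List


def default_resolve(host: str) -> List[str]:
    """Resolve a hostname to IP addresses with the system resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def default_connect(ip: str, port: int, timeout: float) -> None:
    """Open and close a TCP connection; raises OSError on failure."""
    with socket.create_connection((ip, port), timeout=timeout):
        pass


def triage(
    host: str,
    port: int = 443,
    timeout: float = 3.0,
    resolve: Callable[[str], List[str]] = default_resolve,
    connect: Callable[[str, int, float], None] = default_connect,
) -> str:
    """Classify a service as 'dns_failure', 'unreachable' or 'reachable'."""
    try:
        addresses = resolve(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "dns_failure"
    if not addresses:
        return "dns_failure"

    # The name resolves; try each address until one accepts a connection.
    for ip in addresses:
        try:
            connect(ip, port, timeout)
            return "reachable"
        except OSError:  # refused, timed out, no route, ...
            continue
    return "unreachable"
```

Calling `triage("www.example.com")` from a user's machine gives a first hint of where to look. A `dns_failure` result for one provider, while other names still resolve, points away from the local network.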

While Facebook network engineers probably spent the most stressful 6 hours of their working lives figuring out the problem and fixing it, it is highly likely that most of the rest of us outside Facebook scratched our heads and wondered whether the problem was on our end, until the news broke that Facebook was having issues. Rebooting gateways and smartphones and opening IT tickets for network problems probably happened more than it should have.

You may also wonder: what if this had been a business-critical service that went down? Many business-critical enterprise apps have now moved to cloud-based Software as a Service (SaaS). Much like Facebook services, SaaS services are hosted on a public cloud that is accessible through the Internet and may suffer from blackouts or service degradations that are not under the control of IT teams. Think of collaboration tools like Teams and Zoom, desktop applications like Office 365, CRM tools like Salesforce, and so on. Problems reaching these services can arise anywhere: in enterprise networks, in end users’ remote networks, on their Internet links, or at the service platform locations, as happened to Facebook. Whatever the actual cause might be, customers are probably going to complain and blame the network first!

Three capabilities are becoming important assets for an efficient IT organisation: having visibility into the health of critical cloud-hosted application services (aka SaaS applications) from end-user locations, proactively identifying service degradations, and quickly isolating where a problem is coming from. Most companies now operate with flexible work arrangements, with users connecting from offices or from home and using tools that reach third-party hosted applications over the Internet. These SaaS platforms have become critical to company business.

While there is not much an IT organisation can do when the failure is on the Internet or at the SaaS provider, it is very important to quickly identify, or rule out, third-party services as the cause of a business interruption. When the issue turns out not to be under direct IT responsibility, being able to advise end users and management with confidence is key to reinforcing trust in the IT team. In addition, having a historical view of the overall quality and performance of connections to remote applications, and being able to demonstrate downtime or service degradation to third-party service or application providers, will be a strategic tool to push for better service levels or to claim penalties for service level agreement violations.
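To make the idea of historical evidence concrete, here is a small sketch of how a series of periodic probe results could be turned into an availability figure and a list of outage windows to compare against a service level agreement. The `Probe` record, the function names, and the sample data are hypothetical; a real monitoring system would store richer measurements such as latency, loss, and probe location.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Probe:
    timestamp: int  # seconds since the epoch
    ok: bool        # True if the service answered correctly


def availability(probes: List[Probe]) -> float:
    """Percentage of probes that succeeded."""
    if not probes:
        raise ValueError("no probes recorded")
    return 100.0 * sum(p.ok for p in probes) / len(probes)


def outages(probes: List[Probe]) -> List[Tuple[int, int]]:
    """(first_failed, last_failed) timestamps for each run of failures."""
    runs: List[Tuple[int, int]] = []
    start = last = None
    for p in sorted(probes, key=lambda p: p.timestamp):
        if not p.ok:
            if start is None:
                start = p.timestamp  # a new outage begins
            last = p.timestamp
        elif start is not None:
            runs.append((start, last))  # outage ended at the previous probe
            start = None
    if start is not None:
        runs.append((start, last))  # still failing at the end of the data
    return runs


history = [Probe(0, True), Probe(60, False), Probe(120, False), Probe(180, True)]
print(availability(history))  # 50.0
print(outages(history))       # [(60, 120)]
```

Reporting outage windows alongside the percentage matters because a provider can dispute a single number more easily than a list of timestamped failures observed from several locations.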

Keysight provides visibility solutions to continuously monitor and troubleshoot connections to the cloud, SaaS, remote, and internal networks. Be ready to quickly triage the cause of the next breakdown and, more importantly, to identify the daily hiccups in your end users’ ability to connect efficiently to their SaaS-based services. Check us out at the link below to find out more about our solutions!
