As you might have heard, storage servers belonging to a well-known cloud computing services provider were down for a time on 28 February. Given the provider’s popularity, it was no surprise that the outage affected a large number of websites.
Of course, these things happen from time to time. We’ve got so used to the Internet being so reliable that big outages inevitably hit the headlines, but they remain something we have to be prepared for.
And that’s the point. The infrastructure on which your website depends is always at risk.
So the question becomes ‘what can you do to mitigate that risk?’
1) Be the first to know when something goes wrong
This is why you have synthetic monitoring. This is why you’ve set up alerts. You don’t want to be taken by surprise when customers start complaining. Time is money, and you need to be able to put your contingency plans into operation as early as possible.
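As a rough illustration of the alerting idea, the sketch below decides whether to raise an alert from a series of synthetic check results. The `CheckResult` shape, the `shouldAlert` function and the default thresholds are all assumptions made for this example, not the API of any particular monitoring product.

```typescript
// One result from a synthetic check (hypothetical shape, for illustration only).
interface CheckResult {
  ok: boolean;        // did the request complete successfully?
  durationMs: number; // how long the request took
}

// Alert when the most recent `threshold` checks all failed or ran over the
// time budget. Requiring consecutive failures avoids alerting on a single blip.
function shouldAlert(
  results: CheckResult[],
  threshold: number = 3,
  budgetMs: number = 10000,
): boolean {
  if (results.length < threshold) {
    return false; // not enough data yet to judge
  }
  return results
    .slice(-threshold)
    .every((r) => !r.ok || r.durationMs > budgetMs);
}
```

A lower threshold gets you notified sooner at the cost of more false alarms, so the right values depend on how often the checks run.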
2) Understand the impact
Almost all websites depend on content from multiple servers and from multiple providers. Retail and news sites in particular tend to use a lot of third parties for advertising, retargeting, analytics, social media integrations, live chat and dynamic content. And it won’t always be immediately obvious where all that content is coming from.
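One way to get a first picture of where a page’s content comes from is to group the URLs it requests by host. The sketch below is a minimal version of that idea; `thirdPartyHosts` is a hypothetical helper, and it assumes that anything outside the given domain and its subdomains counts as third-party.

```typescript
// Count requested resources per host outside the site's own domain.
// `resourceUrls` is just an array of absolute URLs, however they were collected.
function thirdPartyHosts(
  resourceUrls: string[],
  firstPartyDomain: string,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const raw of resourceUrls) {
    const host = new URL(raw).hostname;
    // Treat the site's own domain and its subdomains as first-party.
    const isFirstParty =
      host === firstPartyDomain || host.endsWith("." + firstPartyDomain);
    if (!isFirstParty) {
      counts.set(host, (counts.get(host) ?? 0) + 1);
    }
  }
  return counts;
}
```

In a browser, the list of URLs could come from `performance.getEntriesByType("resource").map((e) => e.name)`. Note that a host appearing in the result is not necessarily a third party in the business sense: first-party content served from a separate domain will show up here too.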
At best, an outage such as this will mean that some of those third parties aren’t able to deliver their usual service.
We looked at the effect on a number of the UK’s top retail sites (using our synthetic monitoring service). In the example shown below, requests for the Twitter widget script were timing out (data start times are represented by the blue areas in this graph, with object size represented by the grey area).
Fortunately, third-party single points of failure (SPOFs) are now rare. The above script, for example, was loaded asynchronously, so it did not block the rest of the page from rendering. Visitors would have seen the loading spinner for a long time and would have missed out on content from Twitter, but the rest of the content on the page was delivered without a problem.
Of course, SPOFs aren’t limited to third-party content. First-party content is often hosted in a number of different places – for example, when a subsidiary company has its own infrastructure, but delivers certain content from the parent organisation.
One retail website lost a number of key images during last night’s outage – the following graph shows how one of them was unavailable for around three hours:
Another had to make do without its custom fonts:
This would have been a problem in those browsers that wait for a custom font to load before displaying the text it applies to. The potential for fonts to become, practically speaking, a SPOF is also a compelling reason for browsers to adopt the font-display CSS property. Broadly, this allows the site owner to decide whether the browser should wait for a font to load or show a ‘flash of unstyled text’ in a fallback font.
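To make the font-display option concrete, here is a small sketch that builds an `@font-face` rule as a string. The `fontFaceRule` helper, the family name and the font URL in the usage note are placeholders, and the rule assumes a WOFF2 file; only the `font-display` descriptor and its keyword values come from the CSS specification.

```typescript
// Keyword values defined for the font-display descriptor.
type FontDisplay = "auto" | "block" | "swap" | "fallback" | "optional";

// Build an @font-face rule as text. "swap" asks the browser to show fallback
// text straight away and swap the custom font in if and when it arrives;
// "block" asks it to hide the text for a short period while it waits.
function fontFaceRule(
  family: string,
  url: string,
  display: FontDisplay,
): string {
  return [
    "@font-face {",
    `  font-family: "${family}";`,
    `  src: url("${url}") format("woff2");`, // assumes a WOFF2 font file
    `  font-display: ${display};`,
    "}",
  ].join("\n");
}
```

For example, `fontFaceRule("Brand Sans", "https://fonts.example.com/brand.woff2", "swap")` produces a rule under which text stays readable in a fallback font even if the font host never responds.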
3) Be ready to react
Knowing there’s a problem is one thing. Being able to do something about it is quite another. For instance, temporarily removing the reliance on custom fonts in the case above might have been a good stopgap, protecting the user experience.
We did see an example of one third-party provider apparently pointing an asset’s URL to a new location. This is evidenced in the graph below by the relative brevity of the outage (represented by the long data start times in blue), after which there is a small change in the size of the object (shown in grey).
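We can’t tell from the outside how that provider made the switch, but for assets you control, the same idea can be sketched as a simple rewrite from a failing host to a mirror. The `repointAsset` helper and every hostname in the usage note are hypothetical, and the sketch assumes the mirror serves the same paths as the original host.

```typescript
// Rewrite an asset URL to a mirror if its host is listed as failing.
// `failover` maps a failing hostname to the hostname that should serve instead.
function repointAsset(
  assetUrl: string,
  failover: Map<string, string>,
): string {
  const url = new URL(assetUrl);
  const mirror = failover.get(url.hostname);
  if (mirror !== undefined) {
    url.hostname = mirror; // path and query string are left untouched
  }
  return url.toString();
}
```

With a map of `"assets.example-cloud.com"` to `"mirror.example.org"`, a call such as `repointAsset("https://assets.example-cloud.com/img/hero.jpg?v=2", failover)` would return the same path on the mirror host, while URLs on other hosts pass through unchanged. The hard part in practice is having the mirror populated before the outage, which is exactly what a contingency plan is for.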
While many UK retail sites coped well with the outage, thanks to asynchronously loaded third-party scripts, it looks as though a few were still caught out. It’s therefore important for site owners to understand where their single points of failure lie and to have contingency plans in place in the event of an outage.