Nine months of hard work on something you’ll hopefully never notice

In some ways, we’re like the people who keep the lights on or who make sure the supermarkets are stocked with food. We work very hard so that our customers can take the smooth running of our services for granted. And as long as we do our job well, you’ll never notice the work that goes on behind the scenes to keep our systems as resilient and as reliable as possible.

So exactly what does go on behind the scenes?

One good example is a recently completed nine-month project to migrate our Performance Monitoring service to a new platform. We’ve just deployed the fourth generation of a highly available, dedicated environment that delivers 24×7 testing of our customers’ websites.

As with the previous platform, the new one is designed to support modular failover from a primary data centre in Docklands to a DR site outside the London metropolitan area, protecting us from failures of internet connectivity or server hardware, and from major incidents in either data centre.

Why did we need the new platform?

This one’s easy to answer once you know just how much changed during the lifetime of its predecessor:

•   The number of web pages downloaded increased from 1.7 million to almost 3 million a day.
•   Bandwidth utilised increased from 40Mbps to 180Mbps.
•   Total content downloaded increased from 600GB to 1.8TB per day.

This rate of change is part of what motivated us to move to the new, more robust platform, and we’ve planned for a further threefold increase in monitoring activity over its lifetime.
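To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. It works out the growth factor for each metric quoted above and what a further threefold increase would look like. The names and structure are purely illustrative, not anything from our tooling. It assumes that "almost 3 million" can be rounded to 3 million, that 1.8TB is 1,800GB in decimal units, and that the planned threefold increase applies evenly to each metric, which in practice it may not.

```python
# Figures quoted in the post: (start, end) over the previous platform's lifetime.
# Assumptions: "almost 3 million" is rounded to 3.0e6, and 1.8TB = 1800GB (decimal).
METRICS = {
    "pages_per_day": (1.7e6, 3.0e6),
    "bandwidth_mbps": (40, 180),
    "content_gb_per_day": (600, 1800),
}

# The post plans for a further threefold increase; applying it uniformly
# to every metric is a simplifying assumption.
PLANNED_GROWTH = 3


def growth_factor(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def projected(end, factor=PLANNED_GROWTH):
    """Value the new platform would need to handle after further growth."""
    return end * factor


summary = {
    name: {
        "growth": round(growth_factor(start, end), 2),
        "projected": projected(end),
    }
    for name, (start, end) in METRICS.items()
}

if __name__ == "__main__":
    for name, row in summary.items():
        print(f"{name}: grew x{row['growth']}, plan for {row['projected']:,.0f}")
```

On these assumptions, bandwidth grew the fastest of the three (4.5 times), and a threefold increase from today's levels would mean planning for roughly 9 million pages and 5.4TB of content a day.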

How did we do it?

A great deal of effort went into the specification of both the hardware and software for the platform, and we’re confident we’ve built a solution that is scalable, upgradeable and as resilient as we can make it. We’ve invested heavily in hardware, and we’ve worked hard with our suppliers to ensure they deliver the ‘best bang for our buck’. We’ve made efficiency improvements through the extensive use of blade servers and switches, and greater throughput is possible thanks to the move to a 10Gb/s network across all devices and between data centres.

We’ve also spent hundreds of hours testing database performance, virtualisation options, software compatibility and failover scenarios. At the same time, we’ve been ensuring complete automation of server build, configuration and software deployment, using open-source tools such as Cobbler and Puppet. CheckMK will be deployed and configured to handle all of our internal monitoring, alerting our Operations team to any incidents around the clock. A fully integrated out-of-band management capability also exists across the entire platform. Managed centrally, it enables features such as automatic ordering of spare hardware under warranty in response to detected failures, as well as dashboard views of the hardware and network status of all connected devices. Finally, we’ve migrated all of our source code from SVN to a Git repository and now generate RPMs for the majority of our software, to ease deployment and ensure consistency of releases.

What difference will it make?

The short answer is that customers shouldn’t notice any difference. Part of what’s always set us apart is our stable, reliable testing platform. The change is really about ensuring that we can continue to offer the same level of service well into the future. But if you’d like to know more about the detail of this or any other service upgrade, please check the forum (and post your question if the answer’s not already there).

Simon Austin

Simon Austin is the Technical Director at NCC Group Web Performance, taking overall responsibility for the development, delivery and support of its services. This will be the third upgrade of this particular platform he’s overseen since joining the company (then Site Confidence) back in December 2001, when his role was to provide technical sales to a fledgling start-up. The company has come a long way since those heady days of the dotcom boom and bust, seeing dramatic increases in the scale, capability and reliability of its platforms to keep pace with the growth in customer activity and in online activity in general.
