55% of sites don’t use Cache-Control: max-age

Whilst referencing the excellent HTTP Archive recently I came across this rather interesting little stat – http://httparchive.org/interesting.php#max-age.
[Image: sc-cache1]
Over 50% of the assets served by the sites crawled by the archive do NOT set a Cache-Control: max-age value.
My initial thought was that this is pretty shocking really.
When I actually thought about it further, however, I realised that in all of the Health Checks that we run for customers, this is something that comes up time and time again.
One of the most common recommendations that we make to site owners / developers is to set a ‘far future’ expiration date for static content, or content that is unlikely to change for some time.
The benefits are twofold (and well documented) – repeat visitors get a much faster experience and the site saves on bandwidth costs because it’s serving less content.
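As a quick way to spot the issue on your own pages, here is a minimal sketch using Python’s requests library. The asset URLs and the 30-day threshold are illustrative assumptions, not a definitive audit tool:

```python
# Quick audit of Cache-Control: max-age on a few static assets.
# The URLs below are placeholders; swap in your own site's assets.
import re
import requests

ASSET_URLS = [
    "https://www.example.com/css/main.css",
    "https://www.example.com/js/app.js",
    "https://www.example.com/images/logo.png",
]

FAR_FUTURE = 30 * 24 * 3600  # treat anything >= 30 days as 'far future'

for url in ASSET_URLS:
    resp = requests.head(url, allow_redirects=True, timeout=10)
    cache_control = resp.headers.get("Cache-Control", "")
    match = re.search(r"max-age=(\d+)", cache_control)
    if not match:
        print(f"{url}: no max-age set ({cache_control or 'no Cache-Control header'})")
    elif int(match.group(1)) < FAR_FUTURE:
        print(f"{url}: max-age={match.group(1)} (shorter than 30 days)")
    else:
        print(f"{url}: far-future max-age={match.group(1)}")
```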
So why is this value so high?  Should we be surprised that it IS so high?
It possibly serves to highlight a lack of knowledge or understanding around cache control in general.  We often see it (it’s fair to say that over 50% of the health checks that we have completed have made this recommendation in some form or another), so perhaps further education is the solution?
Given all of the ‘noise’ generated by WPO in recent months I was sure that this number would have reduced considerably since Steve first posted this.
It’s reduced by 1%.

Why Histograms are cool!

Quite a while ago we added Histograms to our monitoring portal, and I’ve found them incredibly useful ever since!
I was recently talking to a prospect (who didn’t have as much of a passion for stats as we do at SC Towers) about the value of Histograms, and they were questioning in what situations they should be using them.  We had been talking about longer-term analysis of performance data, specifically identifying when and to what degree their page performance varied from the average.
Quite naturally, they had been using the Daily Average report, showing mean, min and max values.  This report is useful, but the one thing it doesn’t allow you to investigate is the spread of the test results that strayed beyond the mean.
Let me give you an example.
Using the Daily Average Speed report to look at performance data over a 7 day period gives you the following:
[Image: hist1]
This gives a good indication of the mean variation throughout the week, and shows us just how bad (by bad I mean how far from the average) some of the results have been, but it doesn’t tell us how many of the results were at or near the max.
By using a Histogram we can determine exactly what the spread of results is.  The histogram below was generated for the same page monitor over the same time period.
[Image: hist2]
Here we can see that although the majority of test results fall into the 8-9 second ‘silo’, which is in line with the average reported above, almost 40% of the total number of results returned are outside of this average (either because the page returned quicker than expected or because it was much slower to respond with a full page load).
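If you want to reproduce this sort of breakdown from raw test results, a minimal sketch might look something like the following (the load times here are made up purely for illustration):

```python
# Bin page load times (in seconds) into one-second 'silos' and report how
# many results fall outside the most common silo. The load_times list is
# made up for illustration.
from collections import Counter

load_times = [8.4, 8.7, 9.2, 8.1, 12.6, 7.3, 8.9, 8.5, 14.1,
              8.2, 6.8, 8.6, 10.4, 8.8, 8.3, 11.0, 8.5, 7.9]

silos = Counter(int(t) for t in load_times)        # e.g. 8 means the 8-9s silo
modal_silo, modal_count = silos.most_common(1)[0]
outside = len(load_times) - modal_count

print(f"Most common silo: {modal_silo}-{modal_silo + 1}s ({modal_count} results)")
print(f"Outside that silo: {outside} of {len(load_times)} "
      f"({100 * outside / len(load_times):.0f}%)")
for silo in sorted(silos):
    print(f"{silo:>2}-{silo + 1:<2}s | {'#' * silos[silo]}")
```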
The gamma distribution in the histogram below is much more severe, but could quite easily be produced using the same or similar data as the Average Speed report at the top of this post.
[Image: hist4]
This shows us that averages can hide a lot of sins!
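A tiny numeric illustration of that point: two made-up sets of load times with the same mean but very different spreads.

```python
# Two made-up sets of load times with the same mean but very different
# spreads; the average alone would make them look identical.
from statistics import mean, pstdev

steady = [8.4, 8.6, 8.5, 8.7, 8.3, 8.5, 8.6, 8.4]
spiky = [4.0, 13.0, 8.5, 3.5, 14.0, 8.5, 12.5, 4.0]

for name, times in (("steady", steady), ("spiky", spiky)):
    print(f"{name}: mean={mean(times):.1f}s, min={min(times):.1f}s, "
          f"max={max(times):.1f}s, stdev={pstdev(times):.1f}s")
```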
So, why are histograms cool?  Well, in my mind anything that gives me some extra information is cool.  So the fact that these histograms have shown me the variance in these test results is a good thing.
They’re also cool because they can show things like this.  The Histogram below is definitely what you don’t want to see your platform producing, as it highlights massive performance variations, in this case at different times of the day.  During the quiet periods the site is relatively stable, but as things get busier the site starts to slow down, resulting in the ‘rock horns’ effect.
[Image: hist3]
[Image: hist5]
Some Rock Horns!
And anything that makes the rock horns is cool by me!

80/20 Follow-up observations

A few weeks ago we looked at the 80/20 argument as presented by Steve Souders, and reasoned that the backend time might actually be greater if you take the response times into account for every object on the page, rather than just the initial request.
Of course, as Pat Meenan rightly pointed out, it should be about understanding the full picture (frontend and backend) of your own performance.  To my mind that full picture extends to identifying performance bottlenecks across the entire page download.
In last week’s blog post we showed how the backend response times can amount to a significant number, skewing the 80/20 assumption.  This quick calculation was based on those backend times being consistent.
We have observed that those requests are not always consistent, and that variance at the object level has a knock-on effect on total page download times and therefore the user experience.
 
In the example below, the data start time for a single static asset served on a simple page varies between 0.20 and 0.40 seconds.  This may not look like much, but given that (for example) the average number of images served per page is currently 54 (http://httparchive.org/trends.php#bytesImg&reqImg), this variance can start to mount up.
[Image: sc8020-1]
 
In this instance we can see that the data start time (and, to a lesser extent, the SSL connection time, although at least that measurement is consistent) is causing the slow object download time.
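To put the earlier point about 54 images per page into rough numbers, here is a back-of-the-envelope sketch. The per-object delay and connection count are assumptions, not measurements:

```python
# Back-of-the-envelope: how much a small per-object delay in data start time
# could add to a page with many images. All figures here are assumptions.
images_per_page = 54           # HTTP Archive average quoted above
extra_delay_per_image = 0.20   # seconds of extra data start time (0.40 - 0.20)
parallel_connections = 6       # typical browser connections per hostname

# If the images were fetched one after another, the worst case would be:
serial_penalty = images_per_page * extra_delay_per_image

# With parallel connections the delays overlap, so a rough estimate is:
parallel_penalty = (images_per_page / parallel_connections) * extra_delay_per_image

print(f"Serial worst case: +{serial_penalty:.1f}s")
print(f"With {parallel_connections} parallel connections: roughly +{parallel_penalty:.1f}s")
```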
 
Sometimes the cause of the variation isn’t quite as clear however.
In the two waterfall graphs below, some of the components on the page seem to take a varying amount of time to download in each test.  For example, at times bannerInfo-bg.png takes 1.2 seconds, yet earlier in the day it took only 0.102 seconds.
Sample Waterfall 1
[Image: sc8020-2]
Sample Waterfall 2
[Image: sc8020-3]
We can also recreate this in the real world using HTTP Watch.
[Image: sc8020-4]
First test
[Image: sc8020-5]
Second test

 
By far the largest variation is in the Receive time – a difference of over 0.647 seconds to download an object that is under 3 KB.  What could be causing this variation?
So this has now become a great example of when it would be useful to start cross-referencing internal measurements against external performance monitoring data – for example, using APM solutions to determine how long the application takes to start sending the data, or network tools to understand why some objects take longer than expected to be transmitted.
Armed with this end-to-end view we can build up a picture and some understanding of the performance of the systems serving nominally static content, and whether it is truly variable in and of itself or whether it appears to be affected by unfavourable external conditions such as the poor performance of intervening networks.
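As a starting point on the external side of that picture, a small sketch like the one below can flag which objects show the most run-to-run variation. The timing samples here are hypothetical; in practice they would come from your monitoring data:

```python
# Flag objects whose download timings vary a lot between repeated tests.
# The samples dict is hypothetical; in practice the timings would come from
# your monitoring service's API or exported waterfall data.
from statistics import mean, pstdev

samples = {
    "bannerInfo-bg.png": [0.102, 1.200, 0.150, 0.980, 0.120],
    "main.css":          [0.090, 0.095, 0.088, 0.101, 0.093],
    "app.js":            [0.210, 0.230, 0.700, 0.215, 0.225],
}

for obj, times in samples.items():
    spread = max(times) - min(times)
    cv = pstdev(times) / mean(times)   # coefficient of variation
    flag = "  <-- inconsistent" if spread > 0.5 or cv > 0.5 else ""
    print(f"{obj:<20} min={min(times):.3f}s max={max(times):.3f}s "
          f"spread={spread:.3f}s cv={cv:.2f}{flag}")
```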
Although the real-world test was conducted using a local office network where there could be local contention, it has been repeated countless times in the ‘clean room’ environment of the monitoring service as well, so we should be able to discount local causes as the reason for the variations.
The important takeaway from all of the examples above is that you need to have full visibility of the entire application delivery process, and that without accurate measurements you have no idea of a) the current situation and b) whether optimisation work, bug fixing or general maintenance is having a positive or negative impact.

Whilst there is a danger of getting carried away with just the numbers (one of Deming’s “7 Deadly Diseases of Management” was to run a company on visible figures alone – http://en.wikipedia.org/wiki/W._Edwards_Deming), measurement has to be one of the cornerstones of good web performance management.