For a while I've wanted to write a post about Google Analytics sampling. You know, the dreaded message that appears at the top of your reports:
This message shows up when you work with a dataset that contains more than 500,000 visits or more than 1,000,000 items (keywords, URLs, etc.). Above that threshold, Google takes a sample of those visits to calculate the numbers for your reports. But what is acceptable? In this example Google uses 30.62% of all visits to guess what the other 70% did on my site...
Example 1: very large site, 1% sample size
Imagine you want to analyse a page for a cool new product you launched a while ago. You open up the "All Pages" report and search for that page:
Well, the numbers look a bit odd, don't they? No bounces, no entrances, no exits, and all visitors looked at the page once per visit (pageviews match unique pageviews exactly). Let's raise the sample size to "Higher precision":
Now I see these numbers:
Suddenly we have entrances, plus a bounce rate and an exit rate, for exactly the same page. I'm glad to see 321 people found their way to this page through an external source. But what would happen if I could raise the sample size to 1,000,000? I'll never know.
Example 2: medium site, 30% sample size
Let's repeat the exercise from the example above. I looked for a specific URL and I see this:
Now we raise the sample size to "Higher precision" again:
As you can see, the numbers are much closer to each other.
What sample size is acceptable?
The big question here is: where do you stop trusting the numbers? Google says that "needle in a haystack" analysis is difficult once you hit the sampling threshold, and I agree. A 30% or 60% sample size seems pretty reliable; 1% does not. The absolute minimum lies somewhere in between. Perhaps a good statistician can shed some light on this?
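In the meantime, here is a rough back-of-the-envelope sketch in Python of why a small sample hurts a rarely visited page so much. It assumes every visit has the same independent chance of ending up in the sample (simple random sampling), which may not be how Google Analytics actually picks its sample. The function name and the example numbers are my own, for illustration only.

```python
import math


def relative_standard_error(true_count: int, sampling_rate: float) -> float:
    """Approximate relative standard error of a count scaled up from a sample.

    Assumption: each visit is kept independently with probability
    `sampling_rate`. This is a textbook model, not Google's documented method.
    """
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    # The number of matching visits in the sample is Binomial(true_count, rate).
    # Dividing that by the rate gives the scaled-up estimate, whose variance is
    # true_count * (1 - rate) / rate.
    variance = true_count * (1 - sampling_rate) / sampling_rate
    return math.sqrt(variance) / true_count


if __name__ == "__main__":
    # Hypothetical page with 321 matching visits, at three sampling rates.
    for rate in (0.01, 0.30, 0.60):
        error = relative_standard_error(321, rate)
        print(f"sample {rate:.0%}: relative error about {error:.0%}")
```

Under these assumptions, a metric that truly counts about 321 visits comes back with an uncertainty of roughly 55% at a 1% sample, against roughly 9% at a 30% sample. The error also shrinks as the true count grows, which is why site-wide totals can look fine while a single page looks odd.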
To overcome these problems there are a few solutions:
- Download your data through the Google Analytics API and store it in your data warehouse (if you have one). That gives you some more reliability in day-to-day analysis.
- Upgrade Google Analytics to its paid version: Premium
- Create multiple trackers per subfolder with the setCookiePath command. That way you have an account per subfolder that won't hit the sampling threshold as soon as the main profile.
Any statisticians want to comment? What is the lowest acceptable sample size?
Have any web analysts found, or do you use, other solutions?
How do you deal with sampling?