The other day I was in a meeting looking at sales numbers. We were comparing the performance of two different queues of leads, and someone said something like “Clearly, Queue A is giving us a better yield.”
The numbers were pretty small and we only had a week’s worth of data. “I don’t know,” I said, “is that difference really statistically significant?”
“Of course it is. It’s twice as big.”
In the moment I let it go, but I knew that “twice as big” wasn’t a true assessment of statistical significance. Let’s say there were only 4 names in each queue. If Queue A gave us 1 lead and Queue B gave us 2 leads, then “twice as many” wouldn’t feel like a conclusion. Conversely, if there were 1000 names in each queue, and one gave us 400 leads while the other gave us 800 leads, then we could comfortably draw the conclusion. But what about all the in-betweens?
All of this drove me to do some research on what “statistical significance” really means. As I’ve been learning, much of marketing is actually pretty numbers-driven, so knowing what numbers “matter” is important to making good decisions.
Here’s the first definition that came up, from Wikihow. “Statistical significance is the number, called a p-value, that tells you the probability of your result being observed, given that a certain statement (the null hypothesis) is true. If this p-value is sufficiently small, the experimenter can safely assume that the null hypothesis is false.”
OK, first I teased out what “null hypothesis” means: it’s the baseline that assumes the variable has no impact, or that there is no difference between two populations. In my example, the null hypothesis would be that the yield of Queue A is the same as the yield of Queue B. Under that hypothesis, any difference we see in their yields is due purely to randomness.
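To make the null hypothesis concrete, here is a minimal Python sketch that asks: if both queues really had the same yield, how often would chance alone hand one queue at least as many of the leads as we saw? It uses Fisher’s exact test, which is my pick for illustration and not something that came up in the meeting or in the wikiHow definition. The function name one_sided_exact_p is made up for this example, and the inputs are the two extreme cases from earlier in the post.

```python
from math import comb


def one_sided_exact_p(leads_low, size_low, leads_high, size_high):
    """One-sided Fisher exact p-value (hypothetical helper for this post).

    Assumes the null hypothesis: both queues share one true yield, so the
    observed leads are spread across the two queues purely by chance.
    Returns the probability that the "high" queue ends up with at least
    leads_high of the leads.
    """
    total_leads = leads_low + leads_high
    total_names = size_low + size_high
    # Number of ways to choose which names became leads, ignoring queues.
    all_splits = comb(total_names, total_leads)
    most_high_can_get = min(size_high, total_leads)
    # Count the splits that favor the "high" queue at least as much as observed.
    extreme_splits = sum(
        comb(size_high, k) * comb(size_low, total_leads - k)
        for k in range(leads_high, most_high_can_get + 1)
    )
    return extreme_splits / all_splits


# The two extremes from the story: tiny queues, then big queues.
print(one_sided_exact_p(1, 4, 2, 4))            # 1 lead vs 2, 4 names each
print(one_sided_exact_p(400, 1000, 800, 1000))  # 400 leads vs 800, 1000 names each
```

With 4 names per queue the answer comes out to 0.5, a coin flip, so “twice as many” tells us nothing. With 1000 names per queue the probability is vanishingly small, which matches the gut feeling that the difference is real.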
As I read more, there seem to be three interrelated concepts that feed into whether something is significant:
Sample Size: How many activities are we looking at?
How big is each of the queues?
Confidence Range (also called the confidence interval): How precisely have we homed in on the conclusion?
We’re comfortable saying that if the yields are within 5% of each other then they are, for practical purposes, the same.
Confidence Level: How sure are we about our conclusion?
We’re 90% sure that the difference in yields we observed reflects a real difference in the larger population (and thus that we should or shouldn’t make a decision based on it).
So the way these interact is: for a given sample size, if you want a higher confidence level (i.e., to be more sure of your conclusion) then you have to accept a wider confidence interval (i.e., accept a greater range, like 8% rather than 5%). To make that interval narrower, you need a larger sample size.
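The three-way trade-off can be sketched with the textbook margin-of-error formula for a proportion, z * sqrt(p * (1 - p) / n). This is a rough illustration that assumes the normal approximation holds (which it does not for very small queues). The function name margin_of_error and the 50% yield are placeholders I chose, not figures from our sales data.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(yield_rate, sample_size, confidence_level):
    """Half-width of a normal-approximation confidence interval (a sketch)."""
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return z * sqrt(yield_rate * (1 - yield_rate) / sample_size)


# Assumed 50% yield; vary the sample size and the confidence level.
for n in (100, 400):
    for level in (0.90, 0.95):
        moe = margin_of_error(0.5, n, level)
        print(f"n={n}, confidence={level:.0%}: +/- {moe:.1%}")
```

Raising the confidence level widens the interval, and quadrupling the sample size cuts it in half, since the sample size sits under a square root.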
By the usual convention, statistical significance means that we are at least 95% sure that the results are due to the nature of the different populations, not to randomness.
Getting back to our example, the thing we’d be testing is “Is the yield from Queue A greater than that of Queue B?” We’d define “greater than” by calculating the confidence interval around each result and checking that the two intervals don’t overlap. And we’d need a big enough sample size to be 95% sure that these results were repeatable.
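Here is a small sketch of that overlap check, again leaning on the normal approximation, which is an assumption and is shaky for queues as small as 4 names. The helper names yield_interval and intervals_overlap are invented for this example, and the inputs are the two extreme cases from earlier in the post.

```python
from math import sqrt
from statistics import NormalDist


def yield_interval(leads, names, confidence_level=0.95):
    """Normal-approximation confidence interval for a yield, clamped to [0, 1]."""
    rate = leads / names
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z * sqrt(rate * (1 - rate) / names)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


def intervals_overlap(first, second):
    """True when two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Tiny queues (1 vs 2 leads out of 4) and big queues (400 vs 800 out of 1000).
small = intervals_overlap(yield_interval(1, 4), yield_interval(2, 4))
large = intervals_overlap(yield_interval(400, 1000), yield_interval(800, 1000))
print(small, large)  # True False
```

The tiny queues overlap and the big ones don’t. One caveat: requiring no overlap is a stricter bar than testing the difference directly, so the two approaches can disagree on borderline cases.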
One thing that helped me understand this more concretely is this calculator provided by KISSMetrics.
Let’s say I have a sample size of 100 names in each queue.
If Queue A yields 90 and Queue B yields 80, then with 98% certainty we can say Queue A is better.
But let’s say Queue A yields 40 and Queue B yields 50. Then the certainty that Queue B is better is only 92%, which falls short of the 95% bar and so is not considered “statistically significant.” We’re only 92% sure that this is not due to randomness.
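I don’t know exactly what the calculator does under the hood. A one-sided two-proportion z-test with a pooled yield (my assumption, not something the calculator documents here) gives figures that round to the same percentages, assuming 100 names in each queue. The function name certainty_first_is_better is made up for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def certainty_first_is_better(leads_first, names_first, leads_second, names_second):
    """One-sided two-proportion z-test, reported as a "certainty" (a sketch)."""
    rate_first = leads_first / names_first
    rate_second = leads_second / names_second
    # Pooled yield: the single shared rate that the null hypothesis assumes.
    pooled = (leads_first + leads_second) / (names_first + names_second)
    std_error = sqrt(pooled * (1 - pooled) * (1 / names_first + 1 / names_second))
    # How many standard errors apart the two observed yields are.
    z = (rate_first - rate_second) / std_error
    return NormalDist().cdf(z)


print(f"{certainty_first_is_better(90, 100, 80, 100):.0%}")  # 90 leads vs 80
print(f"{certainty_first_is_better(50, 100, 40, 100):.0%}")  # 50 leads vs 40
```

The same 10-lead gap clears the 95% bar in the first case and misses it in the second, because yields near 50% are noisier than yields near 90%.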
My take-home conclusion brings me back to where I started: it’s not always obvious whether something is “statistically significant” without doing some serious math.