We have been told in no uncertain terms by experts in The American Statistician (2016), Nature (2019) and now Significance magazine that p-values don’t matter (much). This is slightly awkward when we have only just managed to convince various partners in applied energy demand projects that robust sample design, recruitment and statistical analyses are crucial to making generalisable inferences and evidence-based decisions. It is awkward because in the energy research ecosystem we inhabit, old-money ‘statistical significance p-values’ are still the truth indicators of choice.

In general we have found it hard enough to get across the value of (stratified) random sampling (*no self selection please*) and random allocation to trial groups (*no systematic bias or regression to the mean please*) never mind challenging the ‘statistical significance p-values’ mental model. As a result, we have rarely scaled the apparently complicated crags of effect sizes and confidence intervals in these conversations. But we know we must do better and we know we have to take our practitioners with us because **this really does matter to their decision-making**…

Why? Because, as all of the statistical articles have taken pains to point out, p-values are simply not enough. Not enough in empirical science but also not enough in deciding which energy demand intervention might be a good investment.

But changing mental models and expectations is hard so, sometime after we’ve done the ‘sampling talk’ at project design stage, we run a thought experiment to demonstrate the utility of effect sizes and confidence intervals *without mentioning p-values at all*. The experiment goes something like this.

## Before we start

We begin with the idea that trials should be designed (sized) to test an effect size that a business case has calculated would be needed for an intervention to be worthwhile.

- Suppose ElecCo calculated that installing an Internet of Things ‘smart demand response management’ widget in each customer dwelling would need to reduce mean power demand in the winter evening peak period (16:00 – 20:00) by 20% for it to be worthwhile (cost of install vs benefit to network etc);
- So we’d need to design a trial to test if an effect size of 20% can be achieved. There is no point designing it to test 10% as the business case falls at this effect size anyway;
- This means we need to do appropriate statistical power analysis to make sure our sample size is up to the task;
- Suppose we did this and ran the trial. Now:
- If the trial showed an effect size (reduction) of 26% with (say) a 95% Confidence Interval of 21%-31% then ElecCo is happy – if they repeated the trial 100 times then they’d expect the widget to beat the business case threshold 95% of the time (not quite what a CI means but it will do for these purposes…);
- If the trial showed 10% (95% CI 5%-15%) then ElecCo is sad (the widget didn’t do what they needed) but happy because the trial has stopped them wasting money on a widget and all its associated service costs that won’t make their business case work;
- If the trial showed 18% (95% CI 13%-23%) then ElecCo would need to make a risk-based call: the point estimate falls below their 20% business case threshold, but the 95% CI shows the threshold might still be reached. In general, this is the usual situation for real-world decision making.
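The power analysis step above can be sketched as a normal-approximation sample size calculation. The baseline figures (1 kW mean winter-peak demand, 0.5 kW standard deviation) are illustrative assumptions, not ElecCo data:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sample, two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    d = delta / sd                                 # standardised effect (Cohen's d)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Illustrative: 1 kW mean peak demand, 0.5 kW SD, and the business-case
# effect of a 20% (i.e. 0.2 kW) reduction.
print(n_per_group(delta=0.2, sd=0.5))  # 99 dwellings per trial group
```

Note that halving the detectable effect roughly quadruples the required sample, which is exactly why there is no point designing the trial around an effect size the business case cannot use.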

So the question ElecCo needs to answer before designing a trial is: *what would that business case % look like for the widget they have in mind*? This is not always an easy question to answer.

At this point ElecCo colleagues tend to ask “well… err… OK, but what happened to statistical significance?” Our response is twofold:

- We note that because none of the 95% CIs in the three results scenarios above includes 0%, they would all be considered ‘statistically significant’ in old p-value money. But…
- **Exactly how useful would that have been for a decision maker?**

Then, to bring the conversation back to where we want it, we extend the thought experiment one possible result further:

- If the trial showed 4% (95% CI -1% to 9%) then we see that our 95% CI includes 0. In old p-value money we would fail to reject the null hypothesis that the widget had no effect (p > 0.05). We also have an effect that is of no practical business use.
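The duality between this CI and the p-value can be shown by working backwards from the reported interval. This is a sketch assuming a normal approximation, using only the numbers from the thought experiment:

```python
from statistics import NormalDist

effect, ci_low, ci_high = 4.0, -1.0, 9.0  # % reduction and its 95% CI
z_crit = NormalDist().inv_cdf(0.975)      # 1.96 for a 95% interval
se = (ci_high - ci_low) / (2 * z_crit)    # back out the standard error (~2.55)
z = effect / se                           # test statistic against a zero effect
p = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value

print(round(p, 3))  # 0.117 -- above 0.05, just as the CI spanning 0 implies
```

A 95% CI that excludes 0 and a two-sided p below 0.05 are the same statement in different clothes, which is why reporting the CI loses nothing.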

## When the results are in

Following directly from the thought experiment, our friends at ElecCo are now conditioned to expect us to report three things:

- average effect size: *what is the average bang for buck*?
- effect size confidence intervals: *how uncertain is the bang*?
- the *p*-value: *what is the risk of a false positive*?
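All three quantities can be reported together from trial data. The sketch below uses made-up peak-demand readings and a simple two-sample z-test (normal approximation); a real analysis would follow the guidance in the paper cited below:

```python
from statistics import NormalDist, mean, stdev
import random

random.seed(42)
# Hypothetical kW readings for control vs. widget groups (not real trial data)
control = [random.gauss(1.0, 0.5) for _ in range(100)]
treated = [random.gauss(0.8, 0.5) for _ in range(100)]

diff = mean(control) - mean(treated)              # absolute reduction (kW)
se = (stdev(control) ** 2 / len(control)
      + stdev(treated) ** 2 / len(treated)) ** 0.5  # SE of the difference
z_crit = NormalDist().inv_cdf(0.975)

effect_pct = 100 * diff / mean(control)           # 1. average effect size
ci_pct = (100 * (diff - z_crit * se) / mean(control),  # 2. its 95% CI
          100 * (diff + z_crit * se) / mean(control))
p = 2 * (1 - NormalDist().cdf(abs(diff / se)))    # 3. false-positive risk

print(f"effect {effect_pct:.1f}%, "
      f"95% CI {ci_pct[0]:.1f}%-{ci_pct[1]:.1f}%, p = {p:.4f}")
```

The point is the reporting order: the effect size and its interval carry the decision-relevant information, and the p-value is the last item, not the headline.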

Then and only then do we start to talk about what we can infer from the results.

## How all this helps

We have found this thought experiment offers guidance to researchers who have to design and assess such studies; project managers who need to understand what can count as evidence, for what purpose and in what context; and decision makers who need to make defensible commercial or policy decisions based on the balance of evidence and probability.

We have also found it helps all stakeholders to distinguish the mere search for statistical significance (and other dubious practices) from the requirement for actionable evidence.

## Further reading

This blog is a (very) short summary of Ben Anderson, Tom Rushby, Abubakr Bahaj, and Patrick James. 2020. ‘**Ensuring Statistics Have Power: Guidance for Designing, Reporting and Acting on Electricity Demand Reduction and Behaviour Change Programs**’. *Energy Research & Social Science* 59 (January): 101260. https://doi.org/10.1016/j.erss.2019.101260. (open access)

This post was originally published on Ben Anderson’s personal blog.