There is a reproducibility crisis in psychology and we need to act on it

[Figure: The Müller-Lyer illusion, a highly reproducible effect. The central lines are the same length, but the presence of the fins induces a perception that the left-hand line is longer.]

The debate about whether psychological research is reproducible is getting heated. In 2015, Brian Nosek and his colleagues in the Open Science Collaboration showed that they could not replicate effects for over 50 per cent of studies published in top journals. Now we have a paper by Dan Gilbert and colleagues saying that this is misleading because Nosek’s study was flawed, and that psychology is actually doing fine. More specifically: “Our analysis completely invalidates the pessimistic conclusions that many have drawn from this landmark study.” This has stimulated a set of rapid responses, mostly in the blogosphere. As Jon Sutton memorably tweeted: “I guess it's possible the paper that says the paper that says psychology is a bit shit is a bit shit is a bit shit.”


So now the folks in the media are confused and don’t know
what to think.


The bulk of the debate has focused on what exactly we mean by reproducibility in statistical terms. That makes sense, because many of the arguments hinge on statistics, but I think it ignores the more basic issue, which is whether psychology has a problem. My view is that we do have a problem, though psychology is no worse than many other disciplines that use inferential statistics.


In my undergraduate degree I learned about stuff that was on the one hand non-trivial and on the other hand solidly reproducible. Take, for instance, various phenomena in short-term memory. Effects like the serial position effect, the phonological confusability effect, and the superiority of memory for words over nonwords are solid and robust. In perception, we have striking visual effects such as the Müller-Lyer illusion, which demonstrate how our eyes can deceive us. In animal learning, the partial reinforcement effect is solid. In psycholinguistics, the difficulty adults have discriminating sound contrasts that are not distinctive in their native language is solid. In neuropsychology, the dichotic right ear advantage for verbal material is solid. In developmental psychology, it has been shown over and over again that poor readers have deficits in phonological awareness. These are just some of the numerous phenomena studied by psychologists that are reproducible in the sense that most people understand it, i.e. if I were to run an undergraduate practical class to demonstrate the effect, I’d be pretty confident that we’d get it. They are also non-trivial, in that a lay person would not simply conclude that the result could have been predicted in advance.


The Reproducibility Project showed that many effects described in contemporary literature are not like that. But was it ever thus? I’d love to see the Reproducibility Project rerun with psychology studies reported in the literature from the 1970s – have we really got worse, or am I aware of the reproducible work just because that stuff has stood the test of time, while other work is forgotten?


My bet is that things have got worse, and I suspect there
are a number of reasons for this:


1. Most of the phenomena I describe above were in areas of psychology where it was usual to report a series of experiments that demonstrated the effect and attempted to gain a better understanding of it by exploring the conditions under which it was obtained. Replication was built into the process. That is not common in many of the areas where reproducibility of effects is contested.


2. It’s possible that all the low-hanging fruit has been plucked, and we are now focused on much smaller effects – i.e. ones where the signal of the effect is low relative to the background noise. That’s where statistics assumes importance. Something like the phonological confusability effect in short-term memory or the Müller-Lyer illusion is so strong that it can be readily demonstrated in very small samples. Indeed, abnormal patterns of performance on short-term memory tests can be used diagnostically with individual patients. If you have a small effect, you need much bigger samples to be confident that what you are observing is signal rather than noise. Unfortunately, the field has been slow to appreciate the importance of sample size, and many studies are just too underpowered to be convincing.
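
To put rough numbers on this point, here is a minimal simulation sketch in Python (using numpy and scipy, which are my choice of tools rather than anything from the post itself). The effect sizes and sample sizes are arbitrary illustrative values, not estimates from any study discussed here; the question is simply how often a two-group comparison reaches p < .05 when the true effect is very large, as with the classic phenomena, versus small.

# Illustrative power simulation (not from the original post): how often does a
# two-group t-test reach p < .05 when the true effect is large versus small?
# The effect sizes (Cohen's d) and group sizes below are arbitrary examples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def estimated_power(d, n_per_group, n_sims=5000, alpha=0.05):
    """Proportion of simulated experiments giving p < alpha for a true effect d."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)  # control group scores
        b = rng.normal(d, 1.0, n_per_group)    # experimental group, shifted by d
        _, p = stats.ttest_ind(a, b)
        hits += p < alpha
    return hits / n_sims

for d in (2.0, 0.3):          # a very large, classic-illusion-sized effect versus a small one
    for n in (10, 30, 100, 250):
        print(f"d = {d:.1f}, n per group = {n:3d}: power ~ {estimated_power(d, n):.2f}")

In a sketch like this, the very large effect is detected reliably even with ten participants per group, whereas the small effect needs samples in the hundreds before a failure to find it tells you much.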



3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but first to do adequately powered studies to demonstrate the effect and then to conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening, holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.


4. Psychology, unlike many other biomedical disciplines, involves training in statistics. In principle, this is a thoroughly good thing, but in practice it can be a disaster if the psychologist is simply fixated on finding p-values less than .05 – and assumes that any effect associated with such a p-value is true. I’ve blogged about this extensively, so won’t repeat myself here, other than to say that statistical training should involve exploring simulated datasets, so that the student starts to appreciate the ease with which low p-values can occur by chance when one has a large number of variables and a flexible approach to data analysis. Virtually all psychologists misunderstand p-values associated with interaction terms in analysis of variance – as I myself did until working with simulated datasets. I think in the past this was not such an issue, simply because it was not so easy to conduct statistical analyses on large datasets – one of my early papers describes how to compare regression coefficients using a pocket calculator, which at the time was an advance on other methods available! If you have to put in hours of work calculating statistics by hand, then you think hard about the analysis you need to do. Currently, you can press a few buttons on a menu and generate a vast array of numbers – which can encourage the researcher to just scan the output and highlight those where p falls below the magic threshold of .05. Those who do this are generally unaware of how problematic this is in terms of raising the likelihood of false positive findings.
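
As a concrete, hedged illustration of that last point (again in Python with numpy and scipy, and with an arbitrary choice of twenty outcome measures and twenty participants per group, not figures from any real study), the sketch below simulates studies in which no measure genuinely differs between groups; scanning the output for p < .05 nonetheless turns up at least one ‘significant’ result most of the time.

# Illustrative simulation (not from the original post): with many null variables,
# scanning the output for p < .05 yields a spurious 'finding' most of the time.
# The 20 outcome measures and 20 participants per group are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_vars, n_per_group = 2000, 20, 20

studies_with_a_hit = 0
for _ in range(n_sims):
    # Two groups measured on 20 outcomes, with no true group difference on any of them
    group_a = rng.normal(size=(n_per_group, n_vars))
    group_b = rng.normal(size=(n_per_group, n_vars))
    pvals = stats.ttest_ind(group_a, group_b, axis=0).pvalue
    studies_with_a_hit += (pvals < 0.05).any()

print(f"Proportion of null studies with at least one p < .05: {studies_with_a_hit / n_sims:.2f}")
# Expect roughly 1 - 0.95**20, i.e. about 0.64, before any flexible analysis is added

Each individual comparison still has its nominal 5 per cent false positive rate; it is the habit of scanning twenty of them and reporting whichever crosses the threshold that makes a spurious result the expected outcome, and flexible analysis choices only push that figure higher.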


Nosek et al have demonstrated that much work in psychology
is not reproducible in the everyday sense that if I try to repeat your
experiment I can be confident of getting the same effect. Implicit in the critique
by Gilbert et al is the notion that many studies are focused on effects that
are both small and fragile, and so it is to be expected they will be hard to
reproduce. They may well be right, but if so, the solution is not to deny we
have a problem, but to recognise that under those circumstances there is an
urgent need for our field to tackle the methodological issues of inadequate
power and p-hacking, so we can distinguish genuine effects from false
positives.






