Some thoughts on the Statcheck project
Yesterday, a piece in Retraction Watch covered a new study, in which results of automated statistics checks
on 50,000 psychology papers are to be made public on the PubPeer website.
I had advance warning, because a study of mine
had been included in what was presumably a dry run, and this led to me
receiving an email on 26th August as follows:
Assuming someone had a critical comment on this paper, I
duly clicked on the link, and did a double-take when I read the comment.
Now, this seemed like overkill to me, and I posted a rather
grumpy tweet about it. There was a bit of to and fro on Twitter with Chris Hartgerink, one
of the researchers on the Statcheck project, and with the folks at PubPeer, in which I explained why I was grumpy and they defended their approach; as
far as I was concerned it was not a big deal, and if nobody else found this odd,
I was prepared to let it go.
But then a couple of journalists got interested, and I sent them some more detailed thoughts.
I was quoted
in the Retraction Watch piece, but I thought it worth reporting my response in
full here, because the quotes could be interpreted as indicating I disapprove
of the Statcheck project and am defensive about errors in my work. Neither of
those is true. I think the project is an interesting piece of work; my concern
is solely with the way in which feedback to authors is being implemented. So
here is the email I sent to journalists in full:
I am in general a strong supporter of the reproducibility
movement and I agree it could be useful
to document the extent to which the existing psychology literature contains
statistical errors.
However, I think there are 2 problems with how this is being
done in the PubPeer study.
1. The tone of the
PubPeer comments will, I suspect, alienate many people. As I argued on Twitter,
I found it irritating to get an email saying a paper of mine had been discussed
on PubPeer, only to find that this referred to a comment stating that zero
errors had been found in the statistics of that paper.
I don't think we need to be told that - by all means report
somewhere a list of the papers that were checked and found to be error-free,
but you don't need to personally contact all the authors and clog up PubPeer
with comments of this kind.
My main concern was that, during an exceptionally busy period, this was just another
distraction from other things. Chris Hartgerink replied that I was free to
ignore the email, but that would be extremely rash because a comment on PubPeer
usually means that someone has a criticism of your paper.
As someone who works on language, I also found the pragmatics
of the communication non-optimal. If you write and tell someone that you've
found zero errors in their paper, the implication is that this is surprising,
because you don't go around stating the obvious*. And indeed, the final part of
the comment basically said that my work might well contain errors and that, even though they hadn't found any, it could not be trusted.
Now, at the same time as having that reaction, I appreciate that this was a
computer-generated message, written by non-native English speakers, that I
should not take it personally, and that no slur on my work was intended. And
I would like to know if errors were found in my stats, and it is entirely
possible that there are some, since none of us is perfect. So I don't want to
over-react, but I think that if I, as someone basically sympathetic to this
agenda, was irritated by the style of the communication, then the odds are it will stoke real hostility in those who are already dubious and inclined to describe the activities of people interested in reproducibility as 'bullying' and so on.
2. I'll be interested to see how this pans out for people in whose papers errors are found.
My personal view is that the focus should be on errors that
do change the conclusions of the paper.
I think at least a sample of these should be hand-checked so that we have some idea of the error rate of the automated check - I'm not sure whether this has been done, but the PubPeer comment certainly gave no indication of it - it basically said there's probably an error in your stats, but we can't guarantee that there is, putting the onus on the author to check it out.
If it's known that on 99% of occasions the automated check is accurate, then fine. If the accuracy is only 90%, I'd be really unhappy about the current process, as it would lead to lots of people putting time into checking their papers on the basis of an insufficiently specific diagnostic. It would also make the authors of the comments look frankly lazy: stirring up doubts about someone's work and then leaving the author to check it out.
In epidemiology the terms sensitivity and specificity are
used to refer to the accuracy of a diagnostic test. Minimally, if the sensitivity and specificity of the automated stats check are known, then those figures should be provided with the automated message.
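To make that concrete, here is a minimal sketch in Python, using made-up counts rather than any figures from the Statcheck project, of how hand-checking a sample of flagged and unflagged results would yield those two numbers:

    # Hypothetical hand-checking results, invented purely for illustration.
    # "Flagged" means the automated check reported an inconsistency;
    # "real error" means hand-checking confirmed the reported statistic was wrong.
    true_pos = 45    # flagged, and hand-checking confirmed a real error
    false_pos = 5    # flagged, but hand-checking found no error
    true_neg = 190   # not flagged, and no error found
    false_neg = 10   # not flagged, but hand-checking did find an error

    sensitivity = true_pos / (true_pos + false_neg)  # P(flagged | real error)
    specificity = true_neg / (true_neg + false_pos)  # P(not flagged | no error)

    print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")

With those invented numbers the check would have a sensitivity of about 0.82 and a specificity of about 0.97; figures of that kind, reported alongside each automated message, would let an author judge how seriously to take a flag.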
The above email was written before Dalmeet drew my attention to
the second paper, in which errors had been found. Here’s how I responded to
that:
I hadn't seen the second paper - presumably because I was not
the corresponding author on that one. It's immediately apparent that the problem is that F ratios
have been reported with one degree of freedom, when there should be two. In
fact, it's not clear how the automated program could assign any p-value in this
situation.
I'll communicate with the first author, Thalia Eley, about
this, as it does need fixing for the scientific record, but, given the sample
size (on which the second, missing, degree of freedom is based), the reported
p-values would appear to be accurate.
I have added a comment to this effect on the PubPeer site.
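For readers who want to see the mechanics of the degrees-of-freedom point, here is a minimal sketch in Python, using invented numbers rather than the values from the paper in question: an F ratio needs both a numerator and a denominator degrees of freedom before any p-value can be computed, and in a simple design the missing denominator value can be recovered from the sample size.

    # Invented values for illustration only, not those from the paper.
    from scipy.stats import f

    F_value = 9.6   # an F ratio as it might be reported
    df_num = 1      # numerator df: the only df given in the truncated format
    n = 100         # hypothetical sample size
    df_den = n - 2  # denominator df, recoverable from N (e.g. two groups in a one-way ANOVA)

    # With only df_num there is no defined p-value; with both, it follows directly.
    p_value = f.sf(F_value, df_num, df_den)
    print(f"F({df_num},{df_den}) = {F_value}, p = {p_value:.4f}")

This is why, once the sample size is known, the reported p-values can still be checked even though the denominator degrees of freedom were omitted.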
* I was thinking here of Gricean maxims, especially the maxim of relation.