In 2005 a seminal article by John Ioannidis argued that various biases in how science is conducted, such as the use of small sample sizes and an emphasis on novel, eye-catching findings, conspire to reduce the likelihood that a piece of published research is in fact correct. Since then, there has been growing interest in what has become known as the reproducibility crisis, stimulated in part by growing empirical evidence that many published research findings cannot be replicated. For example, in 2011 scientists from the pharmaceutical company Bayer reported that they were able to replicate only ~20-25% of results published in academic journals. Interest in the question of what proportion of published research findings are actually true, and whether we can do better, has grown – in 2015 the Academy of Medical Sciences in the UK held a symposium on the topic, while the House of Commons Science and Technology Committee is currently undertaking an inquiry into these issues.
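To see the statistical logic behind Ioannidis's argument, consider the positive predictive value of a "significant" finding – the chance that it reflects a real effect. The short Python sketch below is purely illustrative (the prior, power and significance threshold are assumed numbers, not figures from the article or from Ioannidis's paper), but it shows how low statistical power combined with a small proportion of true hypotheses drags that probability down.

```python
# Illustrative sketch only: how low power and a low proportion of true
# hypotheses reduce the probability that a "significant" published finding
# is actually true. All numbers below are assumptions for illustration.

def positive_predictive_value(prior, power, alpha):
    """Probability that a statistically significant result reflects a true effect."""
    true_positives = prior * power          # real effects correctly detected
    false_positives = (1 - prior) * alpha   # null effects passing the significance test
    return true_positives / (true_positives + false_positives)

# A careful field: 50% of tested hypotheses are true, 80% power, alpha = 0.05
print(positive_predictive_value(prior=0.5, power=0.8, alpha=0.05))   # ~0.94

# A field chasing novel findings with small studies: 10% true hypotheses, 20% power
print(positive_predictive_value(prior=0.1, power=0.2, alpha=0.05))   # ~0.31
```

Under the second, less flattering set of assumptions, fewer than a third of "positive" published findings would be true – which is the essence of the concern.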
Bristol University has a long-standing tradition of meta-research and evidence synthesis, methods that provide the basis for studying reproducibility. It’s therefore not surprising that we have been active in the ongoing reproducibility debate – for example, in 2013 we published an analysis suggesting that many studies in the neuroscience literature may be too small to provide reliable results (a finding we recently replicated across a wider range of biomedical disciplines). In collaboration with economists, we also showed that the current peer review system may lead to ‘herd behaviour’, which increases the risk that scientists will converge on an incorrect answer. Since then, there has been growing interest in reproducibility from funders and journals (although, interestingly, less from institutions), and increasing acceptance of the possibility that science may not be functioning optimally. What is at the heart of the problem?
Part of the issue may be that scientists are human, and however well trained and well motivated they are, they remain subject to the same cognitive biases we are all prone to, such as confirmation bias and hindsight bias. We also want our work to be recognised, through publication and the award of grants, which in turn bring personal career advancement and esteem. To what extent do these pressures, and the incentive structures scientists work within, shape our behaviour?
There is some evidence that they do – we have found that studies conducted in the US, where academic salaries are often not guaranteed unless grant income is generated, tend to over-estimate effects compared to studies conducted outside the US (a result later replicated, at least for the “softer” biomedical sciences). Even more worryingly, studies published in journals with a high Impact Factor – a metric widely used as a proxy for quality – also seem to be more likely to over-estimate effects. Indeed, Impact Factor appears to correlate more strongly with the likelihood of retraction than with the number of citations an article receives. Together with colleagues in the biological sciences, we modelled the current scientific “ecosystem” and showed that incentives which prioritise a small number of novel, eye-catching findings (i.e., those published in journals with a high Impact Factor) are likely to result in studies that are too small and therefore give unreliable findings. This strategy is optimal for career advancement in the current system, but not optimal for science.
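The mechanism behind these over-estimates, often called the "winner's curse", can be illustrated with a toy simulation. This is not the ecosystem model described above – the true effect size, sample size and selection rule are all assumed purely for illustration – but it shows how publishing only the small studies that reach statistical significance inflates the effects that appear in print.

```python
# Toy simulation (assumptions mine, not the modelling described in the text):
# when only statistically significant results from small studies get published,
# the published effect sizes systematically over-estimate the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.2        # assumed true standardised mean difference
n_per_group = 20         # a small study
n_simulations = 10_000

published_effects = []
for _ in range(n_simulations):
    treatment = rng.normal(true_effect, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    t, p = stats.ttest_ind(treatment, control)
    if p < 0.05 and t > 0:   # only "positive, significant" results are published
        published_effects.append(treatment.mean() - control.mean())

print(f"True effect: {true_effect}")
print(f"Mean published effect: {np.mean(published_effects):.2f}")   # ~0.7-0.8, far above 0.2
```

Rerun the same selection rule with much larger samples and the bias shrinks substantially, which is one reason study size matters so much for the reliability of the literature.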
What can we do? One solution is to offer better training, and in particular to make researchers (especially early career researchers) aware of the kinds of bias that can unconsciously distort their interpretation of their own data. We recently received funding from the BBSRC to run a residential workshop for early career researchers on advanced methods for reproducible science. We also published a Manifesto for Reproducible Science, in which we argue for the adoption of a range of measures to optimise key elements of the scientific process: methods, reporting and dissemination, reproducibility, evaluation and incentives. Critically, many of these measures can be adopted by individual researchers or research groups, although others will require the engagement of key stakeholders – funders, journals and institutions. There may also be discipline-specific measures that can be taken – we recently outlined a range of measures intended to improve the reliability of findings generated by functional neuroimaging research.
Ultimately, the issue is one of quality control – at a workshop convened by the CHDI Foundation to discuss these issues, it was pointed out that most scientific findings are produced to be “fixed” (i.e., confirmed or disconfirmed) later. An analogy was made with the US automobile industry in the 1970s, where productivity was high but quality control very poor – it was the era of the “lemon”. The Japanese automobile industry took advice from the US statistician W. Edwards Deming and introduced quality control measures at every stage of the production pipeline; it still has a reputation for reliability today. Perhaps we need to do something similar with biomedical science.