How science goes wrong

Scientific research has changed the world. Now it needs to change itself

A SIMPLE idea underpins science: "trust, but verify". Results should always be subject to challenge from experiment. That simple but powerful idea has generated a vast body of knowledge. Since its birth in the 17th century, modern science has changed the world beyond recognition, and overwhelmingly for the better.

But success can breed complacency. Modern scientists are doing too much trusting and not enough verifying, to the detriment of the whole of science, and of humanity. Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis (see pages 26-30). A rule of thumb among biotechnology venture-capitalists is that half of published research cannot be replicated. Even that may be optimistic. Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 "landmark" studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers. A leading computer scientist frets that three-quarters of papers in his subfield are bunk. In 2000-10 roughly 80,000 patients took part in clinical trials based on research that was later retracted because of mistakes or improprieties.

What a load of rubbish

Even when flawed research does not put people's lives at risk (and much of it is too far from the market to do so), it squanders money and the efforts of some of the world's best minds. The opportunity costs of stymied progress are hard to quantify, but they are likely to be vast. And they could be rising.

One reason is the competitiveness of science. In the 1950s, when modern academic research took shape after its successes in the second world war, it was still a rarefied pastime. The entire club of scientists numbered a few hundred thousand. As their ranks have swelled, to 6m-7m active researchers on the latest reckoning, scientists have lost their taste for self-policing and quality control. The obligation to "publish or perish" has come to rule over academic life. Competition for jobs is cut-throat. Full professors in America earned on average $135,000 in 2012, more than judges did. Every year six freshly minted PhDs vie for every academic post. Nowadays verification (the replication of other people's results) does little to advance a researcher's career. And without verification, dubious findings live on to mislead.

Careerism also encourages exaggeration and the cherry-picking of results. In order to safeguard their exclusivity, the leading journals impose high rejection rates: in excess of 90% of submitted manuscripts. The most striking findings have the greatest chance of making it onto the page. Little wonder that one in three researchers knows of a colleague who has pepped up a paper by, say, excluding inconvenient data from results "based on a gut feeling". And as more research teams around the world work on a problem, the odds shorten that at least one will fall prey to an honest confusion between the sweet signal of a genuine discovery and a freak of the statistical noise. Such spurious correlations are often recorded in journals eager for startling papers. If they touch on drinking wine, going senile or letting children play video games, they may well command the front pages of newspapers, too.

Conversely, failures to prove a hypothesis are rarely even offered for publication, let alone accepted.
"Negative results" now account for only 14% of published papers, down from 30% in 1990. Yet knowing what is false is as important to science as knowing what is true. The failure to report failures means that researchers waste money and effort exploring blind alleys already investigated by other scientists. The hallowed process of peer review is not all it is cracked up to be, either. When a prominent medical journal ran research past other experts in the field, it found that most of the reviewers failed to spot mistakes it had deliberately inserted into papers, even after being told they were being tested. If it's broke, fix it All this makes a shaky foundation for an enterprise dedicated to discovering the truth about the world. What might be done to shore it up? One priority should be for all disciplines to follow the example of those that have done most to tighten standards. A start would be getting to grips with statistics, especially in the growing number of fields that sift through untold oodles of data looking for patterns. Geneticists have done this, and turned an early torrent of specious results from genome sequencing into a trickle of truly significant ones. Ideally, research protocols should be registered in advance and monitored in virtual notebooks. This would curb the temptation to fiddle with the experiment's design midstream so as to make the results look more substantial than they are. (It is already meant to happen in clinical trials of drugs, but compliance is patchy.) Where possible, trial data also should be open for other researchers to inspect and test. The most enlightened journals are already becoming less averse to humdrum papers. Some government funding agencies, including America's National Institutes of Health, which dish out $30 billion on research each year, are working out how best to encourage replication. And growing numbers of scientists, especially young ones, understand statistics. But these trends need to go much further. Journals should allocate space for "uninteresting" work, and grant-givers should set aside money to pay for it. Peer review should be tightened—or perhaps dispensed with altogether, in favour of post-publication evaluation in the form of appended comments. That system has worked well in recent years in physics and mathematics. Lastly, policymakers should ensure that institutions using public money also respect the rules. Science still commands enormous-if sometimes bemused-respect. But its privileged status is founded on the capacity to be right most of the time and to correct its mistakes when it gets things wrong. And it is not as if the universe is short of genuine mysteries to keep generations of scientists hard at work. The false trails laid down by shoddy research are an unforgivable barrier to understanding. ■ Scientists like to think of science as self-correcting. To an alarming degree, it is not Trouble at the lab (4T SEE a train wreck looming," warned A Daniel Kahneman, an eminent psychologist, in an open letter last year. The premonition concerned research on a phenomenon known as "priming". Priming studies suggest that decisions can be influenced by apparently irrelevant actions or events that took place just before the cusp of choice. They have been a boom area in psychology over the past decade, and some of their insights have already made it out of the lab and into the toolkits of policy wonks keen on "nudging" the populace. 
Dr Kahneman and a growing number of his colleagues fear that a lot of this priming research is poorly founded. Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. Many of these replications have failed. In April, for instance, a paper in PLoS ONE, a journal, reported that nine separate experiments had not managed to reproduce the results of a famous study from 1998 purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan.

The idea that the same experiments always get the same results, no matter who performs them, is one of the cornerstones of science's claim to objective truth. If a systematic campaign of replication does not lead to the same results, then either the original research is flawed (as the replicators claim) or the replications are (as many of the original researchers on priming contend). Either way, something is awry.

To err is all too common

It is tempting to see the priming fracas as an isolated case in an area of science (psychology) easily marginalised as soft and wayward. But irreproducibility is much more widespread. A few years ago scientists at Amgen, an American drug company, tried to replicate 53 studies that they considered landmarks in the basic science of cancer, often co-operating closely with the original researchers to ensure that their experimental technique matched the one used first time round. According to a piece they wrote last year in Nature, a leading scientific journal, they were able to reproduce the original results in just six. Months earlier Florian Prinz and his colleagues at Bayer HealthCare, a German pharmaceutical giant, reported in Nature Reviews Drug Discovery, a sister journal, that they had successfully reproduced the published results in just a quarter of 67 seminal studies.

The governments of the OECD, a club of mostly rich countries, spent $59 billion on biomedical research in 2012, nearly double the figure in 2000. One of the justifications for this is that basic-science results provided by governments form the basis for private drug-development work. If companies cannot rely on academic research, that reasoning breaks down. When an official at America's National Institutes of Health (NIH) reckons, despairingly, that researchers would find it hard to reproduce at least three-quarters of all published biomedical findings, the public part of the process seems to have failed.

Academic scientists readily acknowledge that they often get things wrong. But they also hold fast to the idea that these errors get corrected over time as other scientists try to take the work further. Evidence that many more dodgy results are published than are subsequently corrected or withdrawn calls that much-vaunted capacity for self-correction into question. There are errors in a lot more of the scientific papers being published, written about and acted on than anyone would normally suppose, or like to think.

Various factors contribute to the problem. Statistical mistakes are widespread. The peer reviewers who evaluate papers before journals commit to publishing them are much worse at spotting mistakes than they or others appreciate. Professional pressure, competition and ambition push scientists to publish more quickly than would be wise. A career structure which lays great stress on publishing copious papers exacerbates all these problems.
"There is no cost to getting things wrong," says Brian Nosek, a psychologist at the University of Virginia who has taken an interest in his discipline's persistent errors. "The cost is not getting them published." ►> Briefing Unreliable research 27 The Economist October 19th 2013 ► First, the statistics, which if perhaps off-putting are quite crucial. Scientists divide errors into two classes. A type i error is the mistake of thinking something is true when it is not (also known as a "false positive"). A type ii error is thinking something is not true when in fact it is (a "false negative"). When testing a specific hypothesis, scientists run statistical checks to work out how likely it would be for data which seem to support the idea to have come about simply by chance. If the likelihood of such a false-positive conclusion is less than 5%, they deem the evidence that the hypothesis is true "statistically significant". They are thus accepting that one result in 20 will be falsely positive—but one tn 20 seems a satisfactorily low rate. Understanding insignificance In 2005 John loannidis, an epidemiologist from Stanford University, caused a stir with a paper showing why, as a matter of statistical logic, the idea that only one such paper in 20 gives a false-positive result was hugely optimistic. Instead, lie argued, "most published research findings are probably false." Ashe told the quadrennial International Congress on Peer Review and Biomedical Publication, held this September in Chicago, the problem has not gone away. Dr loannidis draws his stark conclusion 011 the basis that the customary approach to statistical significance ignores three things: the "statistical power" of the study (a measure of its ability to avoid type 11 errors, false negatives in which a real signal is missed in the noise); the unlikeliness of the hypothesis being tested; and the pervasive bias favouring the publication of claims to have found something new. A statistically powerful study is one able to pick things up even when their effects on the data are small. In general bigger studies—those which run the experiment more times, recruit more patients for the trial, or whatever—are more powerful. A power of 0.8 means that of ten true hypotheses tested, only two will be ruled out because their effects are not picked up in the data; this is widely accepted as powerful enough for most purposes. But this benchmark is not always met, not least because big studies are more expensive. A study in April by Dr loannidis and colleagues found that in neuroscience the typical statistical power is a dismal 0.21; writing in Perspectives onPsydioIogicfll Science, Marjan Bakker of the University of Amsterdam and colleagues reckon that in that field the average power is 0.3s. Unlikeliness is a measure of how surprising the result might be. By and large, scientists want surprising results, and so they test hypotheses that are normally pretty unlikely and often very unlikely. Dr loannidis argues that in his field, epidemiology, you might expect one in ten hypoth- eses to be true. In exploratory disciplines like genomics, which rely on combing through vast troves of data about genes and proteins for interesting relationships, you might expect just one in a thousand to prove correct. With this tn mind, consider 1,000 hypotheses being tested of which just 100 are true (see chart). Studies with a power of 0.8 will find 80 of them, missing 20 because of false negatives. 
Of the 900 hypotheses that are wrong, 5% (that is, 45 of them) will look right because of type I errors. Add the false positives to the 80 true positives and you have 125 positive results, fully a third of which are specious. If you dropped the statistical power from 0.8 to 0.4, which would seem realistic for many fields, you would still have 45 false positives but only 40 true positives. More than half your positive results would be wrong.

The negative results are much more trustworthy; for the case where the power is 0.8 there are 875 negative results of which only 20 are false, giving an accuracy of over 97% (see the sketch below). But researchers and the journals in which they publish are not very interested in negative results. They prefer to accentuate the positive, and thus the error-prone. Negative results account for just 10-30% of published scientific literature, depending on the discipline. This bias may be growing. A study of 4,600 papers from across the sciences conducted by Daniele Fanelli of the University of Edinburgh found that the proportion of negative results dropped from 30% to 14% between 1990 and 2007. Lesley Yellowlees, president of Britain's Royal Society of Chemistry, has published more than 100 papers. She remembers only one that reported a negative result.

Statisticians have ways to deal with such problems. But most scientists are not statisticians. Victoria Stodden, a statistician at Stanford, speaks for many in her trade when she says that scientists' grasp of statistics has not kept pace with the development of complex mathematical techniques for crunching data. Some scientists use inappropriate techniques because those are the ones they feel comfortable with; others latch on to new ones without understanding their subtleties. Some just rely on the methods built into their software, even if they don't understand them.

Not even wrong

This fits with another line of evidence suggesting that a lot of scientific research is poorly thought through, or executed, or both. The peer-reviewers at a journal like Nature provide editors with opinions on a paper's novelty and significance as well as its shortcomings. But some new journals (PLoS ONE, published by the not-for-profit Public Library of Science, was the pioneer) make a point of being less picky. These "minimal-threshold" journals, which are online-only, seek to publish as much science as possible, rather than to pick out the best. They thus ask their peer reviewers only if a paper is methodologically sound. Remarkably, almost half the submissions to PLoS ONE are rejected for failing to clear that seemingly low bar.

The pitfalls Dr Stodden points to get deeper as research increasingly involves sifting through untold quantities of data. Take subatomic physics, where data are

[Chart: "Unlikely results - How a small proportion of false positives can prove very misleading". Legend: true, false, false negatives, false positives.
1. Of hypotheses interesting enough to test, perhaps one in ten will be true. So imagine tests on 1,000 hypotheses, 100 of which are true.
2. The tests have a false positive rate of 5%. That means they produce 45 false positives (5% of 900). They have a power of 0.8, so they confirm only 80 of the true hypotheses, producing 20 false negatives. The new true ...]
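The arithmetic behind the worked example and the chart is simple enough to check directly. Here is a minimal sketch in Python, assuming the article's illustrative figures (1,000 hypotheses, a one-in-ten chance of being true, a 5% false-positive threshold, and powers of 0.8, 0.4 and 0.21); the function and variable names are ours, for illustration only.

```python
# Expected outcomes of testing many hypotheses at a 5% significance threshold
# with a given statistical power, using the article's illustrative figures.

def testing_outcomes(n_hypotheses, prior_true, alpha, power):
    """Expected counts of true/false positives and negatives, plus the share
    of positive and negative results that are actually correct."""
    n_true = n_hypotheses * prior_true            # hypotheses that are really true
    n_false = n_hypotheses - n_true               # hypotheses that are really false

    true_positives = power * n_true               # real effects the tests detect
    false_negatives = n_true - true_positives     # real effects missed (type II errors)
    false_positives = alpha * n_false             # spurious "discoveries" (type I errors)
    true_negatives = n_false - false_positives    # correctly rejected hypotheses

    share_positives_correct = true_positives / (true_positives + false_positives)
    share_negatives_correct = true_negatives / (true_negatives + false_negatives)
    return (true_positives, false_positives, false_negatives, true_negatives,
            share_positives_correct, share_negatives_correct)

# 1,000 hypotheses, one in ten true, 5% false-positive rate; powers of 0.8 and
# 0.4 (the article's two scenarios) and 0.21 (the typical figure it cites for
# neuroscience).
for power in (0.8, 0.4, 0.21):
    tp, fp, fn, tn, pos_ok, neg_ok = testing_outcomes(1000, 0.10, 0.05, power)
    print(f"power {power}: {tp:.0f} true positives, {fp:.0f} false positives, "
          f"{pos_ok:.0%} of positives correct, {neg_ok:.1%} of negatives correct")
```

With a power of 0.8 this reproduces the chart's 125 positive results, around a third of them specious, and negatives that are right more than 97% of the time; at a power of 0.4 the 45 false positives outnumber the 40 true ones.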