The Smeesters affair

Slides of talk on Smeesters case, and on the Geraerts-Merckelbach Memory paper affair. Talk originally given December 2012; slides updated March 2013; title "Integrity or fraud - or just questionable research practices?"
Stimulated by media interest in the Geraerts-Merckelbach controversy over their "Memory" paper, I studied the published summary statistics in that paper using the same techniques as Simonsohn used for Smeesters, and found quite clear statistical evidence of "too good to be true". Without experimental protocols written up prior to the experiment, original data-sets, and laboratory log books detailing all the data selection and manipulation steps which resulted in the final data-set on which the summary statistics in the paper are based, one can only guess how these patterns arose. It certainly need not be fraud (fraud requires an active intention to deceive).

R-code for experiment with Simonsohn's fraud test (new version)
Histogram of p-values of an honest researcher
Histogram of p-values of a dishonest researcher

Disclaimer: These notes formed an attempt to reconstruct the statistical analyses performed by the Erasmus University Committee on Scientific Integrity, based initially only on the censored version of the committee's report which Erasmus released at the start of the affair. Later an uncensored version was made available, and later still Uri Simonsohn made a "working paper" available which reveals yet more details of his methodology. The uncensored Erasmus report still leaves, for me, many unanswered questions concerning their exact methodology.

The main resource for these notes was therefore the report of the Erasmus University Committee on Scientific Integrity (commissie wetenschappelijke integriteit, CWI). I am very grateful for recent communication with fraud-buster Uri Simonsohn, social psychologist at the Wharton School's Department of Operations and Information Management, who pointed out a major error in an earlier version of my R experiment, and also in my thinking! The results given here are still speculative, and my understanding of the statistical procedures used by Erasmus-CWI (who refuse to comment) may be far from correct.

Simonsohn's web page contains links to two interviews he has given, in "Nature" and in the Dutch newspaper de Volkskrant (English translation). For Smeesters' version of events see the interview with Smeesters in the Belgian (Flemish) newspaper De Standaard; part two of this interview is unfortunately only available to subscribers.

Just recently Uri has released a new working paper called "Just post it: the lesson from two cases of fabricated data detected by statistics alone". A link can be found at his homepage. The paper contains another link to some useful supplementary material.

According to the Erasmus-CWI report, Simonsohn's idea was that if extreme data has been removed in an attempt to boost significance, the variance of sample averages will decrease. Now researchers in social psychology typically report averages, sample variances, and sample sizes of subgroups of their subjects, where the groups are defined partly by an intervention (treatment/control) and partly by covariates (age, sex, education ...). So if some of the covariates can be assumed to have no effect at all, we effectively have replications: i.e., we see group averages, sample variances, and sample sizes of a number of groups whose true means can be assumed to be equal. Simonsohn's test statistic for testing the null hypothesis of honesty against the alternative of dishonesty is the sample variance of the reported averages of groups whose means can be assumed to be equal. The null distribution of this statistic is estimated by a simulation experiment, by which I suppose is meant a parametric bootstrap in the situation where we do not have access to the original (but post-massage) data of the experiment. If the "original" data is available we could use the full (non-parametric) bootstrap.

In the present experiment I used a parametric bootstrap under an assumption of normality. Within the bootstrap, we pretend that the reported group variances are population values, we pretend that the actual sample sizes are the reported sample sizes, and we play the part of an honest researcher who draws normally distributed samples of these sizes and variances, all with the same mean.
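As a concrete illustration, here is a minimal R sketch of such a parametric bootstrap. The function and argument names are my own and the code is not necessarily identical to the R-code linked above:

    ## Parametric bootstrap p-value for Simonsohn's statistic: the sample
    ## variance of the group means, recomputed from reported summary statistics.
    ## (Sketch only; function and argument names are my own.)
    variance.pvalue <- function(means, sds, ns, B = 1000) {
      observed <- var(means)          # the test statistic: variance of the reported group means
      boot <- replicate(B, {
        ## an "honest researcher" draws normal samples with the reported sizes and
        ## standard deviations, all with the same mean; the common mean drops out
        ## of the variance, so we may as well take it to be zero
        sim.means <- mapply(function(n, s) mean(rnorm(n, mean = 0, sd = s)), ns, sds)
        var(sim.means)
      })
      mean(boot <= observed)          # one-sided p-value: too little variation is suspect
    }

Using the weighted grand mean of the reported averages instead of zero would make no difference, since the variance of the simulated group means is invariant under a common location shift.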

I did 1000 simulations of honest and of dishonest researchers in the following scenario: each researcher takes a sample of size 20 from each of 5 groups (standard normal distributions, same means and variances). The dishonest researcher discards all observations larger than 0.5, and all observations smaller than -2.0. He or she is attempting to increase the significance of the difference between these combined groups of subjects and other groups, by making the mean value for the groups whose statistics we are studying a whole lot lower, through the removal of a massive number of observations (everything bigger than 0.5). Moreover, the variance is further reduced by removal of a few very small observations (everything smaller than -2.0). The sample size is reduced to about two thirds of its original size in this way.

Simonsohn's test statistic is the sample variance of the group means. Its null distribution is estimated in my experiment by parametric bootstrap. My bootstrap sample size was 1000, and each bootstrap sample consists of five normal samples with equal means, sample sizes equal to the reported sample sizes (varying in the dishonest case), and variances equal to the reported sample variances. So each bootstrap sample generates one observation of "variance of the five group means". The relative frequency of bootstrap test statistics smaller than the actually observed test statistic is the p-value according to the empirical bootstrap null distribution of the test statistic. The idea is that the faked data display less variation than their summary statistics would lead one to expect. The histograms displayed here are histograms of these bootstrap p-values for a one-sided test (reject for small values), for honest and for dishonest researchers. Fortunately for the honest researchers, their p-values are close to uniformly distributed. The dishonest researchers, on the other hand, tend to have p-values which are much smaller, as we had expected.
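In R the whole experiment might be condensed roughly as follows. Again this is just a sketch with my own naming, and it re-uses the hypothetical variance.pvalue function from the previous sketch:

    ## One simulated researcher: 5 groups of 20 standard normal observations;
    ## the dishonest version discards everything above 0.5 and below -2.0
    ## before "reporting" means, standard deviations and sample sizes.
    ## (Assumes variance.pvalue() from the sketch above.)
    one.researcher <- function(honest = TRUE, k = 5, n = 20) {
      groups <- replicate(k, rnorm(n), simplify = FALSE)
      if (!honest) groups <- lapply(groups, function(x) x[x <= 0.5 & x >= -2.0])
      variance.pvalue(means = sapply(groups, mean),
                      sds   = sapply(groups, sd),
                      ns    = sapply(groups, length))
    }
    set.seed(1)
    p.honest    <- replicate(1000, one.researcher(honest = TRUE))
    p.dishonest <- replicate(1000, one.researcher(honest = FALSE))
    hist(p.honest);    mean(p.honest    <= 0.05)   # estimated actual size at nominal level 5%
    hist(p.dishonest); mean(p.dishonest <= 0.05)   # estimated power against this alternative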

This little experiment, which takes 80 seconds to run on my 2.4 GHz Intel Core 2 Duo MacBook Pro, shows that Simonsohn's test statistic could be a valuable tool. Indeed, this massive and one-sided data-amputation has decreased the variance of group averages, relative to what one would expect from the reported groups' sample sizes and standard deviations. To put it another way, the actual variation in the reported group averages is too good to be true relative to the reported standard deviations and reported sample sizes.

In my experiment, the test does indeed have actual size very close to 5% when used at nominal level 5%. Its power against the alternative which I took is about 12%. Not extremely exciting, but the idea is to combine many such tests over many groups of groups of subjects, and possibly also over a number of experiments reported in the same paper, or even over a number of papers by the same researcher.
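How such p-values might be combined is not spelled out in the material available to me. One standard possibility (not necessarily Simonsohn's or Erasmus-CWI's) is Fisher's method, sketched here in R:

    ## Fisher's method for combining independent one-sided p-values:
    ## under the simultaneous null hypothesis, -2 * sum(log(p)) is
    ## chi-squared distributed with 2 * length(p) degrees of freedom.
    fisher.combine <- function(p) {
      pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
    }
    fisher.combine(c(0.10, 0.08, 0.12))   # three unremarkable p-values combine to about 0.03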

It would be interesting to see if a non-parametric bootstrap gives similar results. The massaged data has a far from normal distribution.
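Given the raw (post-massage) group data, one way to do this (my own construction, not taken from the linked code) would be to centre each group and resample it with replacement, so that the bootstrap world satisfies the null hypothesis of equal group means:

    ## Non-parametric version: given the raw data of the k groups, centre each
    ## group and resample it with replacement, so that all bootstrap group means
    ## are truly equal (namely zero); compare the observed variance of the group
    ## means with its bootstrap null distribution.
    np.variance.pvalue <- function(groups, B = 1000) {
      observed <- var(sapply(groups, mean))
      centred  <- lapply(groups, function(x) x - mean(x))
      boot <- replicate(B,
        var(sapply(centred, function(x) mean(sample(x, replace = TRUE)))))
      mean(boot <= observed)
    }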

I have the following (provisional) four criticisms of the methodology and the reporting thereof.
(1) We can be sure that, honest or dishonest, the original data is not normally distributed. So statistical conclusions based on a parametric bootstrap are rather tentative.
(2) The Erasmus report told us nothing about how Simonsohn got onto Smeesters' trail. Was he on a cherry-picking expedition, analysing hundreds of papers or hundreds of researchers, and choosing the most significant outcome for follow-up? This is not revealed in the Erasmus report, but it is important to know in order to judge the significance of Erasmus' re-analysis of the same data. According to Simonsohn, someone drew his attention to one of Smeesters' papers in August 2011 ("The effect of color (red versus blue) on assimilation versus contrast in prime-to-behaviour effect", coauthor Jia Liu, University of Groningen). Why? Knowing Simonsohn's reputation, this was presumably someone who had his or her own suspicions. Why? According to the Erasmus-CWI report, Simonsohn concluded from his statistical methodology that the summary statistics in that paper were too good to be true. Simonsohn requested and obtained Smeesters' dataset and discovered more anomalies which confirmed his initial opinion. Smeesters tried to explain some of the anomalies, but his explanations would make the observed anomalies less likely, not more likely. At this point Smeesters' computer crashed and all original material of all of his experiments, ever, was lost. Other statistical analyses, replicated by CWI and described in Appendix 4 of the Erasmus report, confirmed that these further striking patterns in the data (of a completely different nature to what we are studying here) are extremely unlikely to have arisen by chance. Smeesters' data has not been released, so it is still not possible to replicate the most crucial parts of the analysis: those which, as it were, independently confirmed the initial suspicions of something worse than data-massage.
(3) Erasmus-CWI uses the pFDR method (positive False Discovery Rate) in some kind of attempt to control for multiple testing. In my opinion, adjustment of p-values by pFDR methodology is absolutely inappropriate in this case. It includes a guess or an estimate of the a priori "proportion of null hypotheses to be tested which are actually false". Thus it includes a "presumption of guilt"! The pFDR method, when correctly used, guarantees that of those results which are nominally significant at the pFDR-adjusted level 0.05, only 5% are false positives. Thus 95% of the "significant" results are for real - if the a priori guess/estimate is correct, and in the long run! This methodology was invented for massive cherry-picking experiments such as genome-wide association studies. It was not invented to correct for multiple testing in the traditional sense, when the simultaneous null hypothesis should be taken seriously. Innocent till proven guilty; not proven guilty by an assumption that you are guilty some significant proportion of the time. In order to be protected from cherry-picking by Erasmus-CWI, Smeesters himself should have insisted on a Bonferroni correction of the p-values which they report - a much stronger requirement than the pFDR correction. (A small R illustration of the difference follows this list of criticisms.)
(4) The CWI report is itself not a good model of reporting statistical analyses. There are hundreds of different pFDR methods: which one was used? Simonsohn's paper is not only unpublished but still unobtainable, and the description in the Erasmus report is terse. One cannot reproduce their analyses from their description of their procedures. One might object that this report is the result of an internal investigation by an organisation carrying out a disciplinary investigation of one of its employees, hence that Erasmus University is actually behaving with unusual and exemplary transparency. I would retort that by publishing this report Erasmus University is broadcasting a public condemnation of Smeesters as a scientist, which goes far beyond the internal needs of an organisation in the unfortunate situation when it has to terminate the employment of an employee for some misdemeanours. It was Erasmus University, not the authors, which had a number of Smeesters' publications withdrawn. I am amazed that the statisticians involved in the investigation at Erasmus are apparently forbidden to reveal the smallest technical details of their analyses to fellow scientists.
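To illustrate the difference complained about under point (3): with ten made-up p-values, the standard R function p.adjust shows how much weaker an FDR-type adjustment is than Bonferroni (Benjamini-Hochberg is used here merely as a stand-in for whichever pFDR variant was actually used):

    ## Ten made-up p-values, purely for illustration.
    p <- c(0.001, 0.004, 0.008, 0.02, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9)
    round(p.adjust(p, method = "BH"), 3)          # FDR-style: four values remain at or below 0.05
    round(p.adjust(p, method = "bonferroni"), 3)  # Bonferroni: only two survive at 0.05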

My overall opinion is that Simonsohn has probably found a useful investigative tool. In this case it appears to have been used by the Erasmus committee on scientific integrity like a medieval instrument of torture: the accused is forced to confess by being subjected to an onslaught of vicious p-values which he does not understand. Now, in this case, the accused did have a lot to confess to: all traces of the original data were lost (both the original paper records and, later, the computer files), none of his co-authors had seen the data, and all the analyses were done by himself alone, without assistance. The report of Erasmus-CWI hints at even worse deeds. However, what if Smeesters had been an honest researcher?

Incidentally, in my opinion cherry-picking and data-massaging in themselves are not evil. What is evil, is not honestly reporting your statistical procedures, and that includes all selection and massaging. A good scientist reports an experiment in such a way that others can repeat it. That includes the statistical analyses; and that includes the methodology by which you choose which results of which experiments to report, and how your data has been massaged.

In physics, the interesting experiments are immediately replicated by other research groups. Interesting experiments are experiments which push into the unknown, in a direction in which there are well-known theoretical and experimental challenges. Experiments are repeated because they give other research groups a chance to show that their experimental technique is even better, or to genuinely add new twists to the story. In this way, bad reporting is immediately noticed, because experiments whose results cannot be replicated immediately become suspect. Researchers know that their colleagues (and competitors) are going to study all the methodological details of their work, and are going to look critically at all the reported numbers, and are going to bother them if things don't seem to match or important info is missing. In particular, if the experiment turns out to be methodologically flawed, you can be sure someone is going to tell that to the world.

The problem in social psychology (to reveal my own prejudices about this field) is that interesting experiments are not repeated. The point of doing experiments is to get sexy results which are reported in the popular media. Once such an experiment has been done, there is no point in repeating it.

(Last updated: 15 December 2014).