Lying Statistics Damn Nurse Lucia de B

Petition for a retrial of the Lucia de Berk case

WARNING: I am a statistician, not a lawyer. I don’t have access to original sources. There is disagreement over the events leading to Lucia’s conviction. I discuss here only the statistical aspects of the case, except in so far as non-statistical information is necessary to follow the statistical arguments.

Dutch summaries: lucia.txt, interview at Opinieleiders.nl, interview in Mare
Read the conclusions of the courts (in Dutch) at “www.rechtspraak.nl”: LJ-numbers AE8436 (08-10-2002), AP2846 (18-06-2004), AY3864 (13-07-2006).

In the original conviction of Lucia, a statistical analysis played an explicit role. After the appeal, the verdict was maintained but the statistical calculation was removed. However, the data remained, and the “obvious” coincidence between incidents and Lucia’s presence remains a crucial step in the prosecution’s case as well as an influence on the evaluation by medical specialists of the medical evidence. I discern the following four biases in the data-gathering and analysis. The fourth bias applies only to the analysis of the data by Henk Elffers. This elementary blunder was not noticed in the original trial by anyone, but it contributed a factor of about one thousand to the famous “one in 342 million” which was widely publicized in the media and has since been in the minds of everyone concerned with the trial, the appeal, and the final hearing in the supreme court.

  • Confirmation bias: raw data is interpreted and manipulated in view of a pre-desired conclusion
  • Selection bias: sequentially acquired data (stopping rule: “non-sequential” p-value is small enough) is analyzed as though non-sequential
  • Selection bias bis: relevant statistical information is discarded
  • A technical blunder further biases the conclusion: combination of p-values by multiplication instead of by the easy “last resort” Fisher’s method or the more appropriate Cochran-Mantel-Haenszel test; see the standard textbook Agresti (2002), website Categorical Data Analysis, and the accompanying manual Thompson (2006).

    Ton Derksen, emeritus professor of philosophy at the University of Nijmegen, and his sister Metta de Noo (medical doctor) believe that Dutch nurse Lucia Isabella Quirina de Berk (“Lucia de B”) has been convicted of seven murders and three attempted murders using (among other things) statistical arguments combined with abuse of every basic rule in the statistics textbook. The more I learn about this case the more I am inclined to believe them. The statistical methodology in the case has been discussed extensively for the last five years by the Dutch scientific community, but always while uncritically accepting the raw numbers presented by the prosecution.

    The case for the prosecution makes crucial use of the coincidence between “suspicious” (unexplainable) “incidents” (crisis situations in medium-intensive care, involving attempted resuscitation) and her presence. Hence incidents are caused by her presence. Hence these incidents are murder attempts, most of them successful.

    This “statistical argument” does not stand alone. In two cases the judge considers that medical evidence proves that she committed murder and attempted murder, respectively. I am not qualified to evaluate those cases, both of which are contested by medical experts, but together with (contested) psychological analysis of her personality and personal history they are supposed to show that she is a killer with no conscience, no motive, and superb skill in wiping out all traces of her activity. A supposed similarity between these two cases and the other eight selected by the prosecution is enough to convict her of those eight too. The suggestion that this is just the tip of the iceberg is not part of the legal case (the media initially reported that she was involved in at least 30 suspicious deaths).

    I remind you that the following description of crucial events leading up to the case is disputed.

    Amber, a six-month-old baby born with multiple genetic disorders, died unexpectedly in medium care. Lucia was on duty and though Lucia had been liked by most of her colleagues, there had recently been gossip about her attitude, her clothing, her past and “strange incidents around her”. In fact Lucia herself had been to see her supervisor, upset because it seemed that deaths and other incidents always happened during her shifts. When Amber died it was too much for one of her colleagues, who told the paediatrician, who went to the director. Amber had died in the night of 4 September 2001, the death filed as “natural”. By the next afternoon the paediatricians agreed the death was “unnatural”, and the hospital’s director had informed the police and fired Lucia from her job. Within a few days she had been associated with a total of 30 incidents at several hospitals where she had recently worked. Quite a few incidents however could later not be linked to Lucia after all (e.g., she was on vacation ...) and became “normal events” again. Very early on, the hospital director had an amateur statistical calculation done, concluding that the chance was one in seven billion that Lucia was innocent. On 16 September the hospital formally reported five murders and five attempted murders to the police.

    At an early stage, the police consulted Henk Elffers, professor, law faculty, University of Antwerp, and senior researcher at the research institute NSCR (the Netherlands Institute for the Study of Crime and Law Enforcement is run by the Dutch national science foundation NWO). Elffers quit statistics some 30 years ago, halfway into his PhD studies, and moved to geography, psychology, economics and law. He convinced the police that the “coincidence” could not be due to chance. He told me that at that stage, the police still had the idea that this was a big fuss about nothing. He later submitted official reports to the court, stating that the probability was 1 in 342 million that such an extreme number of incidents would be concentrated in her shifts purely by chance. His reports are available (from me) for scientific study but are not entirely in the public domain. I hope some time to prepare an English translation of them. Elffers’ analysis involved data from another hospital where Lucia had previously worked, but, strangely, not from three others where she is supposed also to have been active, in between those two spells. The data is reproduced and reanalyzed in a paper by Meester, Collins, Gill and van Lambalgen, with discussion, in Law, Probability and Risk. The court asked expert advice from Mr. Richard de Mulder, who is professor, law faculty, Erasmus University Rotterdam. According to de Mulder, they needed Elffers’ analysis explained to them by a non-statistician - de Mulder has an MBA and specializes in IT in law. His credentials were that he had written an article in a Dutch legal-and-IT magazine about some statistical concepts, some ten years earlier. He declared officially to the court that Elffers’ analysis was absolutely correct and proper. Coincidentally he had been collaborating with Elffers during the previous three or four years in research on computer-guided teaching and examination for law students.

    One in 342 million, one in seven billion – tiny probabilities were common knowledge from the very beginning of the case. Everyone involved in the police investigation (the family of victims, hospital staff, medical experts) knew they had a serial killer on their hands, the worst in the history of the Netherlands, and knew who she was ... Elffers has repeated to me many times some of the rumours and scare stories about the terrible killer nurse, which we now know have absolutely no basis in any facts. Moreover he was deeply moved by the feelings of the poor parents who had lost their children.

    There followed a court case and a conviction and intense media coverage; a lost appeal; a second lost appeal at the highest possible level. At each appeal procedural mistakes were corrected and evidence withdrawn but essentially the conviction on the original grounds was maintained. Presently, a committee installed to investigate possible miscarriages of justice, after several serious recent occurrences, might be about to decide to allow this case to be reopened. For this it is necessary that they admit that there is “new evidence”.

    For more information, in Dutch, read Derksen’s analysis of the case, or see the main site supporting Lucia de B (de Noo, Derksen), now also in English, and Dutch Wikipedia. There is also an English version of the Dutch site, and an English Wikipedia page.

    I am not aware of any readable and sufficiently detailed documentation, from the point of view of the prosecution, in the public domain. “Hard” English information of any kind is almost nonexistent. Here’s an article from the New York Times on the first case, something from the BBC on her appeal, and finally a small item on the possible reopening of the case. For lurid details, just do a Google search on “Dutch serial killer nurse”. Please do not believe everything you read in the newspapers; or on the internet for that matter ;-)

    It appears now that there are quite a few statistical problems with the statistical evidence (with or without the statistical calculation) on the basis of which she was convicted of murders and attempted murders. The following list corresponds point by point with the four deadly sins I catalogued above:

  • Whether or not an “incident” is classified as “suspicious” depends on whether or not Lucia is on duty: The specialist who does the classification has the full case-dossier, knows which nurse was on duty and why he is being asked to classify the incident. In several cases, there are differing opinions and then the desired opinion is selected (expert A is preferred over B in one case, and vice-versa in another). One specialist said explicitly that a particular death was suspicious only because Lucia was on duty. There are plenty of unexplainable deaths to choose from in a medium care ward where many patients (e.g., babies with severe birth defects) die “unexpectedly”; in another hospital, we are talking about aged and terminally ill patients. The data gatherers know there was a serial killer on the loose and know if she was around or not (though sometimes they make mistakes). Prosecution statistician Elffers puts down guidelines for how the classification should be done objectively [should have been done?] but this does not mean that those guidelines were followed.

  • Data is collected sequentially till it condemns her (stopping rule); not just from her original ward but also from two wards at another hospital where she had recently worked; but not from three other hospitals where she earlier worked as trainee and is also supposed to have committed murders. Two almost identical, adjacent wards are separated – nicely boosting the outcome of the statistical formula (an incorrect statistical formula, see point four). Another adjacent and almost identical ward, where she initially was thought to have murdered, is not included (because there were many incidents there, but none which could be connected to Lucia after all?)

  • Curious that a mass murderer could kill so many people and simultaneously take care that the total number of deaths on the ward is actually lower than in a similar period before she worked at this hospital: this data is not incorporated in the analysis or even made available! Of course, if she was only practising euthanasia, speeding by a few weeks the deaths of patients who were going to die anyway, this is not relevant. But the prosecution claims that many of the children and old people she killed were doing very well and were not going to die ...

  • Expert for the prosecution Henk Elffers, Professor at the University of Antwerp, senior researcher at the Netherlands Institute for the Study of Crime and Law Enforcement (a friend, former colleague and co-author of mine from our Mathematical Centre days thirty years ago, before he moved to geography, economics, psychology and law) apparently does not know the meaning of a p-value. He multiplies three independent p-values (from three wards where Lucia has worked) and appears to present the product as a p-value rather than using one of the well-known ways to compensate for the number of statistical tests being combined. This error biases the result against Lucia. In fact, two of the wards are almost identical, adjacent wards in the same hospital. There seems no reason to split them except perhaps to get the (apparent) p-value smaller. He even remarks in his report that if more data becomes available from the other hospitals, the probability will become smaller still!
  • Lawyers and doctors find it difficult to grasp that rejecting the null hypothesis of pure randomness does not imply one must believe the alternative of murder (whether this is done formally by a statistical calculation or informally by just glancing at the numbers). Maybe Lucia had more night shifts? Maybe she was such a conscientious nurse that she sometimes noticed a problem with a patient which actually was a false alarm? (At least one of the incidents in her shifts is exactly of this nature). The court looked summarily at a couple of such possible explanations, which had been mentioned by Elffers “by way of example”, but the court made no effort whatsoever to investigate others, mentioned by Meester, and one (time variation) mentioned implicitly by Elffers (he did not want to look at other time periods because of possible changes over time). As far as I know, no research has ever been done on variation between nurses, or on clustering and time variation in a ward. It is an axiom of the medical profession that there is no difference between different nurses but the medical profession is not always so sharp in probability theory, cf. Sir Roy Meadow and the British cot deaths... In fact, nurses have a lot of influence on which shifts they work and even use this to avoid, or in other cases to experience, shifts which are expected to be particularly hard. We know now that in the weeks before September 4 one of Lucia’s supervisors was already suspicious of her and deliberately assigned her a difficult shift in order to see what would happen.
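
    The point about innocent variation between nurses, raised in the last item above, can be illustrated with a small calculation. This is only a sketch and not an analysis of the case data: the expected count, the observed count and the amount of spread between nurses are hypothetical numbers chosen for illustration. It compares the chance of a nurse experiencing many incidents when all nurses share one incident rate (a Poisson count) with the chance when rates vary innocently from nurse to nurse. The variation is modelled, as an assumption, by a gamma-distributed rate, which gives a negative binomial count with the same mean.

```python
import math


def poisson_tail(k, mean):
    """P(X >= k) when X is Poisson with the given mean (identical nurses)."""
    below = sum(
        math.exp(-mean + x * math.log(mean) - math.lgamma(x + 1))
        for x in range(k)
    )
    return 1.0 - below


def negbin_tail(k, mean, shape):
    """P(X >= k) when each nurse's rate is gamma-distributed.

    The gamma distribution has the given mean and shape; a smaller shape
    means more variation between nurses. The count is negative binomial.
    """
    p = shape / (shape + mean)

    def log_pmf(x):
        return (
            math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
            + shape * math.log(p) + x * math.log(1.0 - p)
        )

    below = sum(math.exp(log_pmf(x)) for x in range(k))
    return 1.0 - below


# Hypothetical illustration values, NOT taken from the case files.
EXPECTED = 2.0   # incidents expected in one nurse's shifts
OBSERVED = 8     # incidents actually seen in her shifts
SHAPE = 1.0      # assumed spread of incident rates between nurses

homogeneous = poisson_tail(OBSERVED, EXPECTED)
heterogeneous = negbin_tail(OBSERVED, EXPECTED, SHAPE)

print(f"identical nurses : P(X >= {OBSERVED}) = {homogeneous:.5f}")
print(f"varying nurses   : P(X >= {OBSERVED}) = {heterogeneous:.5f}")
print(f"ratio            : {heterogeneous / homogeneous:.1f}")
```

    With the same expected number of incidents, the tail probability is much larger once rates are allowed to differ between nurses, so a count that looks extreme under the "all nurses are identical" assumption need not be extreme at all. How large the effect is depends entirely on the assumed spread, which would have to be estimated from real data.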

    The statistical calculation is no longer part of the conviction – the statistical community would not provide the probability, despite repeated requests by the judge (!). In fact the statisticians’ public dispute has been so vociferous that the judge obviously felt it unsafe to rely on any particular number. Consequently, statisticians no longer take much notice of the case. However, the numbers, which appear to speak for themselves (but what do they say?), do remain a vital part of the conviction.

    Incidentally, Lucia de Berk has throughout insisted on her innocence, a fact used against her to paint her as a pathological liar; the lack of any medical signs of poisoning shows what a refined killer she was.

    The statistical profession in the Netherlands, myself included, is deeply shamed that no-one from my profession has really investigated how the data was actually gathered ... It has taken five years, three court cases, several scientific meetings, numerous articles in newspapers and semi-professional magazines and professional journals, before medical doctor Metta Z. de Noo and her brother, philosopher of science Ton Derksen from the University of Nijmegen, have gone deeply into that question ...

    History of statisticians’ involvement: Elffers, later supported by Mr. R.V. de Mulder (specialist on law and IT) does the statistics for the investigation and later for the court. Henk Elffers holds a chair in psychology and law, but he has many interests and was initially trained as a mathematician and statistician. Prof. Ronald Meester (probabilist) and Prof. Michiel van Lambalgen (logician and cognitive scientist) become experts for the defense – feeling from the newspaper reports that something is wrong they offer their services of their own accord, and uncover many irregularities. Because they don’t think that Elffers’ null hypothesis is relevant, and they don’t like significance tests anyway, they do not put much emphasis on the fact that within Elffers’ chosen paradigm (and pretending the data is gathered unbiasedly) his statistical analysis is just wrong (in fact, I think they did not understand his analysis at all; their criticism certainly holds water, but would have been so much more devastating if it had been more professional and less academic and if they had made the tactical choice to argue a completely different much larger probability, rather than to argue that in general a statistical analysis was altogether impossible). More newspaper reports and talks lead to a public but in my opinion totally irrelevant debate between Bayesian and frequentist statisticians. I am the first to agree that “applied probability” or “empirical Bayes” calculations or thought experiments are highly enlightening; they too explode the prosecution’s case. Even under pessimistic assumptions on the proportion of serial murderers among the Dutch nursing population, the number of innocent nurses who would accidentally be convicted of murder by statistics is much larger than the number of killers who would be caught. Even worse, allowing for “innocent” heterogeneity among nurses’ rates of experiencing incidents, Lucia’s numbers are not so extreme at all.
Unfortunately the results of thought experiments are not allowed to be used in the courtroom. It is strange that science does progress in this way, but justice does not, yet both are in search of the truth.
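
    The thought experiment about innocent nurses outnumbering killers is simple arithmetic, and a minimal sketch may make it concrete. All inputs below are hypothetical round numbers, not official figures and not numbers from the case: the size of the nursing population, the assumed fraction of killers, the chance that an innocent nurse produces an equally "suspicious" pattern, and the chance that a real killer is flagged. The function name `flagged_counts` is likewise just an illustrative choice.

```python
def flagged_counts(n_nurses, killer_fraction, false_alarm_prob, detection_prob):
    """Expected numbers of guilty and innocent nurses flagged by the statistics."""
    killers = n_nurses * killer_fraction
    innocents = n_nurses - killers
    true_hits = killers * detection_prob        # killers who are flagged
    false_hits = innocents * false_alarm_prob   # innocents who are flagged
    return true_hits, false_hits


# Hypothetical inputs, chosen only to show the mechanism.
N_NURSES = 250_000              # assumed size of the nursing population
KILLER_FRACTION = 1 / 100_000   # assumed (pessimistic) fraction of killers
FALSE_ALARM_PROB = 1 / 10_000   # assumed chance an innocent nurse is flagged
DETECTION_PROB = 1.0            # assume every killer is flagged

true_hits, false_hits = flagged_counts(
    N_NURSES, KILLER_FRACTION, FALSE_ALARM_PROB, DETECTION_PROB
)
share_guilty = true_hits / (true_hits + false_hits)

print(f"expected killers flagged  : {true_hits:.2f}")
print(f"expected innocents flagged: {false_hits:.2f}")
print(f"share of flagged nurses who are guilty: {share_guilty:.1%}")
```

    Under these assumed inputs most flagged nurses are innocent, even though every killer is caught. Changing the inputs changes the numbers, but the conclusion holds whenever killers are much rarer than false alarms.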

    The debate probably confirms the public’s view that statistics is all lies and it does not do much credit to the statistical profession. In retrospect the debate is doubly irrelevant since no-one has the data and no-one knows how and why the data was collected and no-one asks. The problem with post-hoc testing of hypotheses is debated widely, but most participants don’t realise that the data is being used not just to prove Lucia’s guilt but simultaneously to prove that murders are being committed. No-one suspects that the data could have been essentially fabricated to establish a statistical correlation, hence suspicion of foul play, hence murder, and hence a murderer. Throughout all this, the appalling ignorance of probability and statistics in the legal and medical professions is sometimes forgotten. Concerning scientific evidence in general, it appears that a Dutch judge can choose from conflicting scientific opinions, picking even the minority opinion, and whether on statistics, medicine or psychology, with the only motivation that the choice supports the (foregone) conclusion of the judge. Though the statistical arguments find their way into the public domain and eventually into the international – English language – domain, the medical and psychological arguments remain largely hidden and certainly not discussed in any international arena.

    Let me end on an optimistic note. Expert witness for the prosecution Henk Elffers always insisted that his analysis only showed that the observed coincidence could not be due to pure chance, not that Lucia caused the deaths. I have always agreed. Perhaps he is right in a way he did not suspect. If de Noo’s and Derksen’s findings are correct, which I believe them to be, statistical considerations, especially meta-statistical considerations (considerations of statistical sampling methodology – or if you like, proper scientific procedures), show that there is no way to draw any support for the prosecution from the statistical data that was previously used to convict Lucia de Berk of multiple murders. I trust that lawyers will consider this “new evidence”. I refer here to the “committee Posthumus II”. They can only recommend reopening of any case if tunnel-vision and misuse of scientific evidence by the police and/or public prosecution has possibly led to a miscarriage of justice. It is apparently impossible that Dutch judges could have failed in any way.


    The above writing (original text 5 and 6 November 2006, since then, many times updated) was precipitated by an email to me (4 November) from computer scientist Peter Grünwald. Here is his eloquent (Dutch) analysis of the then situation.

    Article by author Maarten ’t Hart, from NRC Handelsblad, October 7: A Modern-day Witch-trial (in Dutch). Similarity with the case of Beethoven’s great-grandmother, burnt to death on Brussels market square, because a cow died when she was in the neighbourhood (and then a couple more farmers reported the same had happened to them).

    My earlier statistical contribution to the debate is incorporated in the paper: On the (ab)use of statistics in the legal case against the nurse Lucia de B, available at Cornell on arXiv.org, which has now appeared with discussion in Law, Probability and Risk, joint with Marieke Collins, Michiel van Lambalgen, Ronald Meester; somewhat overtaken by more recent events ...

    Some statistical notes, R scripts, data and data-analysis; you might like to install the doubly free statistical package R (aka GNU S), and use copy and paste to learn some statistics for yourself. You will find in my notes, both the original three 2x2 tables analysed by Elffers, as well as the new three 2x2 tables presented to the three wise men of Posthumus II by Derksen; discussion on interpretation of the numbers; ideas for various analyses not yet done; various p-values. You can now see the original raw data as two (csv format) spreadsheets RKZ.csv, JKZ.csv. Here are graphical representations, RKZ.pdf and JKZ.pdf. The original spreadsheets had columns and rows reversed. I have them too, if anyone wants to go back to the true originals.

    One in Nine innocent nurses go to jail, if they dress unconventionally. This new little paper (revised, joint with Piet Groeneboom) explores the consequence of innocent heterogeneity among nurses’ shifts: something which most nurses find almost obvious, most medical specialists find impossible. Another paper-in-preparation analyzes Elffers’ mistaken multiplication of p-values. The Mantel-Haenszel test is about strengths and weaknesses of this test - bread-and-butter in modern epidemiology and medical statistics, introduced about the same time as Henk Elffers quit mathematical statistics 30 years ago. That no statistician involved in this case had ever heard of this method tells us something about the amateurism which unfortunately also penetrates the medical and psychological evidence.
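
    For readers who have not met the Mantel-Haenszel test, here is a minimal sketch of the textbook Cochran-Mantel-Haenszel statistic for several 2x2 tables, one per ward. The three tables are hypothetical and are NOT the tables analysed by Elffers or Derksen; the function name `cmh_test` and the table layout are illustrative choices. The sketch uses the usual large-sample chi-square approximation with one degree of freedom, which is an assumption that is shaky for counts this small, where an exact version would be preferable.

```python
import math


def cmh_test(tables, correct=True):
    """Cochran-Mantel-Haenszel test for a list of 2x2 tables.

    Each table is (a, b, c, d):
      a = shifts with the nurse present and an incident
      b = shifts with the nurse present and no incident
      c = shifts with the nurse absent and an incident
      d = shifts with the nurse absent and no incident
    Returns (statistic, two-sided p-value from chi-square with 1 df).
    """
    diff_sum = 0.0
    var_sum = 0.0
    for a, b, c, d in tables:
        n1, n2 = a + b, c + d      # row totals: present / absent
        m1, m2 = a + c, b + d      # column totals: incident / none
        n = n1 + n2
        diff_sum += a - n1 * m1 / n                       # observed - expected
        var_sum += n1 * n2 * m1 * m2 / (n * n * (n - 1))  # hypergeometric variance
    deviation = abs(diff_sum)
    if correct:
        deviation = max(deviation - 0.5, 0.0)  # continuity correction
    statistic = deviation ** 2 / var_sum
    p_value = math.erfc(math.sqrt(statistic / 2.0))  # chi-square(1) upper tail
    return statistic, p_value


# Hypothetical wards, for illustration only.
HYPOTHETICAL_WARDS = [
    (6, 94, 4, 396),
    (3, 47, 5, 245),
    (1, 19, 4, 176),
]

statistic, p_value = cmh_test(HYPOTHETICAL_WARDS)
print(f"CMH statistic = {statistic:.3f}, p-value = {p_value:.5f}")
```

    The design point is that the wards are kept as separate strata but contribute to a single test statistic, so one p-value comes out at the end and nothing has to be multiplied.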

    Here is Peter Grünwald’s recent talk (in Dutch) at CWI on statistical priestership. He studies in turn the three doors problem, the three prisoners problem, and the one prisoner problem (Lucia). Here is Willem van Zwet’s talk at Eurandom on statistics and law and Lucia. Van Zwet was closing speaker at the Dutch Statistical Society’s annual Statistics and OR Day, March 27, CBS, The Hague. I was the discussant. Links to newer material still, can be found on my homepage. I especially recommend Science vs. Justice, talk to the Dutch science journalists at Bessensap conference.

    Henk Elffers has requested that his statistical reports for the prosecution in the original court case be put in the public domain. The request was honoured in the sense that the documents may now be shown to interested scientists, but not to journalists. So for interested scientists, here are links to English translations of his reports. The two short documents, one of 7 pages and dated 8 May 2002, and one of 3 pages dated 29 May 2002, contain no sensitive or personal information. All factual information contained in them is in fact already in the public domain. Not yet public is de Mulder's testimony. Henk admits that in retrospect, his presentation of his conclusions is unfortunate and could have mistakenly led the reader to believe that his “one in 342 million” should be interpreted as a p-value. He also regrets that he gave way to pressure from the court and adjusted the definition of “incident” for legal reasons, and attempted to combine the data from the two hospitals JKZ and RKZ, which had been gathered in very different circumstances at quite different stages of the investigation. He likes to remind all concerned, that the current formal position of the Dutch courts is that

    1) shift-data statistics play no role whatsoever in the sentencing of Lucia de B.

    2) current jurisprudence forbids it, anyway

    He thinks the Dutch courts are very wise to take these positions. Also,

    he (H.E.) thinks that the case should be reopened.

    Synopsis/reconstruction of the case


    I can’t resist adding a note concerning the irrelevant issue of Bayesian versus frequentist statistics: I wonder how many Bayesians would have included an a priori probability that the data was wilfully manipulated (with the best of intentions, of course), and what it would have been, and whether the court would have admitted this element of the statistical analysis. I am reminded of the saying “80% of statistics are just made up”.

    Research shows that about half of the applications of statistics in scientific journals are seriously flawed. Amusingly though, the blunder of multiplying p-values is not commonly mentioned as one of the common mistakes. Of course one can multiply probabilities of independent events if one is interested in the probability of their simultaneous occurrence. But why should one be interested in that probability? Anyway, a p-value is a convenient transformation of test statistics, so that anyone can do the test according to their personal significance level. No more no less than that. Journalists and lawyers are typically very proud of themselves when they know about multiplication of probabilities under independence. Unfortunately a p-value is not a probability of a prechosen event and it doesn’t make sense to pretend it is. A letter to the editor of a Dutch popular science journal, pointing out this mistake, was not published because, as the editor said, “everyone knows you can multiply independent probabilities”.
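
    Why a product of p-values is not itself a p-value can be shown with one exact formula. If k p-values are independent and each is uniformly distributed under the null hypothesis, then the chance that their product is at most t equals t times the sum of (-ln t)^j / j! for j = 0, ..., k-1. This is Fisher's method written in a different form. The sketch below applies it to the product quoted in the text (one in 342 million, from three p-values) purely as an illustration of the arithmetic; it is not a reanalysis of the case. The uniformity assumption is only approximate for the discrete tests used, and the function name `product_pvalue` is an illustrative choice.

```python
import math


def product_pvalue(product, k):
    """P(P1 * ... * Pk <= product) for k independent uniform p-values.

    Equivalent to Fisher's method: -2 * ln(product) is chi-square
    distributed with 2k degrees of freedom under the null hypothesis.
    """
    if not 0.0 < product <= 1.0:
        raise ValueError("product must lie in (0, 1]")
    t = -math.log(product)
    return product * sum(t ** j / math.factorial(j) for j in range(k))


# Three p-values of exactly 0.05 each: the raw product looks far more
# impressive than the properly combined value.
raw = 0.05 ** 3
print(f"raw product {raw:.6f} -> combined p-value {product_pvalue(raw, 3):.6f}")

# The product quoted in the text, treated as a product of three p-values.
QUOTED_PRODUCT = 1 / 342e6
combined = product_pvalue(QUOTED_PRODUCT, 3)
ratio = combined / QUOTED_PRODUCT
print(f"quoted product : 1 in {1 / QUOTED_PRODUCT:,.0f}")
print(f"combined value : 1 in {1 / combined:,.0f}")
print(f"ratio          : {ratio:.0f}")
```

    The printed ratio shows how much larger the properly combined value is than the raw product. Other combination methods, such as the Cochran-Mantel-Haenszel test, work from the underlying tables and would give different figures.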

    A meta-analysis of the separate p-values from the various wards in which Lucia worked reminds me very much of the research on paranormal phenomena. An initial publication shows a highly significant result. Attempts to duplicate the phenomenon tend to show up just significant results, e.g., just significant at the 5% level if one uses a one-sided test. A suspicion remains that some non-significant results are not published (they would not be accepted for publication anyway). The p-values of the published cases, after the first one, are remarkably independent of the size of the study. Thus, the more repetitions, the less strong is the effect. Presumably the subject gets tired, or his spirits get bored. Some ESP researchers (e.g., Dick Bierman) rather associate this with quantum theory, since though quantum theory does allow very unlikely events to happen, in the long run the averages have to settle down to the expectation values.



    (Last updated: 28 May 2007).