**Petition for a retrial of the Lucia de Berk case**

WARNING: I am a statistician, not a lawyer. I don’t have access to original sources. There is disagreement over the events leading to Lucia’s conviction. I discuss here only the statistical aspects of the case, except in so far as non-statistical information is necessary to follow the statistical arguments.

Read the conclusions of the courts (in Dutch) at “www.rechtspraak.nl”: LJ-numbers AE8436 (08-10-2002),

In the original conviction of Lucia, a statistical analysis played an explicit role. After the appeal, the verdict was maintained but the statistical calculation was removed. However, the data remained and the “obvious” coincidence between incidents and Lucia’s presence remains a crucial step in the prosecution’s case as well as an influence on the evaluation by medical specialists of the medical evidence. I discern the following four biases in the data-gathering and analysis. The fourth bias applies only to the analysis of the data by Henk Elffers. This elementary blunder was not noticed in the original trial by anyone, but it contributed a factor of about one thousand to the famous “one in 342 million” which was widely publicized in the media and has since been in the minds of everyone concerned with the trial, the appeal, and the final hearing in the supreme court.

Ton Derksen, emeritus professor of philosophy at the University of Nijmegen, and his sister Metta de Noo (medical doctor) believe that Dutch nurse Lucia Isabella Quirina de Berk (“Lucia de B”) has been convicted of seven murders and three attempted murders using (among other things) statistical arguments combined with abuse of every basic rule in the statistics text-book. The more I learn about this case the more I am inclined to believe them. The statistical methodology in the case has been discussed extensively for the last five years by the Dutch scientific community, but always while uncritically accepting the raw numbers presented by the prosecution. The case for the prosecution makes crucial use of the coincidence between “suspicious” (unexplainable) “incidents” (crisis situations in medium-intensive care, involving attempted reanimation) and her presence. Hence incidents are caused by her presence. Hence these incidents are murder-attempts, most of them successful.

This “statistical argument” does not stand alone.
In *two* cases the judge considers that medical
evidence proves that
she murdered, and attempted murder, respectively.
I am not qualified to evaluate those cases,
both of which are contested by medical experts,
but together with (contested) psychological
analysis of her personality and personal history
they are supposed to show that she
is a killer with no conscience, no motive, and superb
skill in wiping out *all* traces of her activity.
A supposed similarity between these two cases and the other eight
selected by the prosecution
is enough to convict her of those eight too.
The *suggestion* that this is just the tip of the iceberg
is not part of the legal case (the media initially
reported that she was involved in at least 30 suspicious deaths).

I remind you that the following description of crucial events leading up to the case is disputed.

Amber, a 6-month-old baby born with multiple genetic disorders, died unexpectedly in medium-care. Lucia was on duty and though Lucia had been liked by most of her colleagues, there had recently been gossip about her attitude, her clothing, her past and “strange incidents around her”. In fact Lucia herself had been to see her supervisor, upset because it seemed that deaths and other incidents always happened during her shifts. When Amber died it was too much for one of her colleagues, who told the paediatrician, who went to the director. Amber had died in the night of 4 September 2001, the death filed as “natural”. By the next afternoon the paediatricians agreed the death was “unnatural”, and the hospital’s director had informed the police and fired Lucia from her job. Within a few days she had been associated with a total of 30 incidents at several hospitals where she had recently worked. Quite a few incidents however could later not be linked to Lucia after all (e.g., she was on vacation ...) and became “normal events” again. Very early on, the hospital director had an amateur statistical calculation done, concluding that the chance was one in seven billion that Lucia was innocent. On 16 September the hospital formally reported five murders and five attempted murders to the police.

At an early stage, the police consulted
Henk Elffers,
professor, law faculty, University of Antwerp,
and senior researcher at the research institute
NSCR
(the Netherlands Institute for the Study of
Crime and Law Enforcement is run by the
Dutch national science foundation
NWO).
Elffers quit statistics some 30 years ago, half way into his PhD
studies, and moved to geography, psychology, economics and law.
He convinced the police
that the “coincidence” could not be due to
chance. He told me that at that stage, the police still had
the idea that this was a big fuss about nothing.
He later submitted official reports to the court,
stating that the probability
was 1 in 342 million that such an extreme number of incidents
would be concentrated in her shifts purely by chance.
His reports are available (from me) for scientific study
but are not entirely in the public domain. I hope some
time to prepare an English translation of them.
Elffers’ analysis
involved data from another hospital where Lucia had previously worked,
but, strangely, not from three others where she is supposed also to have been
active, in between those two spells. The data is reproduced and reanalyzed in a
paper by
Meester, Collins, Gill and van Lambalgen,
with discussion, in
*Law, Probability and Risk*.
The court asked expert advice from
Mr. Richard de
Mulder, who is
professor,
law faculty, Erasmus University Rotterdam. According to de Mulder,
they needed Elffers’ analysis explained to them by a non-statistician -
de Mulder has an MBA and specializes in IT in law.
His credentials were that he had written an article in a Dutch
legal-and-IT magazine about some statistical concepts, some
ten years earlier. He declared officially
to the court that Elffers’ analysis was absolutely correct and
proper. Coincidentally he had been collaborating with Elffers during
the previous three or four years in research on computer-guided
teaching and examination for law students.

One in 342 million, one in seven billion – tiny probabilities were common knowledge from the very beginning of the case. Everyone involved in the police investigation (the families of victims, hospital staff, medical experts) knows they have a serial killer on their hands, the worst in the history of the Netherlands, and knows who she is ... Elffers has repeated to me many times some of the rumours and scare stories about the terrible killer nurse, which we now know have absolutely no basis in any facts. Moreover he was deeply moved by the feelings of the poor parents who had lost their children.

There followed a court case and a conviction and intense media
coverage; a lost appeal; a second lost appeal
at the highest possible level. At each appeal procedural
mistakes were corrected and evidence withdrawn
but essentially the conviction on the original grounds
was maintained. Presently, a committee
installed to investigate possible miscarriages of justice,
after several serious recent occurrences,
*might* be about to decide to allow this case
to be reopened. For this it is necessary that they
admit that there is “new evidence”.

For more information, in Dutch, read Derksen’s analysis of the case, or see the main site supporting Lucia de B (de Noo, Derksen), now also in English, and Dutch Wikipedia. There is also an English version of the Dutch site, and an English Wikipedia page.

I am not aware of any readable and sufficiently detailed documentation, from the point of view of the prosecution, in the public domain. “Hard” English information of any kind is almost nonexistent. Here’s an article from the New York Times on the first case, something from the BBC on her appeal, and finally a small item on the possible reopening of the case. For lurid details, just do a Google search on “Dutch serial killer nurse”. Please do not believe everything you read in the newspapers; or on the internet for that matter ;-)

It appears now that there are quite a few statistical problems with the statistical evidence (with or without the statistical calculation) on the basis of which she was convicted of murders and attempted murders. The following list corresponds point by point with the four deadly sins I catalogued above:

Lawyers and doctors find it difficult to grasp that rejecting the null hypothesis of pure randomness does not imply one must believe the alternative of murder (whether this is done formally by a statistical calculation or informally by just glancing at the numbers). Maybe Lucia had more night shifts? Maybe she was such a conscientious nurse that she sometimes noticed a problem with a patient which actually was a false alarm? (At least one of the incidents in her shifts is exactly of this nature.) The court looked summarily at a couple of such possible explanations, which had been mentioned by Elffers “by way of example”, but the court made no effort whatsoever to investigate others, mentioned by Meester, and one (time variation) mentioned implicitly by Elffers (he did not want to look at other time periods because of possible changes over time). As far as I know, no research has ever been done on variation between nurses, or on clustering and time variation in a ward. It is an axiom of the medical profession that there is no difference between different nurses, but the medical profession is not always so sharp in probability theory, cf. Sir Roy Meadow and the British cot deaths... In fact, nurses have a lot of influence on which shifts they work and even use this to avoid, or in other cases to experience, shifts which are expected to be particularly hard. We know now that in the weeks before September 4 one of Lucia’s supervisors was already suspicious of her and deliberately assigned her a difficult shift in order to see what would happen.

The statistical *calculation* is no longer part of the conviction
– the statistical community would not provide
*the* probability, despite repeated requests by the judge (!).
In fact the statisticians’ public dispute
has been so vociferous that the judge obviously felt it unsafe
to rely on any particular number. Consequently, statisticians no longer
take much notice of the case. However, the numbers,
which appear to speak for themselves (but what do they say?), do remain
a vital part of the conviction.

Incidentally, Lucia de Berk has throughout insisted on her innocence, a fact used against her to paint her as a pathological liar; the lack of any medical signs of poisoning shows what a refined killer she was.

The statistical profession in the Netherlands, myself included,
is deeply shamed that *no-one* from my profession has
really investigated how the data was actually gathered ...
It has taken five years, three court cases,
several scientific meetings, numerous articles in
newspapers and semi-professional magazines and professional journals,
before
medical doctor Metta Z. de Noo and her brother,
philosopher of science
Ton
Derksen from the University
of Nijmegen,
have gone deeply into that question ...

History of statisticians’ involvement: Elffers, later supported by Mr. R.V. de Mulder (specialist on law and IT), does the statistics for the investigation and later for the court. Henk Elffers holds a chair in psychology and law, but he has many interests and was initially trained as a mathematician and statistician. Prof. Ronald Meester (probabilist) and Prof. Michiel van Lambalgen (logician and cognitive scientist) become experts for the defense – feeling from the newspaper reports that something is wrong they offer their services of their own accord, and uncover many irregularities. Because they don’t think that Elffers’ null hypothesis is relevant, and they don’t like significance tests anyway, they do not put much emphasis on the fact that within Elffers’ chosen paradigm (and pretending the data is gathered unbiasedly) his statistical analysis is just wrong (in fact, I think they did not understand his analysis at all; their criticism certainly holds water, but would have been so much more devastating if it had been more professional and less academic and if they had made the tactical choice to argue a completely different much larger probability, rather than to argue that in general a statistical analysis was altogether impossible). More newspaper reports and talks lead to a public but in my opinion totally irrelevant debate between Bayesian and frequentist statisticians. I am the first to agree that “applied probability” or “empirical Bayes” calculations or thought experiments are highly enlightening; they too explode the prosecution’s case. Even under pessimistic assumptions on the proportion of serial murderers among the Dutch nursing population, the number of innocent nurses who would accidentally be convicted of murder by statistics is much larger than the number of killers who would be caught. Even worse, allowing for “innocent” heterogeneity among nurses’ rates of experiencing incidents, Lucia’s numbers are not so extreme at all.
Unfortunately the results of thought experiments are not allowed to be used in the courtroom. It is strange that science does progress in this way, but justice does not, yet both are in search of the truth.
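
The base-rate thought experiment mentioned above can be made concrete with a few lines of Python. This is only a sketch of the arithmetic: the number of nurses, the fraction of killers among them, the false-alarm probability and the function name `expected_flags` are all hypothetical values and names chosen for illustration, not figures from the case or from any of the papers cited here.

```python
def expected_flags(n_nurses, killer_fraction, false_alarm_prob, detection_prob=1.0):
    """Expected numbers of guilty and innocent nurses flagged by a statistical screen."""
    killers = n_nurses * killer_fraction
    innocents = n_nurses - killers
    guilty_flagged = killers * detection_prob        # killers the screen catches
    innocent_flagged = innocents * false_alarm_prob  # innocents flagged by chance
    return guilty_flagged, innocent_flagged


# All three inputs are hypothetical, chosen only to illustrate the arithmetic.
N_NURSES = 200_000        # assumed size of the nursing population
KILLER_FRACTION = 1e-5    # assumed (deliberately pessimistic) share of killers
FALSE_ALARM_PROB = 1e-3   # assumed chance that an innocent nurse looks "suspicious"

guilty, innocent = expected_flags(N_NURSES, KILLER_FRACTION, FALSE_ALARM_PROB)
share_guilty = guilty / (guilty + innocent)

print(f"expected guilty nurses flagged:   {guilty:.1f}")
print(f"expected innocent nurses flagged: {innocent:.1f}")
print(f"share of flagged nurses who are guilty: {share_guilty:.3f}")
```

With inputs of this kind the innocent nurses flagged by chance outnumber the guilty ones by a wide margin, even though the screen is assumed to catch every killer. Changing the assumed values changes the ratio, but the innocent group stays the larger one whenever the false-alarm probability exceeds the fraction of killers.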

The debate probably confirms the public’s
view that statistics is all lies and it does not do much credit to the
statistical profession. In retrospect the debate is doubly
irrelevant since no-one has the data and no-one knows how and why the data
was collected and no-one asks. The problem with post-hoc
testing of hypotheses is debated widely,
but most participants don’t realise that
the data is being used not just to prove Lucia’s guilt but simultaneously
to prove that murders are being committed. No-one suspects that
the data could have been essentially *fabricated* to establish a
statistical correlation, hence suspicion of foul play, hence murder,
and hence a murderer.
Throughout all this, the
appalling ignorance of probability and statistics in the legal and
medical professions is sometimes forgotten. Concerning scientific
evidence in general, it appears that
a Dutch judge can choose from
conflicting scientific opinions, picking even the minority opinion,
and whether on statistics, medicine or
psychology, with the only motivation that the choice
supports the (foregone) conclusion of the judge.
Though the statistical arguments find their way into the public
domain and eventually into the international – English language – domain,
the medical and psychological arguments remain largely hidden and certainly
not discussed in any international arena.

Let me end on an optimistic note.
Expert witness for the prosecution Henk Elffers always insisted
that his analysis only showed that the observed coincidence
could not be due to pure chance, not that Lucia caused
the deaths. I have always agreed. Perhaps he is right in
a way he did not suspect.
If de Noo’s and Derksen’s findings are correct,
which I believe them to be, statistical
considerations, especially meta-statistical
considerations (considerations of
statistical sampling methodology – or
if you like, proper scientific procedures),
show that there is no way to
draw any support for the prosecution
from the statistical data that was
previously used to
convict Lucia de Berk of multiple murders.
I trust that lawyers will consider
*this* “new evidence”.
I refer here to the
“committee Posthumus II”.
They can only recommend reopening of any case if
tunnel-vision and misuse of scientific evidence by the police
and/or public prosecution has possibly
led to a miscarriage of justice.
It is apparently impossible that Dutch judges could
have failed in any way.

The above writing (original text 5 and 6 November 2006, since then, many times updated) was precipitated by an email to me (4 November) from computer scientist Peter Grünwald. Here is his eloquent (Dutch) analysis of the then situation.

Article by author Maarten ’t Hart, from NRC Handelsblad, October 7: A Modern-day Witch-trial (in Dutch). Similarity with the case of Beethoven’s great-grandmother, burnt to death on Brussels market square, because a cow died when she was in the neighbourhood (and then a couple more farmers reported the same had happened to them).

My earlier statistical contribution
to the debate is incorporated in the paper:
On the (ab)use of statistics
in the legal case against
the nurse Lucia de B
available at Cornell on arXiv.org, and
now appeared with discussion in *Law, Probability and Risk*,
joint with Marieke Collins, Michiel van Lambalgen, Ronald Meester;
somewhat overtaken by more recent events ...

Some statistical notes, R scripts, data and data-analysis; you might like to install the doubly free statistical package R (aka GNU S), and use copy and paste to learn some statistics for yourself. You will find in my notes, both the original three 2x2 tables analysed by Elffers, as well as the new three 2x2 tables presented to the three wise men of Posthumus II by Derksen; discussion on interpretation of the numbers; ideas for various analyses not yet done; various p-values. You can now see the original raw data as two (csv format) spreadsheets RKZ.csv, JKZ.csv. Here are graphical representations, RKZ.pdf and JKZ.pdf. The original spreadsheets had columns and rows reversed. I have them too, if anyone wants to go back to the true originals.
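
For readers who do not want to install R, here is a minimal Python sketch of one standard way to get a p-value out of a single 2x2 table of shifts: the one-sided Fisher exact test, computed from the hypergeometric distribution. The table entries below are hypothetical, not the numbers from the court case, and the function name `fisher_one_sided` is my own; whether this particular test is the appropriate one is exactly the kind of question discussed in the notes.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: shifts of the nurse / shifts of everyone else.
    Columns: shifts with an incident / shifts without.
    Returns P(X >= a) when the incidents are spread over the shifts
    purely at random, with all margins held fixed.
    """
    nurse_shifts = a + b
    incidents = a + c
    total = a + b + c + d
    upper = min(nurse_shifts, incidents)
    favourable = sum(
        comb(nurse_shifts, k) * comb(total - nurse_shifts, incidents - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, incidents)


# Hypothetical table: 100 shifts with the nurse (5 incidents),
# 900 shifts without her (5 incidents).
p_value = fisher_one_sided(5, 95, 5, 895)
print(f"one-sided Fisher exact p-value: {p_value:.3g}")
```

A small p-value here says only that the hypergeometric null model fits badly. It says nothing about how the incidents were selected, how the shifts were assigned, or why the model fails.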

One in Nine innocent nurses go to jail, if they dress unconventionally. This new little paper (revised, joint with Piet Groeneboom) explores the consequence of innocent heterogeneity among nurses’ shifts: something which most nurses find almost obvious, most medical specialists find impossible. Another paper-in-preparation analyzes Elffers’ mistaken multiplication of p-values. The Mantel-Haenszel test is about strengths and weaknesses of this test - bread-and-butter in modern epidemiology and medical statistics, introduced about the same time as Henk Elffers quit mathematical statistics 30 years ago. That no statistician involved in this case had ever heard of this method tells us something about the amateurism which unfortunately also penetrates the medical and psychological evidence.
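
The effect of innocent heterogeneity can be illustrated with a toy model. This is a sketch under assumptions of my own choosing, not the model of the paper just mentioned: each nurse’s number of incidents is Poisson, and in the heterogeneous version the nurse’s own incident rate is drawn from an exponential distribution with the same mean, which makes the count geometric. The mean `MU`, the count `N_OBSERVED` and both function names are hypothetical.

```python
from math import exp, factorial


def poisson_tail(n, mu):
    """P(N >= n) when every nurse has the same Poisson incident rate mu."""
    return 1.0 - sum(exp(-mu) * mu**k / factorial(k) for k in range(n))


def mixed_tail(n, mu):
    """P(N >= n) when each nurse's rate is exponentially distributed with mean mu.

    Mixing a Poisson count over an exponential rate gives a geometric
    distribution, whose upper tail has the closed form (mu / (1 + mu)) ** n.
    """
    return (mu / (1.0 + mu)) ** n


MU = 2.0          # hypothetical average number of incidents per nurse
N_OBSERVED = 8    # hypothetical number of incidents seen for one nurse

same_rate = poisson_tail(N_OBSERVED, MU)
varying_rate = mixed_tail(N_OBSERVED, MU)

print(f"identical nurses:      P(N >= {N_OBSERVED}) = {same_rate:.4f}")
print(f"heterogeneous nurses:  P(N >= {N_OBSERVED}) = {varying_rate:.4f}")
print(f"ratio: {varying_rate / same_rate:.1f}")
```

Both models have the same average incident rate. The only difference is that in the second one nurses are allowed to differ from each other for perfectly innocent reasons, and that alone makes a large count far less surprising.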

Here is Peter Grünwald’s recent talk (in Dutch) at CWI on statistical priestership. He studies in turn the three doors problem, the three prisoners problem, and the one prisoner problem (Lucia). Here is Willem van Zwet’s talk at Eurandom on statistics and law and Lucia. Van Zwet was closing speaker at the Dutch Statistical Society’s annual Statistics and OR Day, March 27, CBS, The Hague. I was the discussant. Links to newer material still, can be found on my homepage. I especially recommend Science vs. Justice, talk to the Dutch science journalists at

Henk Elffers has requested that his statistical reports
for the prosecution in the original court case be put
in the public domain. The request was honoured in the
sense that the documents may now be shown to
interested scientists, but not to journalists.
So for interested scientists, here are links
to English translations of his reports.
The two short documents,
one of 7 pages and dated 8 May 2002, and one of
3 pages dated 29 May 2002, contain no sensitive
or personal information. All factual
information contained
in them is in fact already in the public domain.
Not yet public is de Mulder's testimony.
Henk admits that in retrospect, his presentation of
his conclusions is unfortunate and
could have mistakenly led the
reader to believe that his “one in 342 million”
should be interpreted as a p-value. He also regrets
that he gave way to pressure from the court
and adjusted the definition of “incident”
for legal reasons, and attempted to combine
the data from
the two hospitals JKZ and RKZ, which had
been gathered in very different circumstances
at quite different stages of the investigation.
He likes to remind all concerned, that the current
*formal* position of the Dutch courts is that

1) *shift-data statistics play no role whatsoever
in the sentencing of Lucia de B.*

2) *current jurisprudence forbids it, anyway*

He thinks the Dutch courts are very wise to
take these positions. Also,

*he (H.E.) thinks that
the case should be reopened*.

**Synopsis/reconstruction of the case**


I can’t resist adding a note concerning the irrelevant issue of Bayesian versus frequentist statistics: I wonder how many Bayesians would have included an a priori probability that the data was wilfully manipulated (with the best of intentions, of course), and what it would have been, and whether the court would have admitted this element of the statistical analysis. I am reminded of the saying “80% of statistics are just made up”.

Research shows that about half of the applications of statistics in scientific journals are seriously flawed. Amusingly though, the blunder of multiplying p-values is not commonly mentioned as one of the common mistakes. Of course one can multiply probabilities of independent events if one is interested in the probability of their simultaneous occurrence. But why should one be interested in that probability? Anyway, a p-value is a convenient transformation of test statistics, so that anyone can do the test according to their personal significance level. No more no less than that. Journalists and lawyers are typically very proud of themselves when they know about multiplication of probabilities under independence. Unfortunately a p-value is not a probability of a prechosen event and it doesn’t make sense to pretend it is. A letter to the editor of a Dutch popular science journal, pointing out this mistake, was not published because, as the editor said, “everyone knows you can multiply independent probabilities”.
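
The point about multiplying p-values can be checked numerically. Under the null hypothesis, independent p-values from continuous test statistics are uniform on (0, 1), and the chance that the product of k of them falls below x is not x but x times a sum of powers of -ln x (this is the calculation behind Fisher’s method of combining p-values). The sketch below is only an illustration: the function name `prob_product_at_most` is my own and the three p-values are hypothetical, not those of the court reports.

```python
from math import factorial, log


def prob_product_at_most(x, k):
    """P(p_1 * ... * p_k <= x) for k independent Uniform(0, 1) p-values.

    -ln(product) is a sum of k standard exponentials, i.e. Gamma(k, 1),
    so the probability is x * sum_{j < k} (-ln x)^j / j!.
    """
    if not 0.0 < x <= 1.0:
        raise ValueError("x must lie in (0, 1]")
    t = -log(x)
    return x * sum(t**j / factorial(j) for j in range(k))


# Hypothetical p-values from three separate wards.
p_values = [0.01, 0.02, 0.05]

naive = 1.0
for p in p_values:
    naive *= p                      # the tempting but wrong "combined p-value"

honest = prob_product_at_most(naive, len(p_values))

print(f"naive product of p-values:        {naive:.3g}")
print(f"actual chance of such a product:  {honest:.3g}")
print(f"understatement factor:            {honest / naive:.1f}")
```

For three p-values the understatement factor is 1 + t + t²/2 with t = -ln(product), so the smaller the product, the larger the error. With a single p-value (k = 1) the function returns x itself, as it should.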

A meta-analysis of the separate p-values from the various wards in which Lucia worked reminds me very much of the research on paranormal phenomena. An initial publication shows a highly significant result. Attempts to duplicate the phenomenon tend to show up just significant results, e.g., just significant at the 5% level if one uses a one-sided test. A suspicion remains that some non-significant results are not published (they would not be accepted for publication anyway). The p-values of the published cases, after the first one, are remarkably independent of the size of the study. Thus, the more repetitions, the less strong is the effect. Presumably the subject gets tired, or his spirits get bored. Some ESP researchers (e.g., Dick Bierman) rather associate this with quantum theory, since though quantum theory does allow very unlikely events to happen, in the long run the averages have to settle down to the expectation values.


(Last updated: 28 May 2007).