Statistiek voor Sterrenkundigen (en natuurkundigen)

Astrostatistics


Lectures: Tuesdays 9:00--10:45 Huygens Lab 204

(except in first two weeks, i.e., Sept. 9 and 16: Friday 11:15--13:00)

Office hour (second half of semester): Fridays 11:00--13:00. My office is room 230, Snellius building. You might like to call me on my mobile phone before walking over there, just in case: 06 2242 7127


PRE-COURSE HOMEWORK: install the (free) statistical environment R on your laptop/computer at home or find it on your institute's computer systems (if it is not there, ask the system administrators to install it). Run through a simple getting-started tutorial to get familiar with its quirky style. 

You can find R at

http://www.R-project.org.

The course is based on material from the world famous Penn State University summer schools on statistics for astronomers (astrostatistics!). Please take a good look at the home page of latest summer school,

http://astrostatistics.psu.edu/su11scma5/su11program.html.

Download the course materials for the first day of the five day course. You will find there:
You can also find a huge pdf file with all the handouts of the whole summer school combined into one document.

We will work through all the basic material in depth, and take more superficial looks at some of the special topics towards the end of the course (November-December).

Part of your credit for the course will consist of assignments in R.

October-November

We move from probability to statistics. You now need the two lectures on inference from days 1 and 2, and the next R lab session for Astronomy. This refers to data-sets which you can find at http://astrostatistics.psu.edu/datasets

It is useful to be able to combine R code, results, and plots with LaTeX. The key tool for this is called Sweave. It is both a standard R package and a LaTeX style file, sweave.sty. Instructions from within R:  do
help("Sweave", package="utils").
Typically, while working in R and documenting what you are doing as you do it, you build up a so-called Rnw ("R no web") file, it is a plain text file which contains alternating snippets of LaTeX code and R code. 
More precisely: it looks like a normal latex file, with pieces of R code marked as follows:
<<option=value, option=value>>=
x <- y      # typical line of R code
...
plot(z)     # typical line of R code
@
You load this file into R by the command
Sweave("yourfile.Rnw").
R will perform all the R snippets, and according to your wishes, as indicated in the options for each snippet, the R code snippets themselves, their results, and references to plotted images (pdf files, for instance), will be combined in an output yourfile.tex file. Now run LaTeX on this file in the usual way to generate a beautiful report of your work.

An example by myself is histogram.Rnw. Try it out!

Your final result should be absolutely identical to mine, histogram.pdf.

Books.

Two books can be considered as basis of the course (I will be consulting both of them a lot). I highly recommend you get both. If you can only afford one, go for Wall and Jenkins this time. There are eBook editions of both.

J.V. Wall and C.R. Jenkins, Practical Statistics for Astronomers (2003), CUP
John A. Rice, Introduction to Mathematical Statistics and Data Analysis, 3rd edition (2007), Duxbury

The structure of the Penn State course is very close to that of Wall and Jenkins book. My lectures are often unstructured and occasionally I miss important things. Reading Wall and Jenkins, you will be sure not to have missed anything. However they are often careless on mathematical details and say things which are not really true. Be warned! That's where Rice can come in - that is generally speaking closer to the truth.

I will be giving you some theoretical exercises from Rice's book, which we also use over at the Mathematics Department in a lot of our basic teaching of probability and statistics for mathematicians. And when Wall and Jenkins say that something can be calculated or proved, Rice will probably actually do the calculation or outline the proof explicitly.

I will be trying to learn some astronomy from the PSU summer school material, from Wall and Jenkins, and from you.

Exercises based on the slides for the lectures on probability.

The first collection of exercises come (almost all) from the slides "prob_lectures.pdf" (Laws of Probability, Bayes' Theorem, and the Central Limit Theorem). You will be able to find solutions to many of them in the actual text of Rice's book. There are lots of excellent exercises for you also in Wall and Jenkins. Try them! The structure of the book follows closely the structure of the course (or vice versa). If you get stuck in mathematical details, you will probably find exactly what you need in Rice's book as a worked example in the text.

1. (p. 16). Construct a discrete sample space for the experiment "roll a fair die till you get your first six". What is the probability that you have to throw the die for ever (and never ever get a six)?

2. (p. 27: matching problem). A large number n of persons randomly permute their hats among themselves. Show that the probability that exactly one person get's their own hat is approximately exp(-1). What can you say about the probability that exactly k persons get their own hat? Do you recognise this probability distribution? Does it make sense?

3. (p. 40: Bayes' rule). Show that for any events H, K and E, P(H|E):P(K|E) = P(H):P(K)  times P(E|H):P(E|K). In words: posterior odds equals prior odds times likelihood ratio (or Bayes' factor). Use this result to answer the question: in the three doors problem, what is the conditional probability that the car is behind door 2, given that the player chose door 1 and the host opened door 3 to reveal a goat? What assumptions did you make?

4. (p. 46). Show that Var(X)=E(X2)-E(X)2

5. (p. 47). Compute Var(X) (a) when X takes the values +1, -1 each with probability 1/2; when X takes the values +100, -100 each with probability 1/2.

6. (p. 48). Suppose U is uniformly distributed on the interval (0,1), i.e., for any subinterval (a,b), P(a < U < b)=b-a. Define V=ceiling(-log(U)/log(2)). Show that P(V= k) = 2-k, k=1,2,... . Therefore, X=2V takes the values 2, 4, 8 ... with probabilities 1/2, 1/4, 2/8 ... What is E(X)? Use R to study the behaviour of the average of n independent copies of X as n becomes large. Hint: you might find the functions runif, floor, cumsum useful.

7. Bell's inequality.
(a) Suppose X1, Y1, X2, Y2 are four random variables with take the values +1 and -1 only. Note that (X1Y1).(Y1X2).(X2Y2.(Y2X1)=1 whatever the values taken by the four variables X1, Y1, X2, Y2. Deduce that Y2X1cannot be smaller than X1Y1+Y1X2+X2Y2-2.
Hint: consider the two cases Y2X1=+1 and -1 separately and use the fact from the identity which you found earlier, that the number of minus signs in the four products must be even.
Deduce that E(X1Y1)+E(Y1X2+E(X2Y2)-E(Y2X1) cannot exceed 2.
(b) Compare this inequality to the prediction of quantum mechanics when measuring polarization of two entangled photons measured in directions ai and bj respectively, i, j = 1, 2. When photon 1 is measured in direction i it either passes through the polarization filter or not; we encode this with a +1 or -1 and call the resulting random variable Xi. Similarly Yj denotes the outcome encoded +/-1 for photon 2. The two measurement directions for photon 1 are 0  and 45 degrees; those for photon 2 are 22.5 and 77.5 degrees. Each photon separately, whatever the direction, goes through the filter with probability half. Notice that for three of the combinations, the two photons are being measured at angles which differ by 22.5 degrees; in just one combination, the difference is 77.5 degrees.
According to quantum optics, the probability the two photons "do the same" is the square of cos(ai-bj). It is large in three cases and small in just one. What is E(XiYj )in each of the four cases (pairs of angles)? Compare to Bell's inequality: surprise?

8. (p. 49). If X has the binomial distribution with parameters n and p, prove that E(X)=np and var(X)=np(1-p). Hint: you might find it useful first to compute E(X) and E(X(X-1)), observing that E(X(X-1))=E(X2)-E(X).

9. (p. 49). Download and read: A. Meszaros, On the role of Bernoulli distribution in cosmology, Astron. Astrophys., 328, 1-4 (1997). Are you convinced?

To be continued...

But you can also create your own exercises in this way, don't wait for me.  Every time the slides say "one can show that" go ahead and do it. Every time. When you can't do the calculationsyourself, consult with Rice. And when there is a reference to an astronomy paper, get the paper and see what you think. When the paper is unavailable, ask me. I have already paid good money for a couple of what turned out to be completely worthless papers, so at the least, to have some benefit from this, I can share them with you by email.


Exercise on conditional probabilities:
The definition of P(A|B), when thought of as a function of all possible events A, defines a new probability measure on our original sample space. We started with a particular probability measure P( . ) and from that got a new one, P( . |B).  One sets all probability outside of B to zero and renormalizes probability inside B so as to retain total mass 1.

Denote by PB the conditional probability measure "P given B" got from a given initial probability measure P and a given event B. Write A.B for "A intersection B". Prove that (PB)A = PA.B

This justifies the common notation P(A|B,C) which can either be read as conditional probability got by conditioning first on B, then on C (or first on C, then on B); and also as the one-step conditional probability P(A|B.C).


Useful tricks for getting expectation values: generating functions

The probability generating function g(t) of a random variable X which takes nonnegative integer values only, is the function EtX. Differentiating k times and putting t equal to zero gives the probability that X=k, hence the name. Differentiating k times and putting t equal to 1 gives the expectation value of E(X(X-1)...(X-k+1).

The moment generating function m(t) of a random variable X is the function EetX. Differentiating k times and putting t equal to zero gives the expectation of Xk, hence the name.

But - of course! - neither of these functions need exist since the expectations concerned might be infinite or worse. A lot of deep probability theory is done using the characteristic function phi(t)=EeitX which always exists. In fact is is simply the Fourier transform of the probability density or mass function. Since it always exists and the original distribution can be recovered by inverse Fourier transform, it characterizes the distribution entirely, hence the name.

Exercise on generating functions:
Supposing all generating functions of a particular random variable X exist and are nice enough, show that each can be computed from the other by appropriate change of variable (or analytic continuations). Compute them all when X is Poisson distributed with parameter lambda.


Exercises from the first two chapters of Wall and Jenkins:

If you don't have the book yet, download Wall-Jenkins-exercises-1-2.pdf (including some excerpts from chapter 2).

Try especially the exercises from chapter 2; especially the first of these (use R). And try the exercises on updating beliefs about rho, the rate of supernovas, using Bayes' theorem and various prior distributions (excerpts from the text on this topic included in the just-mentioned pdf).

Note on Bayes' rule

In the class-room we saw the key result
Prob(Theory | Data)   is proportional to   Prob(Data | Theory)  times  Prob(Theory)
or in words
posterior   is proportional to   likelihood  times  prior
The same rule holds for probability densities and probability mass functions and even combinations of the two.
When we say A is "proportional to" B we mean that the left hand side, A, equal a constant times the right hand side, B, as something varies, other things being fixed. (The "constant" could well depend on the "other things"). What is supposed to vary and what is supposed to be fixed depends very much on context. Here, of course, the thing that may vary is Theory, the other thing that is fixed is Data.

For instance, in the supernova exercises of Wall and Jenkins chapter 2, the posterior probability density of the supernova rate rho is proportional to the discrete probability of observing precisely X=x supernovas, given rho (that is what we call the likelihood for rho coming from the data x); times the prior probability density of rho. Having observed X, the value which we actually saw (e.g., 4 in 10 centuries), x, is fixed. What we think of as varying here are hypothetical values rho of the true, underlying, actual, rate of naked-eye-visible supernovas throughout all human history, past and future. Which of course is unknown. We'ld have to watch the sky for ever in order to determine it. But having seen 4 supernovas in 10 centuries would seem to tell us something about it.



Final note on notation: one might write both p(x;rho) and p(x|rho) for a probability mass function (a function of x) of a random variable X, given the value of some parameter rho. The latter notation is suggestive of a conditional probability density. And from a Bayesian perspective that is precisely what it is. I use either notations without really noticing what I am doing. And different authors may other notations too, for instance prho(x).

If we observe data x in order to learn about an unknown parameter rho then p(x|rho) , considered as a function of the unknown parameter rho for given data x, is called the likelihood function. We might then write lik(rho|x)=p(x|rho). Likelihoods only ever make sense up to proportionality, they are used to compare different values of the parameter for the same data. So a multiplicative factor which is constant from this point of view can be simply deleted and the result is also still called "the" likelihood function. It's a function of rho, the parameter.

For many reasons the logarithm of the likelihood is very very useful. A common notation is l(rho)=log lik(rho) for given data. An additive constant can be deleted. (That can be a constant which depends on the data!)

likelihoods and log likelihoods are functions of the possible values of the unknown parameters. Since they also do depend on the data (after all, the data is telling us about the parameter, through the likelihood, hopefully!) we can think of them "before actually observing the data" as being random functions (one random variable for each potential parameter value), just as the data itself is a collection of random variables.

Another R Exercise: pi in the sky

Do this on your own computer. Do it also with a few different seeds to the random number generator.
What is going on?
Can you get results for N = 100 million, 1000 million, ... ?
How far can you go?
Do you prefer pi1 or pi2? Why?
Error bars? By theory, or empirically?
Think of another way to estimate pi experimentally (by computer simulation). Which method is better?

> set.seed(11091951); N<-100
> for (i in 1:10) {
+ N<-10*N
+ data<-runif(N)
+ data2<-c(data,1-data)
+ print(list(N=N,pi1=4*mean(sqrt(1-data^2)),pi2=4*mean(sqrt(1-data2^2)))) }

$N [1] 1000 $pi1 [1] 3.144928 $pi2 [1] 3.133383
$N [1] 10000 $pi1 [1] 3.152252 $pi2 [1] 3.141676
$N [1] 1e+05 $pi1 [1] 3.139636 $pi2 [1] 3.140086
$N [1] 1e+06 $pi1 [1] 3.142816 $pi2 [1] 3.141791
$N [1] 1e+07 $pi1 [1] 3.141426 $pi2 [1] 3.141593
Error: cannot allocate vector of size 1.5 Gb

First Assignment

This file LMC.csv contains the results of determination of the distance to the Larger Magellanic Cloud according to a large number of different methods, from a paper by Clementini et al. published in Astronomical Journal, eprint at arXiv:astro-ph/0007471. This is a link to a pdf of the arXiv eprint. The data was taken from Table 6 of the paper, and it is visually represented in panel (b) of Figure 8 (the last figure) of the article. The data is also briefly discussed in the summer school lecture notes "day2.pdf" Loading, summarizing and visualizing data.

Your task: estimate the distance to LMC and quantify the accuracy of your estimate. Compare the results obtained using two different models. Model 1: Gaussian: each measured distance is equal to the true unknown distance plus an independent Gaussian error with mean zero and variance sigma-squared, where you may take sigma (different for each method) equal to the values ("Error") given by Clementini et al. Model 2: Laplace, aka double exponential. The probability density of the error is exp(-|error|/sigma)/2 sigma where again you may take sigma equal to the "Error" values given by Clementini et al.

Some points you may wish to pay some attention to: is there a difference between the distance based on "population 1" and on "population 2"? Is it really reasonable to think of the error of each distance measurement as being independent of one another? Note in this connection that there are some subgroups of measured distances in which different methods are being applied to the same subpopulation of stars (see last column of the spread-sheet data and Clementini et al). Finally, which of the two statisticial models, if either at all, do you think is the most appropriate for this analysis?

Some remarks on R: R is a scripting language; when you run R programs, each line is translated into binary code and executed if and when it is executed. Every line in a "do" loop is translated (compiled) every time anew, on each iteration of the loop. Don't do do loops, if you can avoid them! On the other hand, R has many functions which operate on (multi) arrays and sub-arrays. Look at examples from the R tutorials in the summer-school to get inspiration. Some powerful functions are: outer(), which creates the outer-product, or outer-anything, from two arrays; apply(), which applies a function to sub-arrays of an array. So outer() creates objects with a larger number of dimensions from those with which you started; apply() reduces numbers of dimensions; functions like sum(), max() convert arrays to numbers, reducing the number of dimensions to zero (to a scalar).

Here are some example R scripts:

randomwalk.R is my solution to the coin-tossing game problem: the first exercise from Chapter 2 of Wall and Jenkins. My apologies that there are no comments in the code! Run subsequent pieces of it at a time by copy and paste, watch what happens, find out what has happened. There are a number of plotting calls, each new call throws away the previous picture, so if you copy and paste the whole thing to the R console you'll only see the last plot.

The problem: Two people are playing heads and tails. We are interested in the number of heads minus the number of tails as function of the number of coin tosses so far. Player 1 is in the lead when this is positive, player 2 is in the lead when this is negative. Altogether they do N tosses, where N is large. During the course of the game the lead typically changes a number of times. Roughly how often do you guess this would be, as function of N? (e.g., proportional to N?). And when, roughly, do you expect that the lead changes for the last time? (e.g., around halfway?).

cauchylik.Rnw is a Rnw file (interleaved R and latex: i.e., a self-documenting R script) with which the pdf file cauchylik.pdf can be produced. It tells a story of estimating the so-called Cauchy distribution with maximum likelhood.

bootstrap.R is an R script for doing a plain and a studentized bootstrap of the mean, and reporting 95% confidence intervals


Second assignment

My apologies that this is taking longer to prepare for you than I had hoped. I believe I am now nearly there (noon, 20 Dec, 2011). The assignment is going to concern the distribution of the luminosity of globular clusters. Here are three articles which provide background for the subject: van den Bergh (1985), Secker (1992), Secker and Harris (1993). Data from these examples is analysed in the Penn State summer school lecture notes on statistical inference: Inf1.pdf, Inf2.pdf. I have already been able to convert one of the data sets of these papers to suitable format: the data of van den Bergh's Table 1. This data is used as an illustration in Inf1.pdf. Data of Secker (1992) is used as an illustration in Inf2.pdf.

I am going to ask you to fit t-distributions to two of these data sets, corresponding to two different populations of globular clusters, and to describe the difference between the two populations as succinctly as possible. Note that the second just mentioned lecture notes, Inf2.pdf,  does do the estimation job on a sample from one population of globular clusters, data from Secker (1992).  You can find out about the t-distribution in Inf2.pdf and in Secker (1992). It is one of the standard distributions in R. The t-distribution has a shape, a scale and a location parameter. The shape parameter allows it to interpolate between Gaussian at one extreme, and Cauchy at the other extreme.

If you fit the parameters of the distribution to two data-sets with maximum likelihood, you can estimate standard deviations of your estimates by looking at minus the inverse of the 3x3 matrix of second partial derivaties of the log likelihood at its maximizer. Large sample theory of maximum likelihood estimators tells us that the maximum likelihood estimates are approximately Gaussian distributed around the truth with a covariance matrix which can be estimated by the negative inverse of this matrix of second derivatives. In particular, there are three estimated variances on the diagonal. Now you can compare two estimated parameters, one based on one sample, one based on the other, by the fact that the difference of two independent normal distributed variables is also normal, with variance equal to the sum of the variances of the two separate variables.

Because each sample in question can be thought of as sample of i.i.d. (independent and identically distributed) observations from one distribution, you can use a standard function in R for fitting a given distribution, by maximum likelihood, to an iid sample. Load the package MASS by the command library(MASS) and take a look at the function fitdistr(), so of course the first thing you should do is help(fitdistr).

Later today (or early tomorrow morning) I hope to be able to give some more instructions, in particular, which two populations of globular clusters are to be compared! (One possibility is going to be two of the samples which are actually subsets within van den Bergh's first table.

Update (assignment 2)

Now I have finished the data-preparation for this second assignment. Three files contain the data "V" (magnitudes) from three classes of  M31 Globular Clusters: those with B - V smaller than 0.70, those with B - V between 0.70 and 1.00, and those with B - V larger than 1.00.  I call those three files
vdb_table_1_low.csv,
vdb_table_1_med.csv,
vdb_table_1_high.csv
They contain respectively 65, 188 and 118 observations. Read these three data-sets into R, calling them for instance low, med, high. Then the command boxplot(low,med,high) gives a quick visual summary of the three sets of magnitudes. You will see that the sample with low B - V is clearly different from the other two, but that it is not so clear if there is a strong difference between those with medium and high B - V. Van den Bergh gives a rather sensible reason why the globular clusters with low B - V are actually a biased sample from the "true" population of such clusters: clusters which are very faint with respect to the background are missed altogether. This is an important phenomenon called "random truncation" which we could not deal with in this introductory course, but which is a severe complication to many astrostatistical analyses.

Notice that the function fitdistr() in the MASS library in R knows about the t distribution. In fact, help(fitdistr) even gives an example fitting t-distributions to simulated data in two different scenarios. To explain these scenarios I must  say a little more about what the t-distribution is. The (standard) t-distribution with k degrees of freedom is the distribution of

a standard Gaussian
divided by
the square root of
     1 divided by k     times     the sum of k more squares of independent, standard Gaussians.

At least, that defines the distribution for integer values of k. However the corresponding probability density can be very sensibly defined also for non-integer k, so from now on k is a real positive parameter. Notice that as k becomes larger and larger, the denominator becomes more and more constant and equal to 1 (by the law of large numbers). Therefore for very large k, the distribution is close to a standard Gaussian. For k=1 the distribution is the same as the distribution of the ratio of two independent standard Gaussians. This is one way to define the so-called Cauchy distribution, which has very heavy tails (the probability density goes to zero like 1 divided by x squared).

k is often called the "degrees of freedom", and from now on I will denote it by "df" instead of by "k".

In our situation we want to introduce two more parameters, which we will call a location and a scale parameter. Let's denote them by mu and sigma respectively. Then by definition, the three parameter t-distribution, with parameters mu, sigma and df, is the probability distribution of mu + sigma * T, where mu is a real number (the location parameter), sigma is a positive real number (the scale parameter), and T has a standard t distribution with df degrees of freedom (the shape parameter).

Your task:
(1) fit three-parameter (location, scale, shape) t-distributions to the three samples.
(2) Discuss how well these distributions fit the data; and
(3) compare the samples by comparing corresponding parameter values.

I would recommend the use of QQ-plots (empirical data versus a theoretical fitted t-distribution) to evaluate "by eye" how well the model is fitting.

Do help(fitdistr) to see examples of the estimation part, and take note of the various values returned by the function fitdistr.

There are two main ways to compare two distributions with some parameters the same, some different. One way is to simply estimate the parameters of the two models on the two datasets completely separately, compute estimated standard errors, and compare the difference between parameter values to the corresponding standard error of the difference (equals the square root of the sum of the variances). Another way is to fit to the data of two samples two different models: one in which all parameters are free, and another in which certain of the parameters are fixed equal to one another. Naturally, the maximized log likelihood of all the data is larger when all parameters are free than when some of them are constrained to be equal to one another. By standard theory of maximum likelihood estimators and generalized likelihood ratio tests, under the null-hypothesis that the constrained model is actually true, the difference between the two maximized log likelihoods, times two, is approximately distributed as a chi-square random variable with degrees of freedom equal to the number of independent constraints involved in going from the general situation to the constrained model. And the chi-square distribution with "df" degrees of freedom is the distribution of the sum of that many squares of independent standard Gaussians.

The second way is more complicated but likely to be more accurate (i.e., the approximation to chi-square distribution etc will be more reliable at smaller sample sizes already, than the approximaton of the test statistic to standard Gaussian obtained from the first way).

The intuition behind that is that the more free parameters we have over which to optimize twice the log likelihood, even when these parameters are superfluous, the larger the likelihood can be purely by chance; it turns out that the excess obtainable by merely capitalizing on random variation has approximately this chi-squared distribution. On the other hand, if the extra parameters are not superfluous, but really are needed, then the difference between twice maximized log likelihoods tends to be a whole lot larger, since the constrained model actually fits the data a whole lot worse than the unconstrained model.

Tentamen

Bij het opstellen van tentamen vragen zal ik mij laten inspireren door oude tentamens van de twee wiskunde colleges Kansrekening en Statistiek 1, Kansrekening en Statistiek 2 (eerste en tweede jaars respectievelijk), waaruit de nieuwe college gegroeid is. Voor dit jaar was het namelijk gangbaar dat astronomie studenten alleen de tweede van die twee colleges volgden, wat de voor de hand liggende problemen veroorzaakten... Maar vanaf dit jaar hebben jullie een eigen college die beide onderdelen bevat, met nadruk op de praktijk van de astronomie.

U kunt oude tentamens inzien op de twee webbladzijden wiskunde tentamens KS1 en wiskunde tentamens KS2

Natuurlijk zijn er een aantal onderwerpen in de wiskunde colleges, die bij jullie niet aan de orde zijn gekomen (bijvoorbeeld, Markov ketens), en zijn er meer wiskundige details en achtergronden geweest. Maar wat de tentamens betreft constateer ik weinig verschil in (door mij gewenste) niveau en inhoud. Ik zal mijn best te focussen op de onderwerpen die ik tijdens de college als centraal naar voren heb gebracht, en de wiskunde redelijk elementair te houden. Focus op de eenvoudiger opgaven bij de oude tentamens KS1 en KS2. En de onderwerpen die in de, ruwweg, eerste helft van de Penn State summer-school aan de order kwamen: kansrekening en "inference".

De twee wiskunde colleges focussen respectievelijk op de kansrekening en de statistiek, waarbij de kansrekening een onmisbaar grondslag vormt voor statistiek. Jullie hebben harder moeten werken aan praktische statistiek door middel van de twee computer opdrachten, dan gewoonlijk is bij de wiskunde studenten.

De tentamen zal open boek zijn. Je kunt fotokopieen en aantekeningen meenemen.