Statistiek voor Sterrenkundigen (en natuurkundigen) / Statistics for Astronomers (and Physicists)

Astrostatistics

This year (Autumn 2012) we used the brand-new CUP book by Eric D. Feigelson and G. Jogesh Babu, Modern Statistical Methods for Astronomy (with R applications). In 2011 we used the material from the website http://astrostatistics.psu.edu/su12/su12program.html of the same authors' well-known Penn State summer schools on statistics for astronomers (astrostatistics!), on which their new book is based; you may still find that material useful in combination with the book. Some students may appreciate a more formal and extensive mathematical treatment. For this purpose I recommend Mathematical Statistics and Data Analysis by John A. Rice.

Lectures and "werkcollege" (exercise classes):

Tuesdays 9:00--10:45 and Fridays 9:00--10:45, Huygens Lab 226
Office hour: Tuesday afternoons. My office is room 213, Snellius building. You might like to call me on my mobile phone before walking over there, just in case: 06 2242 7127.

Pre-course homework

(1) Install the (free) statistical environment R on your laptop/computer at home or find it on your institute's computer systems. You can find it at http://www.R-project.org. Run through a simple getting-started tutorial to get familiar with its quirky style; a few first commands to try are sketched just after this list.
(2) Get a copy of the book, Feigelson and Babu (2012).
(3) Print out the first collection of "werkcollege" exercises, StAN_exercises_1.pdf.
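
For instance, the following few lines are a purely illustrative first session: they create some data, summarize it, plot it, and open a help page.

x <- rnorm(100)    # 100 random numbers from a standard Gaussian
summary(x)         # numerical summary: minimum, quartiles, mean, maximum
hist(x)            # a quick histogram
help(mean)         # how R documentation pages look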

If you have to wait before you can get the book, it wouldn't be a bad idea to take a look at the home page of the latest summer school, http://astrostatistics.psu.edu/su12/su12program.html. In particular the course materials for the first day of the five-day course are worth a look.
We'll be covering the same material in the first weeks of our course. We then go on to work through all the basic material (summer school/book) in depth, and towards the end of the course we may take a more superficial look at some of the special topics. The basic material on statistics at the summer school is contained in the two sets of slides on Inference.
As a supplement, here are slides by me on the definition of distribution of a random variable, density and mass function, distribution function and quantile function, with two examples from R.

Werkcollege

First collection of "werkcollege" exercises, StAN_exercises_1.pdf.
Second collection of "werkcollege" exercises, StAN_exercises_2.pdf.
Third collection of "werkcollege" exercises, StAN_exercises_3.pdf.
Fourth collection of "werkcollege" exercises, StAN_exercises_4.pdf.
Fifth collection of "werkcollege" exercises, StAN_exercises_5.pdf. This is more of a small R project than a classroom exercise. It is related to notes I have written on fitting the Schechter law (gamma distribution) and the Salpeter law (Pareto distribution), see below, inspired by corresponding sections in Chapter 4 of Feigelson and Babu. In class I walked you through the R script gammafit.R, which compares maximum-likelihood and minimum chi-square estimation for the gamma/Schechter case; a small stand-alone sketch of the maximum-likelihood part appears just after this list. You might find it useful for exercise sheet 5, the Pareto/Salpeter case.
Sixth collection of "werkcollege" exercises, StAN_exercises_6.pdf.
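
For orientation, here is a minimal stand-alone sketch (not the course script gammafit.R, and covering only the maximum-likelihood part) of fitting a gamma/Schechter-type distribution in R; the simulated data, parameter values and starting values are purely illustrative.

set.seed(1)
x <- rgamma(500, shape = 1.5, rate = 2)    # simulated stand-in for a sample of luminosities

negloglik <- function(par)                 # par = c(shape, rate)
  -sum(dgamma(x, shape = par[1], rate = par[2], log = TRUE))

fit <- optim(c(1, 1), negloglik, hessian = TRUE)
fit$par                                    # maximum-likelihood estimates of shape and rate
sqrt(diag(solve(fit$hessian)))             # approximate standard errors from the inverse Hessian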

Assignments

Part of your credit for the course (30%, to be precise) will consist of two assignments in R. It is useful to be able to combine R code, results, and plots with LaTeX. The key tool for this is called Sweave: a function in R's standard utils package together with a LaTeX style file, Sweave.sty. To see instructions on using Sweave from within R (i.e., while working at the R command line), do help("Sweave", package="utils").

Typically, while working in R and documenting what you are doing as you do it, you build up a so-called Rnw ("R noweb") file: a plain-text file containing alternating snippets of LaTeX code and R code. More precisely, it looks like a normal LaTeX file, with pieces of R code marked as follows:

<<option=value, option=value>>=
x <- y      # typical line of R code
...
plot(z)     # typical line of R code
@

You load this file into R (working from the R command line) with the command Sweave("yourfile.Rnw"). R will execute all the R snippets. Moreover, according to your wishes as indicated in the options for each snippet, the R code in the snippets, their results, and references to plotted images (pdf files, for instance) will be combined in an output file, yourfile.tex. Now run LaTeX on this file in the usual way to generate a beautiful report of your work. A rather simple example by me is histogram.Rnw. Try it out! Your final result should be absolutely identical to mine, histogram.pdf.
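
For concreteness, here is a minimal sketch of that workflow from the R prompt; tools::texi2pdf is one way to run LaTeX from within R, but running pdflatex by hand on yourfile.tex works just as well.

Sweave("yourfile.Rnw")             # runs the R chunks and writes yourfile.tex
tools::texi2pdf("yourfile.tex")    # runs LaTeX to produce yourfile.pdf
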
Here are two more substantial astronomy examples: fitting the Salpeter function (Salpeter Initial Mass Function) and fitting the Schechter function (Schechter Luminosity Function).
stanSalpeter.Rnw, stanSalpeter.pdf
stanSchechter.Rnw, stanSchechter.pdf
You'll find discussions of both functions in Chapter 4 of Feigelson and Babu; and articles on both on Wikipedia.

Literature

The course is primarily based on the new book by Eric Feigelson and Jogesh Babu, Modern Statistical Methods for Astronomy, Cambridge University Press (2012). This is a practice-oriented book, with many examples, pictures, and intensive use of R. During the semester we managed, more or less, to work through the first four chapters of the book.

The mathematics behind the methods discussed in the book is explained in detail in John A. Rice, Mathematical Statistics and Data Analysis, 3rd edition (2007), Duxbury. Roughly speaking, chapters 1 to 5 of Rice's book develop probability theory systematically; chapters 8, 9 and 10 cover the core concepts of statistics. In our course we focus on practice rather than mathematical theory, and on statistics rather than probability. We therefore do not systematically develop the probability theory behind statistics, and we are interested in how to use the mathematical theorems of statistical theory (which are the justification for statistical practice), not in how to prove them.

As well as these two books, I would like to mention the classic J.V. Wall and C.R. Jenkins, Practical Statistics for Astronomers (2003), CUP.

Finally, it is worth knowing that Feigelson and Babu's book is closely based on their famous series of summer schools at Penn State University. The site of the summer school contains a lot of course material (lecture notes, data sets...), and for some of you the lecture notes of the summer school may be more useful than the book, or very useful alongside it. In their turn, those lecture notes were clearly inspired by Wall and Jenkins's classic book.

First Assignment

This file, LMC.csv, contains determinations of the distance to the Large Magellanic Cloud by a large number of different methods, from a paper by Clementini et al. published in the Astronomical Journal (eprint at arXiv:astro-ph/0007471; this is a link to a pdf of the arXiv eprint). The data were taken from Table 6 of the paper and are visually represented in panel (b) of Figure 8 (the last figure) of the article. The data are also briefly discussed in the summer school lecture notes "day2.pdf", Loading, summarizing and visualizing data.

Your task: estimate the distance to the LMC and quantify the accuracy of your estimate. Compare the results obtained using two different models. Model 1, Gaussian: each measured distance is equal to the true unknown distance plus an independent Gaussian error with mean zero and variance sigma squared, where you may take sigma (different for each method) equal to the values ("Error") given by Clementini et al. Model 2, Laplace (also known as double exponential): the probability density of the error is exp(-|error|/sigma) / (2 sigma), where again you may take sigma equal to the "Error" values given by Clementini et al.

Some points you may wish to pay attention to: is there a difference between the distances based on "population 1" and on "population 2"? Is it really reasonable to think of the errors of the different distance measurements as independent of one another? Note in this connection that there are some subgroups of measured distances in which different methods are applied to the same subpopulation of stars (see the last column of the spreadsheet data and Clementini et al.). Finally, which of the two statistical models, if either, do you think is the more appropriate for this analysis?
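
If it helps to get started, here is a hedged sketch of the two log-likelihoods; the column names Distance and Error are assumptions (check names(lmc) and adjust to the actual file). For the Gaussian model the maximizer is simply the inverse-variance weighted mean, which is a useful check.

lmc <- read.csv("LMC.csv")
d <- lmc$Distance    # measured distances (column name assumed)
s <- lmc$Error       # quoted errors, treated as known sigmas (column name assumed)

negloglik.gauss   <- function(mu) -sum(dnorm(d, mean = mu, sd = s, log = TRUE))
negloglik.laplace <- function(mu)  sum(abs(d - mu)/s + log(2*s))

fit.gauss   <- optimize(negloglik.gauss,   interval = range(d))
fit.laplace <- optimize(negloglik.laplace, interval = range(d))
c(gaussian = fit.gauss$minimum, laplace = fit.laplace$minimum)    # the two point estimates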

Second Assignment

The assignment concerns the distribution of the luminosity of globular clusters. Here are three articles which provide background on the subject: van den Bergh (1985), Secker (1992), Secker and Harris (1993). Data from these papers are analysed in the Penn State summer school lecture notes on statistical inference: Inf1.pdf, Inf2.pdf. Your assignment will be based on the data of van den Bergh's Table 1. These data are used as an illustration in Inf1.pdf; data from Secker (1992) are used as an illustration in Inf2.pdf.

The data in van den Bergh's Table 1 can be split into three separate data-sets corresponding to three different populations of globular clusters. These three populations correspond to low, medium, and high values of B-V. I am going to ask you to fit t-distributions to these three data-sets, and to describe the differences between the three populations as succinctly as possible. Note that the second set of lecture notes just mentioned, Inf2.pdf, does this estimation job on a sample from one population of globular clusters, using data from Secker (1992). You can find out about the t-distribution in Inf2.pdf and in Secker (1992). It is one of the standard distributions in R. The t-distribution has a shape, a scale and a location parameter; the shape parameter allows it to interpolate between the Gaussian at one extreme and the Cauchy at the other.

If you fit the parameters of the distribution to two data-sets by maximum likelihood, you can estimate the standard deviations of your estimates by looking at minus the inverse of the 3x3 matrix of second partial derivatives of the log-likelihood at its maximizer. Large-sample theory of maximum-likelihood estimators tells us that the maximum-likelihood estimates are approximately Gaussian distributed around the truth, with a covariance matrix which can be estimated by the negative inverse of this matrix of second derivatives. In particular, the three estimated variances are on the diagonal. Now you can compare two estimated parameters, one based on each sample, using the fact that the difference of two independent normally distributed variables is also normal, with variance equal to the sum of the variances of the two separate variables.
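
As a hedged illustration of this comparison, with made-up numbers (say the estimated location is 16.80 with standard error 0.10 in one sample and 17.10 with standard error 0.12 in the other):

est1 <- 16.80; se1 <- 0.10    # invented values, for illustration only
est2 <- 17.10; se2 <- 0.12
z <- (est1 - est2) / sqrt(se1^2 + se2^2)    # approximately standard normal if the true values are equal
2 * pnorm(-abs(z))                          # approximate two-sided p-value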

Because each sample in question can be thought of as a sample of i.i.d. (independent and identically distributed) observations from one distribution, you can use a standard function in R for fitting a given distribution, by maximum likelihood, to an i.i.d. sample. Load the package MASS with the command library(MASS) and take a look at the function fitdistr(); so of course the first thing you should do is help(fitdistr).
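
A minimal sketch of such a fit, on simulated data standing in for one of the magnitude samples (the numbers are purely illustrative; for the "t" family fitdistr names the parameters m, s and df):

library(MASS)
set.seed(2)
x <- 20 + 0.5 * rt(200, df = 6)    # simulated stand-in for one sample of magnitudes
fit <- fitdistr(x, "t")            # maximum-likelihood fit of location m, scale s and shape df
fit$estimate                       # the three parameter estimates
fit$sd                             # their approximate standard errors
fit$vcov                           # the full 3x3 estimated covariance matrix
fit$loglik                         # the maximized log-likelihood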

Three files contain the data "V" (magnitudes) from three classes of M31 Globular Clusters: those with B - V smaller than 0.70, those with B - V between 0.70 and 1.00, and those with B - V larger than 1.00. I call those three files:
vdb_table_1_low.csv,
vdb_table_1_med.csv,
vdb_table_1_high.csv
They contain respectively 65, 188 and 118 observations. Read these three data-sets into R, calling them for instance low, med, high. Then the command boxplot(low,med,high) gives a quick visual summary of the three sets of magnitudes. You will see that the sample with low B - V is clearly different from the other two, but that it is not so clear if there is a strong difference between those with medium and high B - V. Van den Bergh gives a rather sensible reason why the globular clusters with low B - V are actually a biased sample from the "true" population of such clusters: clusters which are very faint with respect to the background are missed altogether. This is an important phenomenon called "random truncation" which we could not deal with in this introductory course, but which is a severe complication to many astrostatistical analyses.
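
Returning to the practical step just described, here is a minimal sketch of reading in the data and making the boxplot, assuming the magnitude column in each file is called V (an assumption; check with names()):

low  <- read.csv("vdb_table_1_low.csv")$V
med  <- read.csv("vdb_table_1_med.csv")$V
high <- read.csv("vdb_table_1_high.csv")$V
boxplot(low, med, high, names = c("low B-V", "medium B-V", "high B-V"))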

Notice that the function fitdistr() in the MASS library in R knows about the t distribution. In fact, help(fitdistr) even gives an example fitting t-distributions to simulated data in two different scenarios. To explain these scenarios I must say a little more about what the t-distribution is. The (standard) t-distribution with k degrees of freedom is the distribution of

    T = Z / sqrt( (Z_1^2 + ... + Z_k^2) / k )

where Z, Z_1, ..., Z_k are independent standard Gaussians.

At least, that defines the distribution for integer values of k. However the corresponding probability density can be very sensibly defined also for non-integer k, so from now on k is a real positive parameter. Notice that as k becomes larger and larger, the denominator becomes more and more constant and equal to 1 (by the law of large numbers). Therefore for very large k, the distribution is close to a standard Gaussian. For k=1 the distribution is the same as the distribution of the ratio of two independent standard Gaussians. This is one way to define the so-called Cauchy distribution, which has very heavy tails (the probability density goes to zero like 1 divided by x squared).
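
A quick, purely illustrative way to see both limits in R is to plot the densities on top of each other:

curve(dnorm(x), -5, 5)                               # standard Gaussian
curve(dt(x, df = 30), -5, 5, lty = 2, add = TRUE)    # t with large df: almost Gaussian
curve(dt(x, df = 1),  -5, 5, lty = 3, add = TRUE)    # t with df = 1: the Cauchy, heavy tails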

k is often called the "degrees of freedom", and from now on I will denote it by "df" instead of by "k".

In our situation we want to introduce two more parameters, which we will call a location and a scale parameter. Let's denote them by mu and sigma respectively. Then by definition, the three parameter t-distribution, with parameters mu, sigma and df, is the probability distribution of mu + sigma * T, where mu is a real number (the location parameter), sigma is a positive real number (the scale parameter), and T has a standard t distribution with df degrees of freedom (the shape parameter).
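
As a hedged sanity check on this definition, the density of mu + sigma * T follows from the standard dt() in R by the usual change-of-variables rule; the parameter values below are arbitrary:

dt3 <- function(x, mu, sigma, df) dt((x - mu)/sigma, df = df)/sigma    # three-parameter t density
integrate(dt3, -Inf, Inf, mu = 24.7, sigma = 0.8, df = 5)              # should be very close to 1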

Your task:
(1) fit three-parameter (location, scale, shape) t-distributions to the three samples.
(2) Discuss how well these distributions fit the data; and
(3) compare the samples by comparing corresponding parameter values.

I would recommend the use of QQ-plots (empirical data versus a theoretical fitted t-distribution) to evaluate "by eye" how well the model is fitting.
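
A hedged sketch of such a QQ-plot, repeating the simulated fit from above so that the snippet is self-contained (replace x by one of your real samples):

library(MASS)
set.seed(2)
x   <- 20 + 0.5 * rt(200, df = 6)    # stand-in for one sample of magnitudes
fit <- fitdistr(x, "t")
p     <- ppoints(length(x))                                                      # plotting positions
theor <- fit$estimate["m"] + fit$estimate["s"] * qt(p, df = fit$estimate["df"])  # fitted t quantiles
qqplot(theor, x, xlab = "fitted t quantiles", ylab = "observed magnitudes")
abline(0, 1)                                                                     # reference line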

Do help(fitdistr) to see examples of the estimation part, and take note of the various values returned by the function fitdistr.

There are two main ways to compare two distributions with some parameters the same, some different. One way is to simply estimate the parameters of the two models on the two datasets completely separately, compute estimated standard errors, and compare the difference between parameter values to the corresponding standard error of the difference (equals the square root of the sum of the variances). Another way is to fit to the data of two samples two different models: one in which all parameters are free, and another in which certain of the parameters are fixed equal to one another. Naturally, the maximized log likelihood of all the data is larger when all parameters are free than when some of them are constrained to be equal to one another. By standard theory of maximum likelihood estimators and generalized likelihood ratio tests, under the null-hypothesis that the constrained model is actually true, the difference between the two maximized log likelihoods, times two, is approximately distributed as a chi-square random variable with degrees of freedom equal to the number of independent constraints involved in going from the general situation to the constrained model. And the chi-square distribution with "df" degrees of freedom is the distribution of the sum of that many squares of independent standard Gaussians.

The second way is more complicated but likely to be more accurate (i.e., the chi-square approximation will already be reliable at smaller sample sizes than the standard-Gaussian approximation of the test statistic obtained from the first way).

The intuition behind that is that the more free parameters we have over which to optimize twice the log likelihood, even when these parameters are superfluous, the larger the likelihood can be purely by chance; it turns out that the excess obtainable by merely capitalizing on random variation has approximately this chi-squared distribution. On the other hand, if the extra parameters are not superfluous, but really are needed, then the difference between twice maximized log likelihoods tends to be a whole lot larger, since the constrained model actually fits the data a whole lot worse than the unconstrained model.
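
For the second way, here is a hedged sketch for two samples and a single constraint (a common location), using a hand-rolled three-parameter t density; x1 and x2 are simulated stand-ins for two of the magnitude samples and the starting values are rough guesses:

dt3 <- function(x, mu, sigma, df) dt((x - mu)/sigma, df = df)/sigma    # three-parameter t density

set.seed(3)
x1 <- 16.8 + 0.6 * rt(150, df = 6)    # simulated stand-ins for two magnitude samples
x2 <- 17.1 + 0.7 * rt(150, df = 6)

nll.free <- function(p)    # p = (mu1, sigma1, df1, mu2, sigma2, df2): all parameters free
  -sum(log(dt3(x1, p[1], p[2], p[3]))) - sum(log(dt3(x2, p[4], p[5], p[6])))
nll.eq <- function(p)      # p = (mu, sigma1, df1, sigma2, df2): locations constrained to be equal
  -sum(log(dt3(x1, p[1], p[2], p[3]))) - sum(log(dt3(x2, p[1], p[4], p[5])))

fit.free <- optim(c(median(x1), mad(x1), 5, median(x2), mad(x2), 5), nll.free,
                  method = "L-BFGS-B", lower = c(-Inf, 1e-3, 0.5, -Inf, 1e-3, 0.5))
fit.eq   <- optim(c(median(c(x1, x2)), mad(x1), 5, mad(x2), 5), nll.eq,
                  method = "L-BFGS-B", lower = c(-Inf, 1e-3, 0.5, 1e-3, 0.5))

lrt <- 2 * (fit.eq$value - fit.free$value)    # twice the difference of maximized log-likelihoods
pchisq(lrt, df = 1, lower.tail = FALSE)       # approximate p-value for one constraint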

Exam ("tentamen")

Exam material: Feigelson and Babu, Chapters 1 through 4.

If you prefer alternative sources: this corresponds to a selection of topics (roughly speaking, the main points and the practice, as opposed to mathematical derivations, proofs and details) from Rice, Chapters 1 through 5 (probability theory = Chapter 2 of Feigelson and Babu), plus Rice, Chapters 8 through 10 (statistics, a.k.a. "statistical inference", Chapters 3 and 4 of Feigelson and Babu). We concentrated on practice, whereas Rice also provides the mathematical underpinning. In particular, we only made a very brief acquaintance with probability theory.
In terms of the Penn State summer school: the lecture notes on Probability and on Inference (days 1 and 2), plus some material on the bootstrap (day 3) and on Bayesian inference (day 5). (The book by F&B is based on the material of the summer school.)


Here are links to last year's (2011) exam and resit:
astrostat tentamen.pdf
astrostat hertentamen.pdf

In drawing up the exam questions I will take my inspiration mainly from the "werkcollege" exercises of this past autumn, and from last year's exam and resit.

If you are looking for even more practice material, you can look at old exams of the two mathematics courses Kansrekening en Statistiek 1 and Kansrekening en Statistiek 2 (first and second year respectively), out of which the present course has grown, and in which Rice's book is used. The newer names of these two mathematics courses are Inleiding Kansrekening and Inleiding Statistiek. You can consult old mathematics exams in probability and statistics (for you it is now primarily the statistics that matters) on the two web pages with the old KS1 and KS2 exams, and those for Inleiding Kansrekening and Inleiding Statistiek.

Of course there are a number of topics in the mathematics courses that did not come up in yours (for example, Markov chains), and there was much more mathematical detail and background. But as far as the exams are concerned, insofar as they deal with the common (basic) topics, I see little difference in the level and content I am aiming for. I will do my best to focus on the topics that I presented as central during the lectures, and to keep the mathematics reasonably elementary. Concentrate on the easier problems in the old KS1 and KS2 exams (nowadays: Inleiding Kansrekening, Inleiding Statistiek).

The two mathematics courses focus on probability theory and on statistics respectively, probability theory being an indispensable foundation for statistics. Through the two computer assignments you have had to work harder on practical statistics than is usual for the mathematics students. We skipped much of the technical background material on probability theory.

The exam will be open book. You may bring photocopies and notes.