Generally speaking, Statistics is the science of dealing with structured data in order to summarize it and to make inferences, by modeling the data as a deterministic part plus a stochastic part. The deterministic part is (fairly) fixed; the stochastic part varies between individuals (or whichever units the data deals with).
In this course, the focus is on data which may be regarded as a sample from a population. Other measurements, maybe at another time, would have given different data. We set up models to describe the whole population on the basis of the sample data, estimate the parameters of the model, and test whether the form of the model is reasonably close to the data.
A couple of footnotes need to be made at the beginning. The word “Statistics” is of Latin origin; it has to do with the state of the universe we are dealing with. On the one hand there is the science called Statistics; on the other hand there are various statistics, each one being a single number, for example on sports results, the stock exchange, the weather, or traffic on roads or computer networks. These are two separate meanings of the same English word. There is a relationship: we might make a Statistical model of network traffic statistics, and in Statistics we compute many statistics such as averages, medians, slopes and standard deviations.
Another footnote concerns the assertion that anything can be proved with statistics. The assertion deals with statistics in the second meaning, numerical reports. It is not correct, but statistics, like all information, can be twisted.
A well-known simple statistical model describes a linear relationship between the height and the weight of people:
weight = constant1 * height + constant0 + individual variation
If the height is measured in centimeters and the weight in kilograms, constant1 is known to be very roughly 1 kg/cm and constant0 about -100 kg. The individual variation is often assumed to be normally distributed with mean 0.
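The two parts of the model can be made tangible with a small simulation. The following is only a sketch: the constants are the rough values quoted above, while the standard deviation of 10 kg, the function names and the random seed are assumptions introduced for illustration and do not come from the text.

```python
import random
import statistics

# Rough values quoted in the text; illustrative, not measured.
CONSTANT1 = 1.0     # kg per cm
CONSTANT0 = -100.0  # kg
NOISE_SD = 10.0     # kg; ASSUMED, the text gives no standard deviation


def deterministic_weight(height_cm):
    """Deterministic part: the mean weight for a given height."""
    return CONSTANT1 * height_cm + CONSTANT0


def simulated_weight(height_cm, rng):
    """Deterministic part plus normally distributed individual variation."""
    return deterministic_weight(height_cm) + rng.gauss(0.0, NOISE_SD)


rng = random.Random(1)  # fixed seed so the run is repeatable
weights_at_180 = [simulated_weight(180.0, rng) for _ in range(10000)]
mean_at_180 = statistics.mean(weights_at_180)
sd_at_180 = statistics.stdev(weights_at_180)

print(deterministic_weight(180.0))  # 80.0
print(round(mean_at_180, 1), round(sd_at_180, 1))
```

Every simulated person is 180 cm tall, so the deterministic part is the same 80 kg for all of them; the spread in the simulated weights comes entirely from the stochastic part.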
A more concrete form of the model is
weight - average weight = constant1b * ( height - average height )
Here we are stating that the mean weight of individuals 1 cm taller than average is constant1b kg higher than the average weight of the whole population. Later on we shall see that, using a reasonable method to estimate the parameters,
constant1 = constant1b
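The equality of the two slopes can be checked numerically. The sketch below assumes that the "reasonable method" is ordinary least squares, which the text has not yet named, and it uses a handful of invented heights and weights; neither the data nor the variable names come from the course material.

```python
import statistics

# Hypothetical heights (cm) and weights (kg), invented for illustration only.
heights = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0, 190.0]
weights = [58.0, 66.0, 68.0, 77.0, 79.0, 88.0, 89.0]

n = len(heights)
mean_h = statistics.mean(heights)
mean_w = statistics.mean(weights)

# Centered form: least-squares slope of (weight - mean) on (height - mean).
sxy = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
sxx = sum((h - mean_h) ** 2 for h in heights)
constant1b = sxy / sxx

# Uncentered form: solve the least-squares normal equations directly
# for the slope constant1 and the intercept constant0.
sum_h = sum(heights)
sum_w = sum(weights)
sum_hw = sum(h * w for h, w in zip(heights, weights))
sum_hh = sum(h * h for h in heights)
constant1 = (n * sum_hw - sum_h * sum_w) / (n * sum_hh - sum_h ** 2)
constant0 = (sum_w - constant1 * sum_h) / n

print(constant1, constant1b)                    # the two slopes agree
print(constant0, mean_w - constant1 * mean_h)   # the intercept follows from the means
```

Under least squares the fitted line passes through the point (average height, average weight), which is why centering the variables removes the intercept without changing the slope.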
The deterministic part of this model states that there is a linear relationship between the height and weight of persons. The stochastic part states that for persons of equal height, variation about the mean weight given by the linear relationship follows a normal distribution.
Thinking critically about this model, we see that both parts are at best approximations, maybe just wrong. The growth of a newborn boy of 3.5 kg and 53 cm into an adult of 90 kg and 190 cm is three-dimensional (height, width and thickness), so the relationship certainly should not be linear but might include a higher power of the height (maybe height ^ 3). By limiting our population to adults, we may be able to use the linear model as an approximation. The assumption of normal deviations from the average includes an assumption of symmetry in the weight distribution, which evidently is not correct: if the mean weight of persons of a certain height is 80 kg, someone of that height could be twice as heavy, or 160 kg, but an equal deviation below the mean would give the impossible 0 kg. Obviously, variation around the mean can go further up than down. The normal distribution is nevertheless widely used, partly because it is implied in common methods of parameter estimation.
In the model we use height as an independent or explanatory variable; the weight is dependent on the height. The purpose of a model like this one is to explain part of the variation of a variable by the values of another one.
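The arithmetic behind both objections can be worked through with the figures quoted above. This is only a sketch: the pure cubic law weight = c * height ^ 3, with c fixed from the adult figures, is an assumption made for the comparison and is not a model proposed in the text.

```python
# Figures quoted in the text.
newborn_height_cm, newborn_weight_kg = 53.0, 3.5
adult_height_cm, adult_weight_kg = 190.0, 90.0

# Linear model with the rough constants from the text (1 kg/cm and -100 kg),
# applied far outside the adult range.
linear_newborn = 1.0 * newborn_height_cm - 100.0

# ASSUMED pure cubic scaling, weight = c * height ** 3,
# with c chosen so that the adult figures are reproduced exactly.
c = adult_weight_kg / adult_height_cm ** 3
cubic_newborn = c * newborn_height_cm ** 3

print(linear_newborn)           # -47.0
print(round(cubic_newborn, 2))  # positive, but still not the quoted 3.5 kg

# Symmetry check: mirror a 160 kg person around a mean of 80 kg.
mean_weight = 80.0
heavy_person = 160.0
mirror_weight = mean_weight - (heavy_person - mean_weight)
print(mirror_weight)            # 0.0
```

The linear model predicts a negative weight for the newborn, while the cubic law gives roughly 2 kg, which is at least positive but still well short of 3.5 kg. Neither form is right over the whole range of growth, which is the point of restricting the population to adults.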
It is helpful to distinguish four categories of variables leading to different categories of statistical models.
Numeric variables where the numbers have meaning as such. Examples: Height of persons, number of persons living in a certain place, temperature inside a computer, amount of money a person has in his pocket. “Twice” or “a hundred more” has meaning for numeric variables. In computers these variables are often represented by double precision numbers.
Ranks indicate the relative position within a sample: Mr. X was first in the Reykjavik Marathon. Ranks can be represented by positive integers, but often they are treated like numeric variables.
Categorical variables describe a well-defined attribute like color, brand or nationality. There is no simple mathematical structure on the set of possible values. Normally described by one or a few words, categorical variables are often numerically coded, but the numbers have no meaning as such. For example, average zip code makes no sense. Treatment is often limited to counting how many have each value. Those counts are themselves numeric variables.
Binary or logical variables have just two possible values. Examples: Sex, did the student finish the course, did a signal pass properly? Common coding is by 0 or 1, by 1 or 2, or by TRUE or FALSE. Binary variables have some properties of numeric variables: with 0/1 coding, computing the average sex in a group makes perfect sense; it is normally called a proportion.
Variables are often grouped in a data table, with each line representing one individual.
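As an illustration of such a table and of the four kinds of variables, here is a minimal sketch. The four rows, the column names and the values are all invented for the example.

```python
from collections import Counter
import statistics

# Hypothetical data table: one dict per individual, all values invented.
table = [
    {"height_cm": 172.0, "nationality": "Icelandic", "finished_course": True},
    {"height_cm": 181.0, "nationality": "Danish", "finished_course": True},
    {"height_cm": 165.0, "nationality": "Icelandic", "finished_course": False},
    {"height_cm": 190.0, "nationality": "Icelandic", "finished_course": True},
]

# Numeric variable: arithmetic such as the average has meaning.
mean_height = statistics.mean([row["height_cm"] for row in table])

# Rank: relative position within the sample (1 = tallest).
order = sorted(range(len(table)), key=lambda i: -table[i]["height_cm"])
height_rank = [0] * len(table)
for position, i in enumerate(order, start=1):
    height_rank[i] = position

# Categorical variable: treatment is limited to counting each value.
nationality_counts = Counter(row["nationality"] for row in table)

# Binary variable: coded 0/1, the average is the proportion.
finished = [1 if row["finished_course"] else 0 for row in table]
proportion_finished = sum(finished) / len(finished)

print(mean_height)               # 177.0
print(height_rank)               # [3, 2, 4, 1]
print(nationality_counts["Icelandic"], nationality_counts["Danish"])  # 3 1
print(proportion_finished)       # 0.75
```

Note that the counts per nationality and the proportion are themselves numeric variables derived from the categorical and binary columns.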