Statistics Without Tears

As someone deeply engaged in making complex concepts accessible, I recently delved into ‘Statistics without Tears’ by Derek Rowntree. In my blog, I share insights from this classic book, which uniquely demystifies statistics using words and diagrams instead of formulas and equations. It’s an excellent resource for those seeking to understand the ideas behind statistics, minus the intimidating mathematical computations.

Well, here are my notes:

1. Statistical Inquiry

C. Descriptive and Inferential Statistics

When collecting data, one is usually unable to test the entire target population. For example, when looking at the lifetime of lightbulbs, one cannot test the whole population (all lightbulbs in existence), so one needs to test a sample.

A sample is chosen from within the population to represent the population’s demographics.

Descriptive statistics is a method used to summarize or describe our observations. Descriptive statistics is concerned with the sample. For example, saying that 50% of the surveyed high schoolers rebelled against their parents is descriptive statistics.

Inferential statistics is the method of deriving conclusions about the population from the sample. In other words, it uses descriptive observations to make estimates or predictions. For example, saying that half of all high schoolers would rebel against their parents is inferential statistics.

Inferential statistics may not be the safest way to obtain information, but it is the most cost-effective way to get information with an acceptable margin of error.

D. Collecting A Sample

A sample is misleading unless it is representative of the population. For example, if you interview 100 people on a popular street, can you make inferences about the whole city? You can, but you shouldn’t, because the sample may not represent the population. You would only pick the kind of people who walk on that street at that specific time and who volunteered to be interviewed.

Then you will have introduced an imbalance, or bias, into your sample. It won’t be random and representative.

Even if you make it totally random, it still may not represent the population. For example, if you interview people about restrooms and pick interviewees completely at random, you may end up with more male students than female students, unrepresentative of the population.

So picking people at random without bias is not enough. You should pick them so that they represent the groups within the population. We call this a stratified random sample. That is, we identify in advance the different groups within the population and choose our sample randomly from within each of these groups.
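The book stays formula-free, but a minimal Python sketch may help make the idea concrete. The population and its grouping here are entirely made up:

```python
import random

# Hypothetical population: (person_id, group) pairs with a rough 60/40 split.
population = [(i, "male" if i % 5 < 3 else "female") for i in range(10_000)]

def stratified_sample(population, key, n):
    """Draw roughly n members, allocating to each stratum by its share."""
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(population))
        sample.extend(random.sample(members, share))
    return sample  # rounding can make the total differ from n by one or two

sample = stratified_sample(population, key=lambda m: m[1], n=100)
```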

2. Describing our Sample

What do we do with the members of a sample once we have got them? As far as statistics is concerned, whatever we do with them, we are going to produce a set of numbers relating to whichever of their common characteristics we are interested in.

A. Statistical Variables

  • Variables: In looking at the members of a sample, we ask how they vary among themselves on one or more characteristics. These characteristics are called variables. There are 2 main types of variables: category variables and quantity variables.
    • Category: Refers to any variable that involves putting individuals into categories.
      • Nominal: If the categories have names with no order among them, we call the variable nominal. For example, a bicycle brand is a nominal variable.
      • Ordinal: If the categories can be differentiated by adjectives like better, higher, or faster, we can arrange them in order, so this kind of variable is called ordinal. It has 2 types: ordered categories and ranks.
        • Ordered Categories: Labelling members with ordered names such as good, moderate, and best is ordinal, yet it says nothing about how individual members rank against one another.
        • Ranks: Indicate how each individual ranks relative to the others. Ranking them 1 through 10, for example, gives a ranked variable.
    • Quantity: All variables where what we are looking for is a numerical value, a quantity, we’ll call quantity variables.
      • Discrete (Counting): One in which the possible values are clearly separated from one another. A family has 2 or 3 members, not 2.5.
      • Continuous (Measuring): With continuous variables, on the other hand, whatever measurement you give, there are always more measurements in between. For example, a height recorded as 150 cm could really be 150.0001 cm.

Quantity variables can be turned into category variables to simplify data. For example, people under 150 cm can be called short, and those above 180 cm can be considered tall. This results in a loss of information, though, so one needs to weigh the pros and cons carefully before doing it.
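For instance, a rough sketch of such a conversion, using the made-up cut-offs above:

```python
# Turning a quantity variable (height in cm) into a category variable.
heights = [142, 155, 168, 173, 181, 190]

def categorize(height_cm):
    if height_cm < 150:
        return "short"
    elif height_cm > 180:
        return "tall"
    return "medium"

categories = [categorize(h) for h in heights]
# ['short', 'medium', 'medium', 'medium', 'tall', 'tall']
# Note the information loss: 155 and 173 now look identical.
```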

B. Error, Accuracy, and Approximations

There is no way to get data that is 100% accurate, so error is inevitable. It is therefore sensible to remember that our observations are merely approximations.

3. Summarizing Our Data

A. Tables and Diagrams

A set of raw data can be obscure. The first step is to rearrange the data so it makes more sense. Grouping data points can also help to emphasize any pattern within the distribution. And diagrams give a better idea of the shape of the distribution than figures alone do.

We also have to look for figures like the mode, mean, or median that quantify important features of the distribution. In fact, to describe a distribution statistically, or to use it in making inferences or predictions, we must have such figures. Now let’s look at central tendency (averages) and dispersion.

B. Central Tendency (Averages)

There are 3 measures of central tendency: mode, median, and mean. Which one is used will depend on the type of variable.

The mode is most likely to be used when our data concern categories; it will not be much used with quantity variables. There the average used will probably be the mean or, occasionally, the median.

The median is used in distributions where a few extreme values, called outliers, are observed. Their effect would be to distort the mean, pulling it too far away from the center of the distribution.
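A quick illustration with Python’s standard library, on a made-up sample where one outlier drags the mean:

```python
from statistics import mean, median, mode

incomes = [20, 22, 22, 25, 28, 30, 210]  # in thousands; 210 is the outlier

print(mode(incomes))    # 22   -> the most frequent value
print(median(incomes))  # 25   -> the middle value, barely moved by the outlier
print(mean(incomes))    # 51.0 -> pulled far above most of the observations
```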

C. Measures of Dispersion

There are 3 measures of dispersion: the range, the inter-quartile range, and the standard deviation.

Range: A rough-and-ready measure of dispersion. Its great virtue is that it is so easy to calculate and fairly apparent. Range = MaxValue − MinValue.

Inter-Quartile Range: Sometimes the range doesn’t mean much because the data may be too dispersed. So you order the values and divide them into 4 quarters, with the median at the centre; the distance between the first and third quartile points shows how the middle half of the values is spread.

Standard Deviation: Like the mean, the standard deviation takes all the observed values into account. If there were no dispersion at all in a distribution, all the observed values would be the same, the mean would equal this repeated value, and no observed value would deviate or differ from the mean.

But with dispersion, the observed values do deviate from the mean. Quoting the standard deviation of a distribution is a way of indicating a kind of average amount by which all the values deviate from the mean. The formula: square each value’s deviation from the mean, take the average of these squared deviations, and then take the square root.
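All three measures, sketched on the same made-up sample as above:

```python
import statistics

values = [20, 22, 22, 25, 28, 30, 210]

value_range = max(values) - min(values)        # 190, inflated by the outlier

q1, q2, q3 = statistics.quantiles(values, n=4) # the three quartile points
iqr = q3 - q1                                  # spread of the middle half

mu = statistics.mean(values)
sd = (sum((x - mu) ** 2 for x in values) / len(values)) ** 0.5  # ~65
# statistics.pstdev(values) computes the same thing in one call
```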

4. Shape of A Distribution

Where biological variation is concerned, results tend to be symmetric. But not all statistical results are symmetrical.

A. Skewed Distribution

Not every distribution has to be symmetrical. Some distributions can be skewed toward one end.

A distribution can be positively or negatively skewed.

Positively skewed: the longer tail is on the right (positive) side.

Negatively skewed: the longer tail is on the left (negative) side.

A symmetrical distribution has all three averages, mode, median, and mean, in the middle. But if a distribution skews toward one end, the mean will have been pulled out in the direction of the skew, and the median will lie between the mode and the mean.

(Figure: a symmetrical distribution, with all three averages together in the middle.)

(Figure: a skewed distribution, with the mean pulled out toward the tail.)

B. Introducing the Normal Distribution

If we join up the uppermost dots in a dot diagram, or the tops of the bars in a histogram, with a smooth curved line, we get what is called the curve of the distribution.

If there is enough data, the curve of a normal distribution tends to look like a bell. The standard deviation decides the shape of the bell: a lower SD gives a taller, narrower bell, and a higher SD a flatter, wider one.

If the mean is 80 and the SD is 6, then 86 is 1 SD above the mean and 74 is 1 SD below it.

C. Proportions Under the Normal Curve

The normal curve is a theoretical, fully symmetrical curve, and it tends to show itself especially in biological distributions. The range from 1 SD below to 1 SD above the mean contains about 68% of the data; from 2 SD below to 2 SD above, about 95%.
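These proportions can be checked against the theoretical curve with Python’s built-in NormalDist:

```python
from statistics import NormalDist

z = NormalDist()  # the standard normal curve: mean 0, SD 1
for k in (1, 2, 3):
    share = z.cdf(k) - z.cdf(-k)  # area within k SD of the mean
    print(f"within {k} SD: {share:.1%}")
# within 1 SD: 68.3%, within 2 SD: 95.4%, within 3 SD: 99.7%
```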

D. Comparing Values

Now let’s try to get information out of this curve. Let’s say Linda got 80 on her law exam and Carl got 90 on his math exam. Which student did better among their peers? The mean on Linda’s exam was 60 and the SD was 5 points, so Linda did 4 SD better than her class’s mean. The mean on Carl’s exam was 70 and the SD was 10, so he did 2 SD better than his class. We can therefore say Linda did far better than Carl when compared to her peers.
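This count of SDs above or below the mean is commonly called a z-score; a tiny sketch:

```python
def z_score(score, mean, sd):
    """How many SDs a score lies above (or below) its group's mean."""
    return (score - mean) / sd

linda = z_score(80, mean=60, sd=5)   # 4.0
carl = z_score(90, mean=70, sd=10)   # 2.0
# Relative to her peers, Linda's 80 is the stronger result.
```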

5. From Sample to Population

So, now that we understand how the mean and standard deviation capture some part of the raw data, let’s talk about how we can make predictions about the population the sample is supposed to represent.

A. Estimates and Inferences

The figures we used to describe our sample can be seen as estimates for the population. This is underlined by using the word statistic for a figure derived from the sample, while the word parameter is used for the true mean, mode, etc. of the population.

So a statistic is used to estimate a parameter.

This process is called statistical inference.

We can make inferences based on very little information; however, we need to update our inferences when faced with more observations. With increasing information, more and more of the possible inferences can be ruled out, and our inferences get closer to the truth. But we can never end up with certain knowledge about the population.

The sample mean, or S-mean, is the mean derived from the sample, while the parameter mean, or population mean (P-mean), is the real mean.

B. Logic of Sampling

The means of samples are variable. If you choose different samples, the statistics will be different, especially in smaller sample groups. This variability from one sample to another is known as sampling variation.

C. A Distribution of Sample-Means

So if you take many different samples, calculate their means, and put them on a graph, you can see that they will shape something like a normal curve. Samples whose observed values are mostly close to the P-mean are likely to be more numerous than samples with a lot of values very distant from the P-mean. Thus, in turn, samples whose S-means are similar to the population mean are also likely to be more frequent than samples whose S-means are very different from it.

In the case of the sampling distribution of S-means, the mean will be the same as that of the population, but the standard deviation will be smaller than that of the population.

The standard deviation of a sampling distribution is called the standard error.

We can say that the standard error of the mean depends on the size of the sample and its standard deviation: SE = SD ÷ √n, where n is the sample size.
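A small simulation, with a made-up population, showing that the spread of sample means comes out near SD ÷ √n:

```python
import random
import statistics

random.seed(1)

# A made-up population; in practice we could never observe all of it.
population = [random.gauss(100, 15) for _ in range(100_000)]

n = 25
s_means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

print(statistics.stdev(s_means))               # observed standard error, ~3.0
print(statistics.stdev(population) / n ** 0.5) # SD / sqrt(n), also ~3.0
```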

D. Estimating the Population Mean

So how can we estimate the population mean from one sample mean? There is a 68% probability that the P-mean lies within 1 SE of the S-mean. Thus we can say that we are 68% confident that the mean mark of the total population of students lies between S-mean ± 1 SE. This range is called a confidence interval. If you widen it to ± 2 SE you get 95% confidence, and ± 3 SE gives 99.7%.
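A sketch of the calculation on a made-up sample of marks:

```python
import statistics

marks = [62, 58, 71, 66, 60, 64, 69, 55, 63, 67]  # made-up sample

s_mean = statistics.mean(marks)
se = statistics.stdev(marks) / len(marks) ** 0.5  # standard error of the mean

ci_68 = (s_mean - se, s_mean + se)          # 68% confidence interval
ci_95 = (s_mean - 2 * se, s_mean + 2 * se)  # 95% confidence interval
```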

E. Estimating Other Parameters

One parameter we might particularly wish to know is a proportion. This will be the case when the characteristic we are interested in is a category variable.

But the distribution of such percentages over many samples would be approximately normal and centered around the true proportion. We can thus calculate the standard error of a proportion.

It is calculated from the sample by multiplying the proportion we are interested in by the proportion remaining, dividing by the number of cases in the sample, and taking the square root: SE = √(p × (1 − p) ÷ n).

In general, if we want to estimate the mean or a proportion in the population with a certain precision and a certain confidence level, we can work out how large a sample we’d need in order to reduce the standard error to the necessary level.
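Both calculations sketched in Python; the 0.4 proportion and the 0.02 target SE are invented for illustration:

```python
def se_proportion(p, n):
    """Standard error of a sample proportion p based on n cases."""
    return (p * (1 - p) / n) ** 0.5

def required_n(p, target_se):
    """Invert the formula above: n = p(1 - p) / SE^2."""
    return p * (1 - p) / target_se ** 2

print(se_proportion(0.4, 600))  # 0.02, i.e. about 2 percentage points
print(required_n(0.4, 0.02))    # 600 cases needed for a 2-point SE
```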

Whatever our sample and the statistics we draw from that sample, we can never use them to say exactly what the mean (or proportion, or any other parameter) of the population really is.

6. Comparing Samples

Until now we have talked about how to work with one sample group from one population. But statistics commonly compares sample groups from different populations. Now let’s talk about comparing samples.

A. From the Same or Different Populations

Knowing as we do about the effects of chance on sampling variation, we don’t expect two random samples drawn from the same population to have exactly the same mean. But we also know that we are more likely to draw a pair of samples with means fairly close together than a pair whose means are far apart.

B. Significance Testing

The main objective is to test two samples against each other, for example the difference in blood pressure between men and women: is it the same or not?

If we take a sample from each group, test them, and see a difference between the samples, how likely is it that the difference was caused by sampling error (standard error)?

So let’s assume we take endless pairs of samples from one population and enter the differences between their means into a graph. We have to start from the null hypothesis, which means assuming both groups have the same mean blood pressure. If that were the case, the differences would form a normal (bell-shaped) distribution centered on zero.

Now let’s look at the difference we found in our samples. Suppose the difference is 10 mm of blood pressure and the SD in each sample is 11.3 mm, which (at the sample size used) gives a standard error of about 1.6 mm for each sample mean. Then we can calculate the SE of the difference: the square root of two times 1.6 squared, which is about 2.26 mm. So if there were no real difference between male and female blood pressure, the SD of the distribution of sample differences would be 2.26 mm.

So if our pair of samples shows a difference of around 10 mm, it is very unlikely that the null hypothesis is correct in this case: with a standard deviation of 2.26 mm, the chance of a difference that far from 0 is less than 0.001%. That is evidence enough that the true difference is not 0, and therefore that male and female mean blood pressures are different.
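The same arithmetic as a sketch, using the example’s numbers:

```python
from statistics import NormalDist

diff = 10.0    # observed difference between the two sample means (mm)
se_each = 1.6  # standard error of each sample mean (mm)

se_diff = (se_each ** 2 + se_each ** 2) ** 0.5  # ~2.26 mm
z = diff / se_diff                              # ~4.4 standard errors from 0

# Two-tailed probability of a difference this large under the null hypothesis.
p = 2 * (1 - NormalDist().cdf(z))
print(se_diff, z, p)  # p comes out around 0.00001, i.e. about 0.001%
```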

C. Significance of Significance

Let’s talk about how certain we can be before rejecting the null hypothesis. Imagine a court that only accepts proof as hard as rock. What will be the outcome? Lots of guilty people not getting punished. And what if it were too lenient about the strength of proof? This time lots of innocent people would get condemned. So there must be a balance between the two.

It is the same with judging significance on our samples.

Significance in statistics concerns how far our results can be relied on. We have two types of error, Type 1 and Type 2. Type 1 is when you reject the null hypothesis when it is true, and Type 2 is when you accept the null hypothesis when it is false.

As with the court cases, the more we reduce our risk of making a Type 1 error (by demanding more significant differences), the more we increase our risk of making a Type 2 error.

So you look at the significance level. If a difference is significant at the 1% level, there is only a 1% chance that sampling error alone produced it, and we can be highly confident in our data. At the 5% level it is less reliable, but still good. Statisticians call the 5% level significant and the 1% level highly significant.

It may be a hard concept to grasp, but the more confident we wish to be that we are not going to claim a real difference where there is none (by demanding the 1% level), the bigger the difference we’ll demand between our samples. But the more we decrease our probability of a Type 1 error, the bigger the differences we’ll have chosen to disbelieve as indicators of real population differences. And the bigger the differences we ignore, the more likely it is that some of them do represent distinct populations, even though we have decided to believe otherwise.
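A small simulation may make the trade-off concrete; the sample size, SD, and true difference below are all invented for illustration:

```python
import random
from statistics import mean

random.seed(2)
z_crit = {0.05: 1.96, 0.01: 2.58}  # two-tailed cut-offs for each level

def rejection_rate(true_diff, alpha, n=30, sd=1.0, trials=2_000):
    """Share of simulated experiments that reject the null at this level."""
    se_diff = (2 * sd ** 2 / n) ** 0.5
    hits = 0
    for _ in range(trials):
        a = mean(random.gauss(0, sd) for _ in range(n))
        b = mean(random.gauss(true_diff, sd) for _ in range(n))
        if abs(b - a) / se_diff > z_crit[alpha]:
            hits += 1
    return hits / trials

# No real difference: every rejection is a Type 1 error (~5% vs ~1%).
print(rejection_rate(0.0, 0.05), rejection_rate(0.0, 0.01))
# A real difference of half an SD: the stricter 1% level misses it more
# often, i.e. makes more Type 2 errors.
print(rejection_rate(0.5, 0.05), rejection_rate(0.5, 0.01))
```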

D. Comparing Dispersions

We have mostly talked about differences in means up to now. But we also need to check dispersion: our test on a difference in means only makes sense if the difference in dispersion is relatively small.

So we need to check the difference in SD between the samples. If it is too high, the dispersions are different.

When two samples differ in both mean and SD, it may be sensible to compare the dispersions first. If the dispersions turn out to be significantly different, then it may be safer to assert that the samples probably do come from different populations and leave it at that.
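The book doesn’t name a test for this; one common choice is the variance-ratio (F) test, sketched here on made-up data:

```python
from statistics import variance
from scipy.stats import f

a = [118, 125, 130, 122, 128, 135, 120, 127]  # tightly clustered sample
b = [110, 140, 150, 105, 132, 145, 100, 138]  # widely dispersed sample

ratio = variance(b) / variance(a)  # put the larger variance on top
p = 2 * f.sf(ratio, dfn=len(b) - 1, dfd=len(a) - 1)  # two-tailed p-value
print(ratio, p)  # a small p suggests the dispersions really do differ
```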

E. Non-parametric Methods

Up until now we have always assumed the distribution is normal. But that’s not always the case, especially not with rank and category samples. There are other methods for such samples, like the Mann-Whitney U test, but the book does not get into those.
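For completeness, here is what running the Mann-Whitney U test looks like with SciPy, on made-up rank-style data:

```python
from scipy.stats import mannwhitneyu

a = [3, 5, 7, 7, 8, 9]  # e.g. ratings from one group
b = [1, 2, 2, 4, 5, 6]  # ratings from another group

result = mannwhitneyu(a, b, alternative="two-sided")
print(result.statistic, result.pvalue)  # a small p suggests the groups differ
```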

7. Further Matters of Significance