What is Statistics?

Statistics is the study of collecting, analyzing, interpreting and presenting data. It can be used to infer trends and draw conclusions about large populations by analyzing samples taken from the population. There are two main statistical methods in data analysis:

  1. Descriptive statistics: Summarize data from a sample using measures such as mean and standard deviation.
  2. Inferential statistics: Draw conclusions about the population from the data in a sample.

Together, they are used in data science to make useful inferences and predictions in real-world situations.

Basic Terminology

Before moving on, let us familiarize ourselves with often-used terms in statistics.

  • Population: The entire set of data that we are studying and drawing conclusions about.
  • Sample: A subset of the population.

For instance, let’s say we are trying to find out the mean height of students in a school. Instead of measuring every student’s height, which might take a lot of time and effort, we randomly select 50 students and calculate their mean height (descriptive statistics), then use it to infer the mean height of all students in the school (inferential statistics). In this case, the population is all the students in the school and the sample is the 50 randomly chosen students.
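The school example can be sketched in a few lines of Python. This is only an illustration: the population of 1,000 students and their heights are synthetic values generated for the sketch, not real measurements, and the names `population` and `sample` are chosen here for readability.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: synthetic heights (in metres) for 1,000 students.
population = [random.gauss(1.65, 0.08) for _ in range(1000)]

# Sample: 50 students chosen at random, without replacement.
sample = random.sample(population, 50)

# Descriptive statistics: summarise the sample.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential step: treat the sample mean as an estimate of the population mean.
# Here we can check the estimate only because the population is simulated.
population_mean = statistics.mean(population)

print(f"sample mean:     {sample_mean:.3f} m (sd {sample_sd:.3f} m)")
print(f"population mean: {population_mean:.3f} m")
```

In a real study the population mean is unknown, which is exactly why the sample is used to estimate it.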

Random Variable

A random variable is a quantity whose values depend on outcomes of a random phenomenon. There are two kinds of random variables:

  • Discrete random variables: A random variable that takes on any value from a countable set of distinct values.

Some examples of this would be:

    • Outcome of a coin flip
    • Number of defective bulbs in a box of 20
    • Number of people wearing blue in a class

From these examples, we see that a discrete random variable has a countable number of possible outcomes. The outcome of a coin flip can only be heads (1) or tails (0). We cannot get any other outcome like 0.5 (half head and half tails?). Similarly, we cannot get 2.3 defective bulbs in a box of 20 or 10.9 people wearing blue in a class.

  • Continuous random variables: A random variable that can take any value within a certain interval.

Some examples of this would be:

    • Height of individuals
    • Time taken to finish a test
    • Amount of sugar in a drink

From these examples, we can see that continuous random variables can take any value, including fractions and decimals, within a certain range. The height of a person can be 1.61 m, 1.73 m, 1.88 m, etc. It is not restricted to specific values like 1 m or 2 m. Similarly, the time taken to finish a test and the amount of sugar in a drink can take any value within their respective ranges.
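To make the contrast between the two kinds of random variable concrete, the sketch below simulates one discrete and one continuous quantity from the examples above. The 10% defect rate and the 30 to 60 minute range are made-up values for illustration, and the function names are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Discrete: number of defective bulbs in a box of 20.
# The 10% defect rate is an assumed value, used only for illustration.
def defective_bulbs(box_size=20, defect_rate=0.1):
    # Each bulb is independently defective with probability defect_rate.
    return sum(random.random() < defect_rate for _ in range(box_size))

# Continuous: time (in minutes) taken to finish a test.
# The 30-60 minute range is likewise an assumption.
def finishing_time(low=30.0, high=60.0):
    return random.uniform(low, high)

bulb_counts = [defective_bulbs() for _ in range(5)]
times = [finishing_time() for _ in range(5)]

print(bulb_counts)  # whole numbers between 0 and 20
print(times)        # real numbers between 30 and 60
```

The bulb counts are always integers, whereas the finishing times can land anywhere in the interval.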

Formally, a random variable \(X\) is a function that maps each outcome in a sample space \(\Omega\) to a real number. In the case of a coin flip, the sample space is \(\Omega=\{H,T\}\) and the random variable \(X\) maps its outcomes to \(\{1,0\}\):

\(X(H)=1\) (the outcome heads is mapped to 1)

\(X(T)=0\) (the outcome tails is mapped to 0)

A random variable has a probability distribution \(P\), which specifies the probability of the random variable taking a specific value or set of values. Following the coin flip example and assuming a fair coin, the probability that \(X=1\), denoted \(P(X=1)\), is the probability of the coin landing on heads \(H\), which is 0.5. The probability mass function (the probability distribution of a discrete random variable) of a fair coin flip is:

\(P(X=1)=P(\{H\})=0.5\)

\(P(X=0)=P(\{T\})=0.5\)

The probabilities of all possible values must sum to 1. Here, \(0.5+0.5=1\).
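As a rough sketch, the coin-flip random variable and its probability mass function can be written as two Python dictionaries. The names `X`, `pmf`, and `flips` are chosen for this illustration, and the simulation at the end is only a sanity check that observed frequencies approach the stated probabilities.

```python
import random

# The random variable X maps each outcome in the sample space to a number.
X = {"H": 1, "T": 0}

# PMF of a fair coin flip: maps each value of X to its probability.
pmf = {1: 0.5, 0: 0.5}

# A valid PMF has non-negative probabilities that sum to 1.
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Simulate 10,000 fair flips and compare the frequency of heads with P(X=1).
random.seed(42)  # fixed seed so the sketch is reproducible
flips = [X[random.choice("HT")] for _ in range(10_000)]
freq_heads = sum(flips) / len(flips)

print(f"P(X=1) = {pmf[1]}, observed frequency = {freq_heads:.3f}")
```

With many flips, the observed frequency of heads should be close to, but rarely exactly, 0.5.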