We rarely know everything. Most problems we solve in real life are based on the generalization of our limited knowledge. Statistics is the formal discipline that helps us make sense of such limited data. It also provides a conceptual baseline for machine learning.

Statistics is too vast a subject to be covered in a blog. But, here I plan to brush through some of the important concepts that are often quoted in a statistical analysis.

Random Experiment

Nothing in this world is truly independent. Every event is influenced by various factors such as its past and its surroundings. Yet, as a theoretical concept, a random experiment is an isolated experiment that is not influenced by anything else. It has a range of possible outcomes and any of these outcomes can show up - independent of any other factor - including its own past. The set of all possible outcomes is called the sample space.

Although a truly random event is only a theoretical concept, it is often a close approximation of an event that is influenced by too many parameters.

Consider for example, the throw of a dice. Its outcome is clearly and precisely defined by the laws of physics. It depends upon how the person held the dice to start with, the way in which he tossed it, the distance from the surface on which it banged, the elasticity of the surface and the material of the dice, the friction and direction of the air around, the friction of the surface where it finally settles . . .

Every time the person throws the dice, the state of his mind is influenced by the previous throws - that in turn influences the action of his hand. Each time the dice is thrown, it undergoes some wear. In that sense, each new throw is influenced by the previous throws.

The outcome is certainly not random in the true sense. But these parameters are too many in number, and their interaction is far too complex for such a small sample space. So the experiment is best approximated as a random experiment.

An important aspect here is that it is not simple to predict the outcome. Even if the dice is biased, we cannot predict a single throw with certainty. The outcome is still random - it is just that some outcomes are more probable than others.
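The point about a biased dice can be tried out with a short simulation. This is only a sketch: the function name `throw_dice`, the number of throws and the particular bias (face 6 three times as likely as any other face) are assumptions made for illustration.

```python
import random
from collections import Counter

def throw_dice(n_throws, weights=None, seed=0):
    """Simulate n_throws of a six-sided dice and return relative frequencies.

    weights=None means an unbiased dice. The seed is fixed only so that
    the sketch gives the same numbers on every run.
    """
    rng = random.Random(seed)
    faces = [1, 2, 3, 4, 5, 6]
    outcomes = rng.choices(faces, weights=weights, k=n_throws)
    counts = Counter(outcomes)
    return {face: counts[face] / n_throws for face in faces}

fair = throw_dice(60_000)
# Hypothetical bias: face 6 is three times as likely as any other face.
biased = throw_dice(60_000, weights=[1, 1, 1, 1, 1, 3])

print(fair)    # every relative frequency is close to 1/6
print(biased)  # face 6 is close to 3/8, the other faces close to 1/8
```

Even with the bias, no single throw can be called in advance. Only the long-run frequencies shift.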

Union and Intersection

In simple set theory, the union of two sets is the set of points that are part of at least one of the two sets. The intersection of two sets is the set of points that are part of both the sets.

The same is extended to statistics:

P(A ∪ B) = Probability of event A or event B
P(A ∩ B) = Probability of event A and event B
P(A / B) = Probability of A given that B has occurred

These can be intuitively related to each other as follows:

P(A ∩ B) = P(A) + P(B) - P(A ∪ B)
P(A ∩ B) = P(A / B) * P(B) = P(B / A) * P(A)        # Product rule; rearranging it gives Bayes' Theorem

Often, we drop the explicit symbol for intersection. Thus, P(A ∩ B) is also denoted as P(AB).
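These relations can be checked by counting outcomes for a single throw of an unbiased dice. The sketch below assumes two illustrative events (an even outcome, and an outcome divisible by 3); the helper names `prob` and `cond_prob` are made up for this example.

```python
from fractions import Fraction

SAMPLE_SPACE = {1, 2, 3, 4, 5, 6}  # one throw of an unbiased dice

def prob(event):
    """Probability of an event (a set of outcomes), all outcomes equally likely."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

def cond_prob(a, b):
    """P(A / B): probability of A given that B has occurred."""
    return Fraction(len(a & b), len(b))

A = {2, 4, 6}  # the outcome is even
B = {3, 6}     # the outcome is divisible by 3

p_union = prob(A | B)          # set union: A or B
p_intersection = prob(A & B)   # set intersection: A and B

# The intersection can be obtained from the union and the individual probabilities.
assert p_intersection == prob(A) + prob(B) - p_union
# Product rule: the intersection factors through the conditional probability.
assert p_intersection == cond_prob(A, B) * prob(B) == cond_prob(B, A) * prob(A)
# Bayes' Theorem follows by dividing the product rule by P(B).
assert cond_prob(A, B) == cond_prob(B, A) * prob(A) / prob(B)

print(p_union, p_intersection, cond_prob(A, B))  # 2/3 1/6 1/2
```

Using `Fraction` keeps the arithmetic exact, so the equalities hold without any rounding tolerance.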

Mutually Exclusive Events

Two events A and B are called mutually exclusive if they cannot occur together. That is, if A occurs, we can be sure that B has not occurred. If the outcome of a dice is 2, it cannot be 3! But if it is an even number, it could also be divisible by 3. Thus, the first is a mutually exclusive pair of events. The second is an intersecting pair.

Some intuitive properties of mutually exclusive events A & B are:

# They cannot occur together. So probability of intersection is 0
P(A ∩ B) = 0

# Probability of union is sum of individual probabilities
P(A ∪ B) = P(A) + P(B)
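The two pairs of dice events mentioned above can be compared directly. This is a small sketch with made-up helper names (`prob`, `mutually_exclusive`), assuming an unbiased dice.

```python
from fractions import Fraction

FACES = {1, 2, 3, 4, 5, 6}  # one throw of an unbiased dice

def prob(event):
    """Probability of an event (a set of faces), all faces equally likely."""
    return Fraction(len(event), len(FACES))

def mutually_exclusive(a, b):
    """Two events are mutually exclusive when their intersection has probability 0."""
    return prob(a & b) == 0

two, three = {2}, {3}           # "outcome is 2" and "outcome is 3"
even, div3 = {2, 4, 6}, {3, 6}  # "outcome is even" and "outcome is divisible by 3"

print(mutually_exclusive(two, three))  # True
print(mutually_exclusive(even, div3))  # False, because both contain 6

# For the exclusive pair the probabilities simply add up.
print(prob(two | three) == prob(two) + prob(three))  # True
# For the intersecting pair the plain sum counts the outcome 6 twice.
print(prob(even | div3) == prob(even) + prob(div3))  # False
```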

Independent Events

Two events are independent if they do not affect each other. That is, the probability of an event A does not change by the fact that B has occurred. In other words, P(A / B) is the same as P(A).

Thus, extending Bayes' theorem discussed above, if A and B are independent,

P(A / B) = P(A)
P(A ∩ B) = P(A) * P(B)
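A quick way to test independence is to compare the probability of the intersection with the product of the individual probabilities. The sketch below assumes an unbiased dice; the third event ("outcome is greater than 3") is introduced only for contrast.

```python
from fractions import Fraction

FACES = {1, 2, 3, 4, 5, 6}  # one throw of an unbiased dice

def prob(event):
    """Probability of an event (a set of faces), all faces equally likely."""
    return Fraction(len(event), len(FACES))

def independent(a, b):
    """A and B are independent when P(A ∩ B) equals P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}  # outcome is even
div3 = {3, 6}     # outcome is divisible by 3
high = {4, 5, 6}  # outcome is greater than 3 (extra event for contrast)

print(independent(even, div3))  # True:  1/6 == 1/2 * 1/3
print(independent(even, high))  # False: 1/3 != 1/2 * 1/2
```

Note that "even" and "divisible by 3" intersect, yet they are independent. Independence and mutual exclusivity are different properties.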

Conditional Independence

We can stretch our imagination a little more to understand the concept of conditional independence. Two events A and B may not be independent. But, given that event C has occurred, A and B become independent. Then A and B are called conditionally independent.

P(A / BC) = P(A / C)
P(B / AC) = P(B / C)

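To understand this better, let us consider the set of natural numbers. The probability of picking an odd number (A) is 1/2. The probability of picking an odd number given that you have a prime number (B) is almost 1, since 2 is the only even prime. So A and B are certainly not independent. But, given that the number is less than 5 (C), the picture changes. The candidates are now 1, 2, 3 and 4. Half of them are prime (2 and 3), and half of the odd ones (1 and 3) are prime as well. So, within C, knowing that the number is odd tells us nothing more about whether it is prime.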

P(B) ≠ P(B/A)
P(B/C) = P(B/AC)
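The example can be verified by counting. One assumption is needed to make counting possible: the natural numbers are truncated to 1..100 here, which is a choice made for this sketch and not part of the definition. The helper names are also made up.

```python
from fractions import Fraction

def is_prime(n):
    """Trial division; good enough for small numbers."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def cond_prob(event, given):
    """P(event / given) when every number in `given` is equally likely."""
    return Fraction(len(event & given), len(given))

# Assumption: truncate the natural numbers to 1..100 so that we can count.
numbers = set(range(1, 101))
A = {n for n in numbers if n % 2 == 1}   # odd
B = {n for n in numbers if is_prime(n)}  # prime
C = {n for n in numbers if n < 5}        # less than 5

# Without C, knowing "prime" changes the probability of "odd" a lot.
print(cond_prob(A, numbers), cond_prob(A, B))  # 1/2 24/25

# Given C, the extra knowledge makes no difference in either direction.
print(cond_prob(B, C), cond_prob(B, A & C))    # 1/2 1/2
print(cond_prob(A, C), cond_prob(A, B & C))    # 1/2 1/2
```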

Probability Mass Function

This is a simple numerical function that gives the probability of each possible outcome. That is, for any given outcome in a discrete sample space, it gives the value of its probability. These values add up to 1 over the whole sample space.

For example, in the case of an unbiased dice, the probability mass function is 1/6 for all natural numbers up to 6 (and 0 beyond that).

If we throw the dice twice and add the outcomes, we can get a slightly involved function:

f(n) = (n < 8) ? (n-1) / 36 : (13 - n) / 36        # for n from 2 to 12; 0 otherwise
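The formula for the sum of two throws can be cross-checked by listing all 36 equally likely pairs. This is a minimal sketch; the function names are made up, and the formula is taken to apply for sums from 2 to 12 only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def pmf_by_counting():
    """Count how often each sum occurs among the 36 pairs of outcomes."""
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(count, 36) for total, count in counts.items()}

def pmf_formula(n):
    """The closed form from the text, with 0 outside the range 2..12."""
    if not 2 <= n <= 12:
        return Fraction(0)
    return Fraction(n - 1, 36) if n < 8 else Fraction(13 - n, 36)

counted = pmf_by_counting()

# The closed form agrees with brute-force counting for every possible sum.
assert all(counted[n] == pmf_formula(n) for n in range(2, 13))
# A probability mass function adds up to 1 over the sample space.
assert sum(counted.values()) == 1

print(counted[7])  # 1/6, the most likely sum
```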

Probability Density Function

Please note that the probability mass function is meaningful only when we have a few discrete outcomes. But, if we have a continuous range of outcomes, then the probability density makes a lot more sense.

Probability density is defined as a function P(x) such that $\int_a^b P(x)\,dx$ is the probability that the outcome is between a and b.
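To see how a density turns into a probability, the integral can be approximated numerically. The sketch below assumes a hypothetical outcome that is equally likely anywhere between 0 and 10, so the density is the constant 1/10 on that range. The midpoint rule and the function names are choices made for this illustration.

```python
def uniform_density(x, low=0.0, high=10.0):
    """Constant density on [low, high] and 0 elsewhere (assumed example)."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def probability_between(density, start, end, steps=10_000):
    """Approximate the integral of `density` from start to end (midpoint rule)."""
    width = (end - start) / steps
    return sum(density(start + (i + 0.5) * width) for i in range(steps)) * width

print(probability_between(uniform_density, 2.0, 5.0))   # close to 0.3
print(probability_between(uniform_density, 0.0, 10.0))  # close to 1.0, the whole range
print(probability_between(uniform_density, 4.0, 4.0))   # 0.0, a single point has no width
```

The last line shows why a mass function is not useful here: the probability of any one exact value is 0, and only ranges carry probability.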

Mean, Median and Mode

These are perhaps the most frequently cited measures in statistics. Mean typically refers to the arithmetic mean (occasionally the geometric mean). Median is the value in the center of the sorted set. Mode is the most frequently occurring value in the set.

In a symmetric distribution with a single peak, the Mean, the Median and the Mode coincide. When they differ, we usually have a "skewed" distribution - a long tail stretches out on one side while the major chunk of the data sits on the other. Thus, we have left skewed and right skewed distributions.

The variance is an indication of how spread out the set is as a whole. It is the average squared distance from the mean. Low variance indicates that most of the data is close to the mean. High variance indicates that the data is widely scattered around the mean. Similarly, Kurtosis indicates how heavy the tails are. High Kurtosis indicates heavier tails, that is, more extreme outliers.
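These measures are available in Python's standard `statistics` module, except kurtosis, which is computed by hand below as the fourth central moment divided by the squared variance. The data set is made up for illustration and deliberately has a long right tail.

```python
import statistics

# Made-up sample with a long right tail (the 18 is an outlier).
data = [2, 3, 3, 4, 5, 5, 5, 6, 9, 18]

mean = statistics.mean(data)          # arithmetic mean
median = statistics.median(data)      # middle of the sorted values
mode = statistics.mode(data)          # most frequent value
variance = statistics.pvariance(data) # average squared distance from the mean

# Kurtosis: fourth central moment scaled by the squared variance.
fourth_moment = sum((x - mean) ** 4 for x in data) / len(data)
kurtosis = fourth_moment / variance ** 2

print(mean, median, mode)  # the mean is pulled towards the tail, above the median
print(variance, kurtosis)
```

For comparison, a normal distribution has a kurtosis of 3 under this definition, so a value above 3 suggests heavier tails.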

Covariance & Correlation

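This indicates the amount of relation between two variables. For example, if the two variables vary together - an increase in one often occurs along with an increase (or decrease) in the other - then they are said to be correlated.

The correlation coefficient is the covariance scaled to lie between -1 and 1, and it indicates the strength of the linear dependence. It could be positive or negative depending upon the sign of the covariance. If it is 0, the variables are said to be uncorrelated. That is weaker than independence: independent variables always have zero correlation, but variables with zero correlation can still depend on each other in a non-linear way.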
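A small numeric sketch helps to see what the correlation coefficient does and does not capture. The data and the helper names are made up; the covariance here is the population covariance (divided by the number of points).

```python
import math

def covariance(xs, ys):
    """Average product of the deviations from the respective means."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance scaled by both standard deviations; lies between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [-3, -2, -1, 0, 1, 2, 3]
rising = [2 * x + 1 for x in xs]  # increases along with x
falling = [-x for x in xs]        # decreases as x increases
squares = [x * x for x in xs]     # fully determined by x, but not linearly

print(correlation(xs, rising))   # 1.0
print(correlation(xs, falling))  # -1.0
print(correlation(xs, squares))  # 0.0
```

The last case is worth a second look: the squares are completely determined by x, yet the correlation is 0. The coefficient measures linear relation only.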


Statistics deals with several different types of variable distributions. Some of the important ones are:

  • Uniform Distribution - The probability density is a constant over the range and 0 otherwise.
  • Normal (Gaussian) Distribution - The most commonly used distribution. Here, the probability density function decays exponentially with the square of the distance from the mean.
$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
  • Multivariate Normal Distribution - It is similar to the Normal Distribution, except that it is over multiple dimensions.
  • Bernoulli Distribution - This is a simple distribution when we have only two possible outcomes that are exclusive and exhaustive. Thus, their probabilities add up to 1. For example, a single toss of a coin.
  • Binomial Distribution - This is an extension of the Bernoulli Distribution over multiple independent trials, not just one. It gives the probability of the number of successes. For example, the number of heads when tossing a coin several times.
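  • Poisson Distribution - This is a discrete, right skewed distribution for the number of independent events that occur in a fixed interval at a constant average rate. For example, the number of radioactive decays detected in one second. (The waiting time between such events is what decays exponentially; that is described by the Exponential Distribution.)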
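The distributions listed above can be written down as short functions using their standard textbook formulas. This is a sketch, not a library: the function and parameter names are made up, `p` is the probability of success, `n` the number of trials, and `lam` the average number of events per interval for the Poisson count.

```python
import math

def uniform_pdf(x, low, high):
    """Constant density over [low, high], 0 otherwise."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def normal_pdf(x, mu, sigma):
    """Density that decays with the squared distance from the mean mu."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def bernoulli_pmf(k, p):
    """Single trial: k is 1 (success, probability p) or 0 (failure)."""
    return p if k == 1 else 1.0 - p

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected on average."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Ten tosses of a fair coin: probability of exactly five heads.
print(binomial_pmf(5, 10, 0.5))  # 0.24609375
# With one trial the Binomial reduces to the Bernoulli.
print(binomial_pmf(1, 1, 0.3) == bernoulli_pmf(1, 0.3))  # True
# Poisson probabilities for lam = 3 add up to (almost exactly) 1.
print(sum(poisson_pmf(k, 3.0) for k in range(60)))
```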