This article explains distribution functions and their use in Exploratory Data Analysis in Data Science. In the next article, I'll walk through practical uses of the terms defined here on my sample project.
Exploratory Data Analysis combines many small tasks such as data cleansing, data munging, and creating visualizations to understand the value in data. By studying the distribution of the data, we try to extract that value. Distributions also matter once the data is ready for analysis: when we receive another set of sample data, we run tests such as the t-test or z-test to check whether the sample belongs to our larger dataset. I hope this gives an idea of why understanding distributions is so important in the whole data science process. Let's start with distributions.
A random variable (denoted as X) is a variable whose values (denoted as lowercase x) are the numerical outcomes of a random phenomenon. In other words, it assigns numeric values to the outcomes of a random event, each with its own probability.
There are two types of random variable: the Discrete Random Variable and the Continuous Random Variable.
1. Discrete Random Variable: This variable takes distinct numeric values. For example, a coin flip: the random variable X is the outcome of the flip, and the probability of heads is 0.5.
2. Continuous Random Variable: This variable takes values from a continuous range. For example, suppose we want the exact weight of an animal in a zoo: the exact value measured today will not be the same tomorrow.
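The two kinds of random variable can be sketched with Python's standard library. The coin outcomes and the 100–200 kg weight range here are illustrative assumptions, not taken from any real data:

```python
import random

random.seed(0)

# Discrete random variable: a coin flip takes one of two distinct values.
coin_flip = random.choice(["head", "tail"])

# Continuous random variable: a weight can take any value in a range,
# here drawn uniformly between 100 kg and 200 kg (hypothetical zoo animal).
weight = random.uniform(100, 200)

print(coin_flip)
print(round(weight, 2))
```

Repeating either draw gives different values each time, which is exactly the "random phenomenon" the definitions above describe.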
How a random variable spreads probability over its values is what a distribution function describes. For a normally distributed variable there is the 68-95-99.7 rule, which states that roughly 68% of values fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. These intervals are often used as confidence intervals, but for heavy-tailed distributions you should expect less of the distribution to fall within each multiple of the standard deviation, and the opposite for lighter-tailed distributions.
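The 68-95-99.7 rule can be checked empirically by sampling from a standard normal distribution; the sample size of 100,000 here is an arbitrary choice for illustration:

```python
import random

random.seed(42)

# Draw a large sample from a standard normal distribution (mean 0, sd 1)
# and measure what fraction falls within 1, 2, and 3 standard deviations.
sample = [random.gauss(0, 1) for _ in range(100_000)]

fracs = {k: sum(abs(x) <= k for x in sample) / len(sample) for k in (1, 2, 3)}
for k in (1, 2, 3):
    print(f"within {k} sd: {fracs[k]:.3f}")
```

The three printed fractions land close to 0.68, 0.95, and 0.997, matching the rule.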
When we apply distributions to data, we try to extract value from it, and these distributions are important for generating hypotheses about the data.
Some of the basic distribution functions are:
1. Bernoulli Distribution: Think of a single coin flip with two possible outcomes, head or tail. As another example, suppose a person has probability 0.34 of having health insurance; record 1 if they do and 0 otherwise.
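A minimal sketch of the insurance example: a Bernoulli trial returns 1 with probability p and 0 otherwise, and the average of many trials converges to p. The trial count is an illustrative assumption:

```python
import random

random.seed(1)

p = 0.34  # probability a person has health insurance (from the example above)

def bernoulli(p):
    """One Bernoulli trial: 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

# The empirical mean over many trials converges to p.
trials = [bernoulli(p) for _ in range(100_000)]
empirical_p = sum(trials) / len(trials)
print(round(empirical_p, 3))
```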
2. Binomial Distribution: It is the extension of the Bernoulli distribution to repeated trials. For example, pick 10 people at random: the probability that the first three have health insurance and the remaining seven do not is P(YYYNNNNNNN) = (0.34)^3 * (0.66)^7. Many interesting problems can be addressed via the binomial distribution. However, for large N the binomial distribution can get quite awkward to work with. Fortunately, as N becomes large, it becomes more and more symmetric and begins to converge to a normal distribution.
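The sequence probability above fixes a particular order; multiplying by the number of orderings, C(10, 3), gives the binomial probability of exactly three insured people in any order. A sketch using only the standard library:

```python
from math import comb

p = 0.34   # probability of having insurance (from the example above)
n, k = 10, 3

# Probability of one specific ordering: the first three insured, the rest not.
p_sequence = p**3 * (1 - p)**7
print(round(p_sequence, 6))

# Binomial probability of exactly k successes in n trials, in any order:
# C(n, k) * p^k * (1 - p)^(n - k).
p_binomial = comb(n, k) * p**k * (1 - p)**(n - k)
print(round(p_binomial, 4))
```

The single-ordering probability is about 0.0021, while allowing any ordering raises it to about 0.257.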
3. Normal Probability Distribution: This is the bell-shaped curve that occurs almost everywhere in nature and in the man-made world. A normal distribution is characterized by its mean and standard deviation. The mean determines where the peak occurs, and the standard deviation measures the spread of the distribution, which you can see as the differing widths of the bell curves when several are plotted with the same mean.
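A sketch of the density formula makes the mean/spread relationship concrete: the peak sits at the mean, and a larger sigma gives a lower, wider curve. The three sigma values are arbitrary choices for comparison:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# The peak is at the mean; its height shrinks as sigma grows,
# because the same total probability is spread over a wider curve.
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma={sigma}: peak height {normal_pdf(0, 0, sigma):.3f}")
```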
4. Markov’s inequality: This is not a distribution itself but a bound on tail probabilities. The normal distribution has very light tails that carry very little weight, but many real-world distributions have tails that are not so well behaved. Markov's inequality addresses this concern: for any non-negative random variable X and any a > 0, P(X ≥ a) ≤ E[X]/a, no matter how heavy the tail is.
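The bound can be checked empirically. Here I use an exponential distribution as the non-negative example purely for illustration; the threshold a = 3 is also an arbitrary choice:

```python
import random

random.seed(7)

# Markov's inequality: for a non-negative random variable X and a > 0,
# P(X >= a) <= E[X] / a. Check it on an exponential sample with mean ~1.
sample = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(sample) / len(sample)

a = 3.0
tail_prob = sum(x >= a for x in sample) / len(sample)
bound = mean / a

print(f"P(X >= {a}) ~ {tail_prob:.4f}, Markov bound {bound:.4f}")
```

The empirical tail probability (about 0.05 here) sits well below the guaranteed bound (about 0.33); the bound is loose, but it holds for any non-negative distribution.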
Let us assume we have formed a few hypotheses based on the distribution of the data and its data points. Now you receive another sample and need to verify whether it belongs to your dataset. Here you use either a t-test or a z-test.
If you don't know the true population standard deviation and your sample is smaller than 30, use the t-test; use the z-test when the mean and standard deviation of the larger dataset are known. The z-test uses the z-score, which measures how many standard deviations a value lies from the mean.
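A minimal z-test sketch, assuming a known population mean of 100 and standard deviation of 15 (illustrative numbers, not from the article) and a new sample of 50 observations:

```python
import random
from math import sqrt

random.seed(3)

# Known population parameters (assumed for illustration).
pop_mean, pop_sd = 100.0, 15.0

# A new sample that supposedly comes from that population.
sample = [random.gauss(pop_mean, pop_sd) for _ in range(50)]
n = len(sample)
sample_mean = sum(sample) / n

# z-test statistic: how far the sample mean is from the population mean,
# in units of the standard error (pop_sd / sqrt(n)).
z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
print(f"z = {z:.3f}")

# At the 5% significance level, |z| < 1.96 means the sample mean is
# consistent with having been drawn from the population.
print("consistent with population" if abs(z) < 1.96 else "significantly different")
```

For a small sample with unknown population standard deviation, you would instead estimate the standard deviation from the sample and compare the statistic against the t-distribution with n − 1 degrees of freedom.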
These tests can be run on a single sample or on paired samples, which I'll explain with practical examples in the next blog.