Mej

After 12 years in software QA working with a variety of test data, I was tempted to shift my career into data science and decided to pursue this through a structured master's program. Though I love the three pillars (math, statistics and programming), I did not have an easy start, as I am getting back to studies after a gap of 14 years. As I began learning machine learning, visual analytics, data science, Python, Matlab, R, Tableau, Mondrian etc., I got excited about blogging as a way to summarise my learning. I will try to post frequently and keep it simple. Looking forward to a good time learning and sharing... Cheers, Mej!

Probability Density

An introduction to Probability Density.....

I prefer to use a real-world example that is different from the ones you normally find, like coin tossing or dice rolling. All of us like a change, particularly an example that is likely to happen to us sometime in our lifetime. I will explain the concepts in this series using the case of someone expecting a baby. The two common questions are: (i) When is the child likely to be born? (ii) What is the gender?

We tend to answer these questions with prior knowledge, such as (i) scan results, if available, and (ii) experience, hunches or intuition based on the mother’s condition. This prior knowledge helps in answering these questions, but not with complete certainty. I say ‘uncertainty’ because even prenatal gender detection with the available techniques has been wrong in the past.

  1. Random event – There is no certain answer to these questions: the outcome is not predictable and can be known only when it occurs. For example, even with twins, we learn the gender of the first baby when it is born, but that tells us nothing about the gender of the other baby.
  2. Discrete variable – At the time of birth, gender can only be a boy or girl and is therefore a discrete variable with only 2 outcomes.
  3. Continuous variable – The child can be born at any time during the gestation period: an early end to the pregnancy due to complications, a premature birth, or a late birth after the due date. The time of birth can therefore take any value in a whole range, which makes it a continuous variable.
  4. Probability Mass Function (PMF) – gives the probability of each outcome of a discrete random variable, here a boy or a girl. The probabilities of all outcomes must sum to 1. For example: P(boy) = 0.5, P(girl) = 0.5, so P(boy) + P(girl) = 1.
  5. Probability Density Function (PDF) – describes how the probability of a continuous random variable is spread over its range. Imagine we had data on child birth from conception to delivery and plotted how likely birth is at different points in time (0-10 months). The result is likely to be a PDF that looks roughly like a Gaussian distribution. The height of the curve is a density, not a probability; the area of any small strip under the curve gives the probability of birth within that interval.
  6. Cumulative Distribution Function (CDF) – We normally expect the child to be born once 9 months are complete, but babies may be born prematurely in the earlier months, or be so reluctant to come out that it becomes a late delivery. If the mother develops complications at an early stage and is warned of a premature birth, nobody thinks of the chance that the baby will be born in the 6th month, on the second day, in the 4th hour and at the 50th second. It is more sensible to think of the chance, or probability, that the baby will be born before some point in time, say before week 21. The CDF gives exactly this: the probability that the variable is less than or equal to a given value.
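To make the PMF and PDF items a little more concrete, here is a minimal Python sketch. The 0.5/0.5 split comes from the example above; the Gaussian for the time of birth (mean 40 weeks, standard deviation 2 weeks) is purely my assumption for illustration and is not fitted to any real data. The names `gender_pmf` and `birth_time` are made up for this sketch.

```python
from statistics import NormalDist

# PMF for the discrete variable "gender": each outcome maps to its probability.
# The 0.5/0.5 split is the value used in the list above.
gender_pmf = {"boy": 0.5, "girl": 0.5}
total_probability = sum(gender_pmf.values())

# PDF for the continuous variable "time of birth", measured in weeks.
# ASSUMPTION: a Gaussian with mean 40 weeks and standard deviation 2 weeks,
# chosen only for illustration and not taken from real data.
birth_time = NormalDist(mu=40, sigma=2)

density_at_40 = birth_time.pdf(40)  # height of the curve at its peak
density_at_36 = birth_time.pdf(36)  # lower height, two standard deviations away

print("PMF total:", total_probability)
print("density at week 40:", density_at_40)
print("density at week 36:", density_at_36)
```

Note that the two density values are heights of the curve, not probabilities. They only turn into probabilities once you take an area over an interval.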

This question can be answered from either the PDF or the CDF, using the point X = week 21. One can calculate P(birth <= week 21) by adding up the area under the PDF, strip by strip, up to the point X = week 21. The CDF saves this tedious calculation because it already holds the cumulative probability for that point. So the smarter way is to look up X = week 21 in the CDF.
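The two routes described above can be compared in a rough sketch like the one below. It again assumes an illustrative Gaussian (mean 40 weeks, standard deviation 2 weeks), which is my own choice and not real data. Under those made-up parameters the probability of birth before week 21 is practically zero, so the demo uses a hypothetical cutoff of week 37 to get numbers that are easier to read; the same calls work for any cutoff. The helper `prob_by_summing_pdf` is a name I made up.

```python
from statistics import NormalDist

# ASSUMPTION: illustrative Gaussian, mean 40 weeks, standard deviation 2 weeks.
birth_time = NormalDist(mu=40, sigma=2)

def prob_by_summing_pdf(dist, cutoff, start=0.0, steps=100_000):
    """Approximate P(X <= cutoff) by adding thin strips under the PDF.

    Each strip has the same width and uses the PDF height at its midpoint.
    """
    width = (cutoff - start) / steps
    return sum(dist.pdf(start + (i + 0.5) * width) * width for i in range(steps))

cutoff_week = 37  # hypothetical cutoff chosen for the demo

slow_answer = prob_by_summing_pdf(birth_time, cutoff_week)  # the tedious way
fast_answer = birth_time.cdf(cutoff_week)                   # the CDF lookup

print("summing strips of the PDF:", slow_answer)
print("looking up the CDF:       ", fast_answer)
```

Both numbers should agree to several decimal places; the CDF just gets there in one call instead of a hundred thousand small additions.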

Point to note: The CDF at any point is the total area under the PDF to the left of that point. As this example shows, the CDF is often the more convenient of the two when we want a probability for a continuous variable.

Point to note: Discrete variables also have a CDF, built the same way by adding up the PMF, but the PMF itself gives the exact probability at any point.
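Here is a small sketch of the discrete case. A CDF needs outcomes that can be ordered, so instead of boy/girl I use a hypothetical variable: the number of boys among twins, assuming each baby is independently a boy with probability 0.5 (a simplification in line with the twins remark earlier). The names `boys_pmf` and `boys_cdf` are my own.

```python
from itertools import accumulate

# ASSUMPTION: number of boys among twins, with each baby independently
# a boy with probability 0.5 (a simplification used only for this sketch).
boys_pmf = {0: 0.25, 1: 0.5, 2: 0.25}

# Build the CDF by keeping a running total of the PMF in outcome order.
outcomes = sorted(boys_pmf)
running_totals = accumulate(boys_pmf[k] for k in outcomes)
boys_cdf = dict(zip(outcomes, running_totals))

print("P(exactly one boy):", boys_pmf[1])  # read straight from the PMF
print("P(at most one boy):", boys_cdf[1])  # read from the CDF
```

The PMF answers "exactly this outcome", while the CDF answers "this outcome or anything below it".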

Using CDF:
Assume that you want to know the probability that the baby will be born just as 56 weeks are completed. We can take a small interval to the left and right of the desired point x = 56 weeks in the PDF (for example, a few days on either side) and integrate the area under the curve over that interval, which is the same as subtracting the CDF at the left end of the interval from the CDF at the right end. You might wonder why we don’t look up the probability at exactly the 56-week point, as we would with the PMF for a discrete variable. Well, for a continuous variable the probability at a single point is zero.

Remember, the x-axis of a continuous variable contains an infinite number of points. If each of these points had even a very small positive probability, the sum over infinitely many points would be infinite. But we know that the probabilities must add up to 1, so we can assign positive probability only to an interval (an area under the curve) and not to a point. If you are still unconvinced, here’s another explanation! Say you integrate the PDF at the point of interest. The definite integral then has the same start and end point, so it covers an interval of zero width and its value is zero.
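The shrinking-interval argument can be watched in action with a short sketch. As before, the Gaussian (mean 40 weeks, standard deviation 2 weeks) and the point of interest (week 40) are assumptions I picked only for illustration, and `prob_between` is a made-up helper name.

```python
from statistics import NormalDist

# ASSUMPTION: illustrative Gaussian, mean 40 weeks, standard deviation 2 weeks.
birth_time = NormalDist(mu=40, sigma=2)

def prob_between(dist, low, high):
    """P(low <= X <= high), computed as the difference of two CDF values."""
    return dist.cdf(high) - dist.cdf(low)

point = 40.0                         # hypothetical point of interest, in weeks
half_widths = [1.0, 0.1, 0.01, 0.0]  # shrink the strip around the point

strip_probs = [prob_between(birth_time, point - h, point + h) for h in half_widths]

for h, p in zip(half_widths, strip_probs):
    print(f"interval {point - h} to {point + h} weeks: probability {p}")
```

As the strip gets narrower the probability keeps dropping, and once the strip has zero width the probability is exactly zero.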

Point to note: Probability can be obtained for a point from the PMF of a discrete variable, and for an interval from the PDF of a continuous variable.

Mathematical relationship between PDF and CDF: PDF is the derivative of CDF
It was said above that accumulating the area under the PDF gives the CDF. Mathematically, the CDF is the integral of the PDF, so the reverse also holds: the PDF is the derivative of the CDF. To make it more obvious, the PDF is the rate of change of the CDF. Where the PDF is high, the CDF rises steeply; where the PDF is close to zero, the CDF is nearly flat. Because the PDF is never negative, the CDF never decreases.
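One way to check this relationship numerically is to estimate the slope of the CDF and compare it with the PDF at the same point. The sketch below does that with a central difference. The Gaussian (mean 40 weeks, standard deviation 2 weeks), the step size and the helper name `cdf_slope` are all assumptions of mine for illustration.

```python
from statistics import NormalDist

# ASSUMPTION: illustrative Gaussian, mean 40 weeks, standard deviation 2 weeks.
birth_time = NormalDist(mu=40, sigma=2)

def cdf_slope(dist, x, h=1e-5):
    """Central-difference estimate of the slope (derivative) of the CDF at x."""
    return (dist.cdf(x + h) - dist.cdf(x - h)) / (2 * h)

weeks = [36, 38, 40, 42, 44]
slopes = [cdf_slope(birth_time, w) for w in weeks]  # rate of change of the CDF
densities = [birth_time.pdf(w) for w in weeks]      # value of the PDF

for w, s, d in zip(weeks, slopes, densities):
    print(f"week {w}: slope of CDF = {s:.6f}, PDF = {d:.6f}")
```

The two columns should match closely at every week, and the slope is largest at the peak of the PDF, which is where the CDF climbs fastest.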


