Statistics in Data Science

DATA SCIENCE WITH KESHAV - LESSON 3: Statistics in Data Science

Hello, and welcome to Lesson 3 of my tutorial series, “Data Science with Keshav”. For an overview of what this series is about, you can check out my other post, Data Science 101. In this part of the series, we are going to learn some concepts of statistics.

If you google the definition of statistics, you will get something like this:

“the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.”

To be frank, statistics forms the foundation of data science. Let’s go through this definition. Statistics is all about inferring something about the whole from a representative sample. But wait, what is this “sample”? A sample is a portion (a subset) of the entire data set, which is called the population. The sample must be representative of the entire population. There are many ways to draw a sample, collectively called sampling procedures, such as simple random sampling, stratified sampling and so on. For now, we will not discuss the sampling process. This doesn’t mean we don’t need the concept of sampling. We do, but we will deal with it later. First, let’s discuss some fundamental things we need for our journey into the world of data science.

There are two main branches you need to understand first: descriptive statistics and inferential statistics. Descriptive statistics is about summarizing and describing data, whereas inferential statistics is about drawing conclusions about a population from a sample. Think of it as “what is going on?” versus “what does it imply?”.

OK, back to the fundamental concepts you need to carry on. Let’s talk about two main ideas in statistics:

  1. Measure of Central Tendency and
  2. Measure of Dispersion

A measure of central tendency gives you a single value that represents your data. It is all about measuring an average, and there are several ways of doing that, the most popular being the mean, the median and the mode. Each of these three measures the average in a different way, but they all try to summarize the entire data set. Each comes with its own trade-offs. The mean takes every value in the data set into account, but it gets affected by extreme values. The median is the middle value of the sorted data, so extreme values barely affect it, but it does not use the magnitude of every value the way the mean does. The mode is simply the most frequent value, so it ignores everything else in the data. This gives us a lot of choice, and a lot of confusion. Which average do I choose? Well, this depends on your problem domain. Let’s say you are running a clothes shop and you want to know which styles or sizes sell the most. You would preferably use the mode here. I will leave the rest of this section as a research exercise for you. I hope you will post your findings in the comments so that we can discuss them.
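To make the trade-offs between the three averages concrete, here is a minimal sketch using Python's built-in statistics module. The salary figures and the list of sizes sold are made-up numbers for illustration only. Without the extreme value, the mean is 36, the median is 35 and the mode is 32. After adding a single value of 500, the mean jumps to 94 while the median only moves to 36.5 and the mode stays at 32.

```python
import statistics

# Made-up monthly salaries (in thousands) for a small team.
salaries = [30, 32, 32, 35, 38, 40, 45]

print("mean:  ", statistics.mean(salaries))    # uses every value
print("median:", statistics.median(salaries))  # middle value after sorting
print("mode:  ", statistics.mode(salaries))    # most frequent value

# Add one extreme value and see which averages move.
with_outlier = salaries + [500]

print("mean with outlier:  ", statistics.mean(with_outlier))
print("median with outlier:", statistics.median(with_outlier))
print("mode with outlier:  ", statistics.mode(with_outlier))

# The mode also works on categories, like the clothes shop example:
# hypothetical sizes sold in one day.
sizes_sold = ["M", "L", "M", "S", "M", "L", "XL"]
print("best-selling size:", statistics.mode(sizes_sold))
```

Notice that the mean and median cannot even be computed for the sizes, which is one reason the mode is the natural choice for the shop.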

Suppose I calculate the mean marks of 100 students and get 89, and I then decide to give all my students the same marks, because I believe the mean generalizes my data. One student, say Mr. X, is happy about this since he had scored only 22, whereas Mr. Y is sad and angry with me because he had scored 100/100 but got only 89. Where did his 11 points go? This is a very important question. To check whether I am right or wrong, I must look at all my students. If the students whose marks were around 89 are neutral, and a significant majority of students are neutral, I can consider this generalization reasonable. It will be very wrong if that is not the case. So, what makes my mean unfit for generalization? Here we need to talk about measures of dispersion.
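Here is a small sketch of that thought experiment. The two classes below are made-up data with only 10 students each (not 100), chosen so that both have a mean of exactly 89. The helper share_near_mean and its tolerance of 3 marks are my own assumptions, standing in for the idea of a student being “neutral”.

```python
# Made-up marks for two small classes; both have a mean of 89.
class_a = [87, 88, 88, 89, 89, 89, 89, 90, 90, 91]
class_b = [22, 100, 100, 100, 100, 98, 96, 94, 92, 88]

def mean(marks):
    return sum(marks) / len(marks)

def share_near_mean(marks, tolerance=3):
    """Fraction of students whose marks are within `tolerance` of the mean."""
    m = mean(marks)
    near = [x for x in marks if abs(x - m) <= tolerance]
    return len(near) / len(marks)

for name, marks in [("class A", class_a), ("class B", class_b)]:
    print(f"{name}: mean = {mean(marks)}, "
          f"share within 3 marks of the mean = {share_near_mean(marks)}")
```

In class A every student sits within 3 marks of the mean, so giving everyone 89 upsets nobody much. In class B only 2 out of 10 students do, so the same mean is a poor summary of the class.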

Measures of dispersion give us an insight into how far our data deviates from that generalization. Suppose I measure this deviation as the mean squared deviation, that is sumof((x-mean)^2)/n, where x is each data point and n is the total number of data points. This quantity is called the variance, and its square root, sqrt(sumof((x-mean)^2)/n), is what we call the standard deviation. Taking the square root brings the measure back to the same units as the data. The concept of standard deviation is very important because it opens the door to the concepts we will cover in upcoming sessions: distributions and probability.
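The formula can be sketched in a few lines of Python. The mean squared deviation sumof((x-mean)^2)/n is usually called the variance, and the standard deviation is its square root. The data below is made up, and the function names are my own. I divide by n (the population version) to match the formula in the text. The last lines compare the result with pvariance and pstdev from Python's statistics module, which compute the same population quantities.

```python
import math
import statistics

# Made-up data points for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

def population_variance(values):
    """Mean squared deviation from the mean: sum((x - mean)^2) / n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def population_std_dev(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(population_variance(values))

print("variance:          ", population_variance(data))
print("standard deviation:", population_std_dev(data))

# Cross-check against the standard library.
print("statistics.pvariance:", statistics.pvariance(data))
print("statistics.pstdev:   ", statistics.pstdev(data))
```

For this data the mean is 5, the variance is 4 and the standard deviation is 2.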

Well, what is the meaning of standard deviation? Suppose the marks of our 100 students are normally distributed (I will explain the normal distribution in coming sessions, but here I will assume you have heard of it). Then around 68% of the data lies within one standard deviation of the mean, that is from mean - 1 sd to mean + 1 sd, and around 95% lies from mean - 2 sd to mean + 2 sd. In the special case of the standard normal distribution, the mean is zero and the standard deviation is 1, so these ranges are simply -1 to +1 and -2 to +2. This makes things clearer. A low standard deviation means the data is densely packed around the mean, and a high standard deviation means the data is spread out more sparsely. This is how the standard deviation explains the variation in our data.
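You can check the 68% and 95% figures yourself with a quick simulation. This is only a sketch under my own assumptions: I simulate 100,000 marks (far more than 100, so the percentages are stable) from a normal distribution with a mean of 70 and a standard deviation of 10, both picked arbitrarily. The helper share_within is a name I made up for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated marks, assumed to be normally distributed (mean 70, sd 10).
marks = [random.gauss(70, 10) for _ in range(100_000)]

mean = sum(marks) / len(marks)
sd = statistics.pstdev(marks, mu=mean)

def share_within(values, center, spread, k):
    """Fraction of values lying within k * spread of center."""
    lower, upper = center - k * spread, center + k * spread
    return sum(lower <= v <= upper for v in values) / len(values)

within_1_sd = share_within(marks, mean, sd, 1)
within_2_sd = share_within(marks, mean, sd, 2)

print(f"within 1 sd of the mean: {within_1_sd:.1%}")
print(f"within 2 sd of the mean: {within_2_sd:.1%}")
```

The two printed shares should come out close to 68% and 95%. Try changing the mean or the standard deviation in random.gauss: the marks themselves change, but the shares stay roughly the same.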

With these concepts of measures of central tendency and measures of dispersion, we can move ahead and introduce new concepts about the different distributions in statistics.

I know we have not done much coding yet, but let us build our understanding first, so that whenever I talk about something in statistics in the future it will not sound like jargon to you.

In the next article, I will briefly talk about a few distributions and their uses, and about probability.


About Keshav Bhandari 6 Articles
Deep Learning Enthusiast. PhD Student at Texas State University.
