Hello, and welcome to Lesson 2 of my tutorial series, “Data Science with Keshav”. For an overview of what this tutorial series is about, you can check out my other post, Data Science 101. I also recommend having a look at Lesson 1 of the series, titled Python Virtual Environment and Linux User Management, if you haven’t already. In this part of the series, I’m going to give an overview of the technology, programming language, and libraries that we will be using.
Let us start by asking the same question: what is Data Science? Data Science is a field of mixed domains: mathematics, statistics, programming, analytical skills, and domain knowledge, fused together into one interdisciplinary field. The term data science seems new and trendy; however, its techniques and principles have been practiced for a long time. The massive growth and availability of data have pushed us to seek insight in that data, so the realm of Data Science is getting more and more popular, as it can unleash the real power of data. In the past, we lacked both the data and the sophisticated technology needed to process such huge amounts of it. But with the advance of technology, the open source community, and tech giants like Google, Facebook, Microsoft, and others, progress today is as fast as lightning.
You might have heard terms like business analytics, business intelligence, risk modeling, predictive shopping, demand prediction, and many more. Nowadays, people call data science by various names. However, the core essence of data science lies in common fundamental principles, regardless of the terminology used. The same goes for the technology used for Data Science. You might meet someone who uses R for data analysis, while others use Python, or both. But behind the scenes, they all rely on the same fundamental principles.
Programming Language, OS, and Libraries for Data Science
Well, in this post I would like to talk about a few of the technologies we are going to cover in this learning path. For now, we are not going to talk about big data and related technologies like Hadoop and Spark.
In our learning path, we will use Python as our programming language and Linux as our operating system. You can also use Windows or Mac, but I prefer Linux. If you want to know why Linux, post your question and we can discuss it in the comment threads below.
Now, are you wondering why Python? Python is the best; yes, you heard that right. Well, there are lots of other platforms, but choosing Python is a wise decision if you are in data science. Python is an advanced language, and at the same time it gives you a flexible, interactive way to work. This is really handy if you want agility in your data science work. And if you have a large project, Python is still a good fit. I must say that if you are into data science, you also need to be familiar with R programming. R is a statistical programming language, and it gives you off-the-shelf tools for statistical calculation. Learning R has a steep learning curve, but something tells me that’s not going to stop you. I suggest you learn R as well, as it will be helpful, but remember to make Python your primary programming language.
We will be discussing several libraries throughout the course: pandas for data manipulation and basic data analysis; statsmodels for data analysis; and matplotlib, bokeh, seaborn, etc. for data visualization (we will spend more time on matplotlib). I will also introduce libraries like scikit-learn, numpy, and scipy.
These latter libraries are for advanced data analysis techniques, where we use machine learning.
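To give a small taste of what “data manipulation and basic data analysis” with pandas can look like, here is a minimal sketch. It assumes pandas is already installed in your environment, and the `sales` table with its column names and numbers is made up purely for illustration; it is not data we will use later in the series.

```python
import pandas as pd

# Hypothetical toy data, invented only for this illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "units": [10, 7, 12, 5],
        "price": [2.5, 3.0, 2.5, 3.0],
    }
)

# Data manipulation: derive a new column from two existing ones.
sales["revenue"] = sales["units"] * sales["price"]

# Basic data analysis: total revenue for each region.
revenue_by_region = sales.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

You can type these lines one at a time in an interactive Python session and inspect `sales` after each step, which is the kind of interactive workflow mentioned above.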
Well, that’s all for now. Let’s continue in the next post.