A Complete Guide to Self-Learn Data Science Fundamentals


 A step-by-step guide to learn all the key concepts and become job-ready

Introduction
Self-learning data science can be stressful. There are many topics to learn and practice, and many people fail to sustain the energy required to get past the initial learning phase. The main reasons people fail, or find it a difficult journey, are:
  • Lack of clarity on the topics to learn
  • No single resource or platform covers everything about data science
  • There are a ton of resources on the internet, but identifying the ones most suitable for you is challenging
  • It is easy to get lost in the details
  • It is hard to track progress and test your skills while self-learning
People who enroll in a data science course don't face most of these issues; they have a support system to help and guide them. That is not the case for self-learners. This article will help you plan your learning journey better. The timelines mentioned here are based on an average learner; depending on your educational background and experience, they could vary slightly for you. The plan also includes free resources to learn from for each topic.
Week 1 to 3 — Python Programming
The first step in learning data science is to get comfortable with a programming language. As per a recent Kaggle survey, about 80% of respondents use Python as their primary language at work. If you are new to programming, it is highly recommended to start with Python.
One of the best introductory Python courses can be found on Kaggle. It takes approximately 5 hours to complete.
Almost anything you do in a data science project involves coding: reading data from the sources, exploring it, extracting insights, transforming it, engineering features, building models, evaluating their performance, and deploying them.
It is highly recommended to spend enough time getting familiar with Python's various functionality. It is not rocket science; it can be acquired through practice. About 2–3 weeks is enough for someone with little or no coding experience. The most important step is to keep practicing: the more you practice, the better you become!
The key topics to focus on while learning Python are:
  • Basic syntax
  • Collection data types
  • Control flow
  • Loops and Iterations
  • Functions and lambda functions
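The topics above fit in a short snippet. Everything here is illustrative; the names and values are made up:

```python
# A small tour: collection types, control flow, loops, functions,
# and lambda functions.

prices = [250000, 310000, 175000, 420000]  # a list (collection type)

def average(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

avg = average(prices)
labels = []
for p in prices:                        # loop / iteration
    if p > avg:                         # control flow
        labels.append("above average")
    else:
        labels.append("at or below average")

# Lambda functions are handy for one-off transformations.
in_thousands = list(map(lambda p: p / 1000, prices))

print(avg)            # 288750.0
print(in_thousands)   # [250.0, 310.0, 175.0, 420.0]
```

Try rewriting the loop as a list comprehension once the explicit version feels comfortable; both forms are idiomatic Python.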
Week 4 to 6 — Working with Data and Manipulation
The first step in any data science project is to understand the problem from the data's point of view. The data you get will never be perfect; it will require a lot of manipulation. The most important Python library for working with and manipulating data is Pandas.
The Pandas library offers a wide range of functionality that makes data analysis much easier. If you are new to Python or Pandas, start with the simple 10-minute tutorial from PyData.
The best way to improve your Pandas skills is to use them often. Pick an interesting dataset on Kaggle, note down all the interesting questions you want answered, then explore the data to answer them. Picking an interesting dataset matters: it keeps your interest high, which helps a lot with learning.
For example, if you are interested in housing prices, select a house price dataset. Note down your questions, such as:
  • What is the average price of a property?
  • What is the average age of the property?
  • As the property ages, does it impact the overall price?
  • What factors drive the property price?
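Questions like these take only a few lines of Pandas. A minimal sketch, using a tiny hand-made table as a stand-in for a real Kaggle dataset (column names and values are made up):

```python
import pandas as pd

# A tiny made-up housing table; a real dataset would be read with
# pd.read_csv("...") instead.
df = pd.DataFrame({
    "price": [250000, 310000, 175000, 420000],
    "age_years": [30, 12, 45, 5],
})

avg_price = df["price"].mean()        # average property price
avg_age = df["age_years"].mean()      # average property age

# Does age impact price? A quick first check via correlation.
age_price_corr = df["price"].corr(df["age_years"])

print(avg_price, avg_age, age_price_corr)
```

A negative correlation here would suggest that older properties tend to cost less; with a real dataset you would follow up with plots and more careful analysis.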
The various Pandas concepts to focus on are:
  • Creating, reading, and writing data frames
  • Selection and Assignment
  • Aggregation and Group By
  • Handling missing data
  • Merging data from different sources
  • Summary, crosstab and pivot functionalities
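Several of these concepts can be sketched on two small made-up tables (the table names and columns are invented for illustration):

```python
import pandas as pd

# Two small tables at different granularity, as often found in practice.
sales = pd.DataFrame({
    "city": ["Austin", "Austin", "Dallas", "Dallas"],
    "year": [2022, 2023, 2022, 2023],
    "units": [10, 15, 8, 12],
})
regions = pd.DataFrame({
    "city": ["Austin", "Dallas"],
    "region": ["Central", "North"],
})

# Merging data from different sources.
merged = sales.merge(regions, on="city", how="left")

# Aggregation and group-by.
per_region = merged.groupby("region")["units"].sum()

# Pivot: one row per city, one column per year.
pivot = merged.pivot_table(index="city", columns="year", values="units")

print(per_region)
print(pivot)
```

The `how="left"` argument keeps every sales row even if a city is missing from the regions table, which is usually the safe default when enriching a fact table.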
Week 7 to 9 — Working with Arrays
NumPy is the library that enables efficient work with arrays, which are often multi-dimensional. NumPy improves computation speed and makes efficient use of memory, and it supports many mathematical functions. It is also used inside many other Python packages, such as Pandas, Matplotlib, and scikit-learn.
In many data science projects, we work on numerical data, and even non-numerical attributes are generally transformed into numerical data. Hence learning to work with NumPy is critical for anyone keen on getting into data science. The key topics to learn about NumPy are:
  • Creating 1, 2, and 3-dimensional arrays
  • Indexing, slicing, joining, and splitting
  • Iteration and manipulation
  • Sort, search, and filter
  • Mathematical and statistical operations
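A quick tour of these topics (all arrays here are toy examples):

```python
import numpy as np

# Creating 1-, 2-, and 3-dimensional arrays.
a1 = np.array([3, 1, 2])
a2 = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
a3 = np.zeros((2, 2, 2))

# Indexing, slicing, sorting, and filtering.
first_row = a2[0]                 # -> array([0, 1, 2])
sorted_a1 = np.sort(a1)           # -> array([1, 2, 3])
evens = a2[a2 % 2 == 0]           # boolean-mask filtering

# Mathematical and statistical operations are vectorized:
total = a2.sum()                  # 15
col_means = a2.mean(axis=0)       # per-column means
print(total, col_means)
```

The boolean-mask idiom (`a2[a2 % 2 == 0]`) and the `axis=` argument are worth internalizing early; they appear constantly in real analysis code.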
Week 10 — Learn Visualization
The success of a data science project depends on:
  • How well the data science team understands the problem
  • How clearly the data science team communicates the insights
The one critical element that helps with both is the ability to visualize the data well.
Humans are good at identifying patterns and trends in visual data; the brain finds it much harder to spot patterns in tabular or other raw formats. Learning the art of using visualization to analyze and communicate goes a long way toward success.
There are many packages and libraries supporting visualization. Instead of worrying too much about the different options, follow these simple steps:
  • Learn about Matplotlib — It is highly customizable
  • Learn about Seaborn — less customizable, but quick and easy for building visuals; a good option for data analysis
  • Build interactive charts — to better communicate with end users
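A minimal Matplotlib sketch of the analysis use case, using the same made-up housing numbers as earlier (Seaborn and interactive libraries build on the same ideas):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

# Made-up data: property price vs. age.
ages = [5, 12, 30, 45]
prices = [420000, 310000, 250000, 175000]

fig, ax = plt.subplots()
ax.scatter(ages, prices)
ax.set_xlabel("Property age (years)")
ax.set_ylabel("Price")
ax.set_title("Price vs. property age")
fig.savefig("price_vs_age.png")   # hypothetical output file name
```

Labeling the axes and titling every chart is a small habit that pays off when the figure is later shown to someone who did not write the code.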
Week 11 to 12 — Statistics for Data Science
Statistics is used at every stage of a data science project. Descriptive statistics help you understand the data and summarize it for easy communication.
Inferential statistics help extract insights that can't be identified by other means. For example, with real estate data, it can tell you whether the rating of the nearest school or the distance to the nearest freeway has a bigger impact on property prices. And it is not just for data analysis: while building a predictive model, statistics is essential for measuring the model's performance.
One important thing to understand while learning statistics: it is not a small area that can be covered in a few weeks; people do entire bachelor's and master's degrees in it. Your aim should be to learn just enough to get started, then refresh your knowledge as the work demands. The key topics to learn are:
  • Descriptive and inferential statistics
  • Types of distributions
  • Central limit theorem and margin of error
  • Confidence interval and confidence level
  • Causation and correlation
  • Statistical tests
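The central limit theorem is one of the few items on this list you can check numerically in a few lines. A sketch with NumPy only (seed and parameters are arbitrary): sample means of a skewed distribution spread out roughly as sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A skewed, clearly non-normal population.
population = rng.exponential(scale=2.0, size=100_000)

n = 50
sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

observed = np.std(sample_means)
predicted = population.std() / np.sqrt(n)   # CLT prediction
print(observed, predicted)                  # the two values should be close
```

Plotting a histogram of `sample_means` would also show it looking approximately normal even though the population is exponential, which is the theorem's other claim.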
Week 13 to 15 — Learn SQL
Many people interested in learning data science fail to focus on SQL. In fact, SQL is one of the most important skills for a data scientist: the data mostly resides in structured data stores, and SQL knowledge is essential for working with it.
Those coming from a non-programming background especially need to build SQL skills, and even those with academic exposure to SQL need more practice to understand the key concepts well. In real-life scenarios, the data can be spread across different tables at different granularities; only with good SQL skills can you bring it into a format that answers your questions. One good platform to learn SQL by working on data is,
Below are some of the frequently used SQL concepts:
  • Selecting data spread across different tables
  • Filtering the required dataset
  • Aggregating the data to the required granularity
  • Using RANK() and ROW_NUMBER() to select records in a specific sequence
  • Breaking down complex queries into sub-queries
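You can practice most of these concepts without installing anything, using Python's built-in sqlite3 module. A sketch on two made-up tables at different granularity (table and column names are invented):

```python
import sqlite3

# An in-memory database with a fact table and a lookup table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0);
""")

# Select across tables, filter, and aggregate to customer granularity.
rows = con.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.amount > 10
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

print(rows)
```

The same connection also supports window functions such as ROW_NUMBER() and subqueries, so it is a convenient sandbox for every concept in the list above.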
Week 16 to 20 — Learn Data Analysis and Feature Engineering
The final step in learning the fundamentals is exploratory data analysis and feature engineering. In any data science project, more than 70% of the time is spent on data analysis. When working on predictive problems, feature engineering helps improve accuracy.
Data analysis and feature engineering skills can't be learned just by reading or signing up for a course; they can only be acquired through practice. The more hands-on you stay while learning, the better you learn and the longer it sticks.
Some of the commonly used feature engineering techniques are:
  • Binning
  • Scaling
  • One-Hot Encoding
  • Log Transformation
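The four techniques above can be sketched with Pandas and NumPy on a made-up housing table (column names and bin edges are illustrative choices, not a standard recipe):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [250000, 310000, 175000, 420000],
    "age_years": [30, 12, 45, 5],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
})

# Binning: bucket a continuous column into categories.
df["age_bin"] = pd.cut(df["age_years"], bins=[0, 15, 35, 100],
                       labels=["new", "mid", "old"])

# Scaling: min-max scale a column into the 0-1 range.
p = df["price"]
df["price_scaled"] = (p - p.min()) / (p.max() - p.min())

# One-hot encoding: turn a categorical column into indicator columns.
df = pd.get_dummies(df, columns=["city"])

# Log transformation: compress a right-skewed column.
df["log_price"] = np.log(df["price"])

print(df.head())
```

In a real project, scaling parameters (the min and max here) must be computed on the training set only and then reapplied to new data, to avoid leaking information from the test set.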
The more you work with Kaggle datasets, the more you will learn about feature engineering, and the discussion forums are a great place to pick up new techniques. There is no single correct way, and few limits, in feature engineering: the more creative you are, the better your results can get.