Because statistics is grounded in mathematics, it helps us gain strong insights into the structure of data and arrive at concrete solutions instead of guesstimates. This makes statistics a vital part of Data Science.
Let us understand what the basic statistical terms mean.
A confusion matrix is an N × N matrix (N = number of target classes) that holds the counts of predicted versus actual target values. It helps in evaluating the performance of a classification model.
Let’s take a look at a confusion matrix for binary classification to understand the terminology used.
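As a minimal sketch of the idea, the counts for a binary problem can be tallied in plain Python. The `actual` and `predicted` lists and the `confusion_matrix` helper below are hypothetical, made up purely for illustration; rows are actual classes and columns are predicted classes.

```python
# Hypothetical labels for illustration only: 1 = positive class, 0 = negative class.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]


def confusion_matrix(actual, predicted, n_classes=2):
    """Return an N x N list of lists: rows = actual class, columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


cm = confusion_matrix(actual, predicted)

# For the binary case the four cells have familiar names.
tn, fp = cm[0]  # actual negative: true negatives, false positives
fn, tp = cm[1]  # actual positive: false negatives, true positives

# Accuracy is the share of predictions on the diagonal.
accuracy = (tp + tn) / len(actual)
print(cm, accuracy)
```

The diagonal holds the correct predictions, so any metric built on "how often was the model right" starts from those cells.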
A structured dataset usually consists of multiple columns with either numerical or categorical data. Most Machine Learning algorithms operate on numbers, not text. Hence, we need to convert the textual/categorical data to numbers before we use it to train a model.
The conversion of categorical data into numerical data is called Categorical Encoding.
In this blog we’ll be looking at two widely used techniques for categorical encoding:
1. Label Encoding
2. One-Hot Encoding
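Here is a rough sketch of both techniques using Pandas. The `response` column and its values are hypothetical and are not the survey data from the post; the post itself may use a different library for the encoding.

```python
import pandas as pd

# Hypothetical survey answers, invented for this example.
df = pd.DataFrame({"response": ["Yes", "No", "Maybe", "Yes", "No"]})

# Label encoding: each category is mapped to an integer code.
# With the pandas categorical dtype, codes follow the sorted category order.
df["response_label"] = df["response"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["response"], prefix="response", dtype=int)

print(df)
print(one_hot)
```

Label encoding keeps a single column but implies an order between the codes, while one-hot encoding avoids that implied order at the cost of extra columns.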
Say a university conducted a survey to find out whether its students are comfortable with online lectures. The data that we have now is…
Web scraping is the process of extracting large amounts of unstructured data from websites and storing it in a structured format in a file or database of your choice. We’ll see how it’s done in this blog.
So how do you scrape data from the web?
Have you ever copied and pasted information from websites?
If yes, I would say you’ve already performed web scraping in a way. But you can’t really copy and paste a hundred times or more, can you?
So let’s see how Python helps us do the same with one of its packages: BeautifulSoup.
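As a small sketch, BeautifulSoup can pull specific elements out of an HTML document. The HTML string below is a hypothetical stand-in for a downloaded page, so the example runs without any network access; on a real site you would first fetch the page and then pass its text to BeautifulSoup in the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded web page.
html = """
<html><body>
  <h1>Course list</h1>
  <ul>
    <li class="course">Statistics</li>
    <li class="course">Machine Learning</li>
  </ul>
</body></html>
"""

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Extract the heading text and every list item tagged with class "course".
title = soup.find("h1").get_text(strip=True)
courses = [li.get_text(strip=True) for li in soup.find_all("li", class_="course")]

print(title, courses)
```

The extracted values are ordinary Python strings and lists, so they can be written to a CSV file or a database from here.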
Data preprocessing / data cleaning / data wrangling is a ritual that every data scientist has to perform before the data is fed to any machine learning model. In this blog we’ll look into some simplified steps for preprocessing our data.
1. Finding and handling missing values
2. Data Formatting
3. Data Normalization
4. Data Binning / Converting Numerical data to Categorical Data
5. Converting Categorical data to Numerical data
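The five steps above can be sketched with Pandas roughly as follows. The DataFrame, its column names, the bin edges and the choice of mean imputation and min-max scaling are all assumptions made for this example; they are not necessarily the choices made in the post.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "age": [22, 35, None, 58],
    "salary": ["30000", "52000", "41000", "75000"],  # numbers stored as text
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
})

# 1. Missing values: fill the gap with the column mean (one of several options).
df["age"] = df["age"].fillna(df["age"].mean())

# 2. Data formatting: convert the text column to a numeric type.
df["salary"] = df["salary"].astype(int)

# 3. Data normalization: min-max scaling to the range [0, 1].
salary_range = df["salary"].max() - df["salary"].min()
df["salary_scaled"] = (df["salary"] - df["salary"].min()) / salary_range

# 4. Data binning: turn a numerical column into labelled ranges.
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"]
)

# 5. Categorical to numerical: one-hot encode the city column.
df = pd.get_dummies(df, columns=["city"], dtype=int)

print(df)
```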
First, know your data! Explore it.
Click on the link below to read my previous blog and get familiar with some basic Pandas functions.
If you notice I’ve used…
Pandas is a Python package widely used to work with structured data.
In this blog, we will discuss some of the very useful methods in Pandas for analyzing, transforming and generating basic statistics from the data. We will be using a dataset from Kaggle named Insurance_Dataset.
Let’s start with importing the Pandas library.
import pandas as pd
Now, let’s read the dataset into a Pandas dataframe.
A Pandas DataFrame is a tabular data structure with labelled axes (rows & columns).
The read_csv() method is used to read a comma-separated file into Pandas. If the data is not separated by commas…
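As a small, self-contained sketch, `read_csv()` accepts a `sep` argument for files that use a different delimiter. The in-memory strings and column names below are hypothetical stand-ins for files on disk and are not taken from the Kaggle dataset.

```python
import io

import pandas as pd

# Hypothetical file contents, invented for this example.
comma_text = "name,score\nAsha,81\nRavi,74\n"
semicolon_text = "name;score\nAsha;81\nRavi;74\n"

# The default delimiter is a comma.
df_comma = pd.read_csv(io.StringIO(comma_text))

# For any other delimiter, pass it through the sep parameter.
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(df_comma)
print(df_semicolon)
```

Both calls produce the same two-column DataFrame; with a real file you would pass its path instead of the `io.StringIO` object.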
Student at NMIMS • Data Science enthusiast!