Keep Your Data Clean

Akshada Gaonkar · The Startup · May 23, 2020

Data preprocessing, also called data cleaning or data wrangling, is a ritual that every data scientist has to perform before the data is used in any machine learning model. In this blog we'll look at some simplified steps for preprocessing our data:

1. Finding and handling missing values
2. Data Formatting
3. Data Normalization
4. Data Binning / Converting Numerical data to Categorical Data
5. Converting Categorical data to Numerical data

Firstly, know your data! Explore it.
Need help? Read my previous blog to get familiar with some basic Pandas functions.

Note that I've used the same insurance.csv dataset (with a few changes in it) to illustrate the examples in this blog.

Let's get started!
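Throughout this post we'll work on a Pandas DataFrame named df. A minimal setup sketch, assuming insurance.csv sits in the working directory:

import numpy as np
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv("insurance.csv")
df.head()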

Dealing with missing values

Missing values could be represented by ‘?’, ‘N/A’, 0 or just a blank cell. Such missing data can affect machine learning models and cause misleading results.
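Keep in mind that pandas only recognizes proper NaN values as missing, so placeholder strings have to be converted first. A quick sketch, assuming ‘?’ and ‘N/A’ appear as placeholders in the raw file:

# Convert placeholder strings to NaN so pandas can detect them as missing
df = df.replace({"?": np.nan, "N/A": np.nan})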

So how can we deal with these missing values?

The first option is to check with the data collection source whether it's possible to find out what the actual value should be.

If you still have missing values to deal with, another option is to simply get rid of them. Drop them! You can either drop the whole variable or just a single data entry with the missing value.

  • If there aren’t too many missing values, it’s better to drop only the affected entries; dropping the whole variable would waste a lot of data, which is definitely not good practice.
  • Always make sure that the data you’re removing has the least impact on your target variable.

To avoid wasting data, we can consider the third option, i.e. replacing missing data with the help of data that’s already available. This will reduce the accuracy to some extent, as it’s just a guess at what the value should be.

And finally, there might be times when you have to just leave the missing entries missing.

Let’s now see how we can actually drop or replace values in our dataset.

df.isnull().sum()   # number of missing values in each column

We’ll now drop the data entries where the value of the feature smoker is missing.

df["smoker"].dropna(axis = 0, inplace = True)

Next, we have 14 missing values for bmi. Let’s try replacing the missing entries with the average of the available values of the feature.

df["bmi"].fillna(df["bmi"].mean(), inplace = True)

Now we are left with the feature region. Region is of object type, so we obviously can’t take its average. In this case, we’ll replace the missing values with the most frequent value of region (which can be found easily using the mode() method).

df["region"].mode()
df["region"].fillna("southeast", inplace = True)

Let’s see if we have any more missing values.
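Re-running the same check as before:

df.isnull().sum()   # every column should now show 0 missing values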

No more missing values. We’re good to go.

Data Formatting

Data is usually collected from different sources by different people and stored in different formats. Converting this data into a common standard of expression, i.e. making it consistent, makes the job easier. For example, the value ‘female’ can be entered as ‘f’, ‘F’ or ‘Female’ by different people, so changing it into one consistent value is important.
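A minimal sketch of such standardization, assuming the raw sex column contains variants like ‘f’, ‘F’ and ‘Female’:

# Map all spelling variants to one consistent label
df["sex"] = df["sex"].replace({"f": "female", "F": "female", "Female": "female"})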

Next, use the dtypes attribute to check if each column has the right data type. In case there’s a wrong data type, the astype() method can be used to convert data from one type to another.

For example, in our dataset, the variable children is of type float, and we know the number of children will never have a decimal value.

df.dtypes   # check the data type of each column
df["children"] = df["children"].astype(int)   # astype returns a copy, so assign it back

Data Normalization

Normalization means bringing the ranges of different variables onto a consistent scale. It is necessary for conducting a fair comparison between different values, and it makes statistical analysis easier.

For example, in the dataset we’re using, bmi ranges from 15.96 to 53.13 and children ranges from 0 to 3. Say our target variable is charges. While implementing various machine learning algorithms, bmi will influence the result more, but that doesn’t mean children can’t be an important factor. Normalization will ensure that both variables have a similar influence on the result.

But how can we normalize huge amounts of data? Let’s see.

In this blog, we’ll discuss the 3 most common techniques:

  • Simple feature scaling
    Each value is divided by the maximum value of that feature. The resulting values will range between 0 and 1.
df["bmi"] = df["bmi"]/df["bmi"].max()
  • Min-Max Normalization
    The minimum value is subtracted from each value of that feature and the result is then divided by the range of values. Like the previous method, values will now range between 0 and 1.
df["bmi"] = (df["bmi"] - df["bmi"].min())/(df["bmi"].max() - df["bmi"].min())
  • Z-score or Standard score
    The average (μ) of the feature is subtracted from each value and the result is then divided by the standard deviation (σ). Values resulting from this technique usually hover around 0, with a typical range of -3 to +3.
df["bmi"] = (df["bmi"] - df["bmi"].mean())/df["bmi"].std()

Data Binning

Binning is the grouping of values into bins to gain a better understanding of the data. It can be done to group numerical values into smaller bins or to convert numeric data to categorical data.

In our dataset, the variable charges ranges from around 1120 to 63770.
For a better representation and understanding, we can categorize charges into 3 categories or bins, so that we know who is charged in the low, medium, and high range.

To divide the variable’s data into 3 parts, we need 4 equidistant numbers as cut points.
This can be achieved using the linspace() method of numpy.

bins = np.linspace(df["charges"].min(), df["charges"].max(), 4)

Next create a list of the group names, i.e. Low, Medium, High.

group_names = ["Low", "Medium", "High"]

Then, we can use the cut() method provided by pandas to segment and sort the values in charges into the required bins. Note that cut() excludes the lowest edge by default, so we pass include_lowest = True to keep the minimum charge in the first bin.

df["binned_charges"] = pd.cut(df["charges"], bins, 
labels = group_names)

You can then visualize the data and figure out how many people fall into each range.
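For instance, a quick way to count the entries in each bin (one possible sketch; a bar chart works just as well):

df["binned_charges"].value_counts()   # number of people in each charge range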

Converting categorical variables to numerical values

Most machine learning models don’t accept the object/string data type, so it becomes necessary for us to convert these values into numbers.
This can be done by adding dummy variables for each of the unique values in the feature we would like to convert. Too complicated to understand? Follow the explanation:

You can see that our dataset contains a feature named smoker with 2 values (yes & no). So in this case we can create 2 dummy features, named after the values in the original smoker feature, and then assign 1 or 0 to each entry in the new features depending on whether that value occurs in the original one. This method is known as One-Hot Encoding.

In Pandas, converting categorical variables to dummy variables is very easy with the help of the get_dummies() function.

sex_num = pd.get_dummies(df["sex"]) 
sex_num.head()
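To actually use the dummy columns in a model, you’d typically join them back to the DataFrame and drop the original column. A hypothetical follow-up step (not covered in the walkthrough above):

# Attach the dummy columns and drop the now-redundant sex column
df = pd.concat([df, sex_num], axis = 1).drop(columns = ["sex"])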

Thank you for reading! Hope this post helped you understand the basics behind data preprocessing.

LinkedIn: https://www.linkedin.com/in/akshada-gaonkar-9b8886189/
