[AI from Scratch] Episode 220: Practical Data Preprocessing — Handling Missing Values and Feature Scaling

Recap and Today’s Theme

Hello! In the previous episode, we explored building and evaluating basic machine learning models using Python’s Scikit-learn library. By learning the steps of data splitting, model training, and evaluation, you should now understand the fundamental flow of machine learning.

Today, we will focus on data preprocessing, an essential step in building machine learning models. Preprocessing covers handling missing values, scaling features, and otherwise shaping the dataset into a form the model can learn from. Done well, it can noticeably improve model performance and prediction accuracy. Let’s go through the basics of data preprocessing!

What Is Data Preprocessing?

Data preprocessing involves transforming and shaping the data to make it suitable for a machine learning model to learn effectively. It typically includes the following steps:

  1. Handling Missing Values: Filling or removing missing data to make the dataset easier for the model to work with.
  2. Feature Scaling: Bringing features onto comparable scales so that no feature dominates simply because of its units.
  3. Encoding Categorical Variables: Converting categorical data into numerical form.
  4. Splitting the Data: Dividing the dataset into training and testing sets.

In this episode, we will focus on two important aspects: handling missing values and feature scaling.

1. Handling Missing Values

Datasets often contain missing values. Many Scikit-learn estimators reject input that contains NaN, and careless handling can bias the model or reduce its accuracy. We will use Pandas to inspect and handle missing values.

Preparing a Dataset

First, let’s create a sample dataset using Pandas.

import pandas as pd
import numpy as np

# Creating a sample dataset
data = {
    'Age': [25, np.nan, 35, 40, np.nan],
    'Salary': [50000, 60000, np.nan, 80000, 90000],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan]
}

df = pd.DataFrame(data)
print(df)

The above dataset contains missing values in the Age, Salary, and Gender columns.

Checking for Missing Values

First, check how many missing values are present in the DataFrame.

# Checking for missing values
print(df.isnull().sum())

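The isnull() method returns a True/False mask marking each missing cell, and sum() counts the True values per column. For the sample data, this reports 2 missing values in Age, 1 in Salary, and 1 in Gender.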
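
As a minimal sketch, the same check can be extended to show the share of missing entries per column, which is often easier to judge than raw counts. The name `missing_report` is illustrative, and the data simply repeats the sample dataset so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same sample data as above, repeated so this snippet is self-contained
df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40, np.nan],
    'Salary': [50000, 60000, np.nan, 80000, 90000],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan],
})

# isnull() gives a boolean mask; sum() counts True values per column,
# and mean() gives the fraction of True values per column
missing_report = pd.DataFrame({
    'missing_count': df.isnull().sum(),
    'missing_ratio': df.isnull().mean(),
})
print(missing_report)
```

A column with a high missing ratio may be a candidate for dropping, while a low ratio usually favors imputation.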

Imputing Missing Values

There are several ways to fill in missing values:

  1. Mean or Median Imputation: For numerical data, fill in missing values with the mean or median of the column.
  2. Mode Imputation: For categorical data, fill in missing values with the most frequent value (mode).
  3. Specific Value Imputation: Fill missing values with a specific value, like 0.
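
# Imputing missing values with the mean (for the 'Age' column).
# Assigning the result back avoids chained in-place modification,
# which recent pandas versions warn about and may not apply to df.
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Imputing missing values with the mode (for the 'Gender' column)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])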

# Displaying the results
print(df)
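
Scikit-learn offers the same strategies through `SimpleImputer`, which is convenient when the imputation has to be reused on new data. The following is a minimal sketch under the assumption that median imputation suits the numeric columns; `num_cols`, `cat_cols`, and `df_imputed` are illustrative names, and the data repeats the sample dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40, np.nan],
    'Salary': [50000, 60000, np.nan, 80000, 90000],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan],
})

num_cols = ['Age', 'Salary']   # numeric columns: fill with the median
cat_cols = ['Gender']          # categorical column: fill with the most frequent value

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

df_imputed = df.copy()
df_imputed[num_cols] = num_imputer.fit_transform(df[num_cols])
# Cast to object so the imputer receives plain Python strings and NaN
df_imputed[cat_cols] = cat_imputer.fit_transform(df[cat_cols].astype(object))

print(df_imputed)
```

Because the fitted imputers remember the statistics they learned, calling `transform()` on later data fills gaps with the same values.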

Dropping Missing Values

In some cases, such as when only a few rows are affected or a column is mostly empty, it may be more effective to remove the rows or columns with missing values.

# Dropping rows with missing values
df_dropped = df.dropna()

# Dropping columns with missing values
df_dropped_cols = df.dropna(axis=1)

The dropna() method returns a new DataFrame with the rows (the default) or columns (axis=1) that contain missing values removed. The original df is left unchanged.
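
Dropping can also be made more selective. The sketch below applies dropna() to a fresh copy of the raw sample data and uses its `subset` and `thresh` parameters; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40, np.nan],
    'Salary': [50000, 60000, np.nan, 80000, 90000],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan],
})

# Drop every row that has at least one missing value
complete_rows = df.dropna()

# Drop rows only when 'Salary' is missing
rows_with_salary = df.dropna(subset=['Salary'])

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)

# On this raw sample every column has a gap, so axis=1 removes all columns
no_columns_left = df.dropna(axis=1)

print(len(complete_rows), len(rows_with_salary), len(mostly_complete))
print(no_columns_left.shape)
```

On small datasets like this one, dropping every incomplete row discards most of the data, which is why imputation is often preferred.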

2. Feature Scaling

Feature scaling is crucial, especially for algorithms based on distance (e.g., k-nearest neighbors or SVM). If features have very different scales, the features with larger numeric ranges dominate the distance calculation and model performance may degrade, so it’s important to standardize or normalize them.
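
To illustrate why scale matters for distance-based methods, here is a small sketch with made-up numbers (the three points are hypothetical, not taken from the dataset). Each point is (age in years, salary).

```python
import numpy as np

a = np.array([25.0, 50000.0])
b = np.array([60.0, 50500.0])  # 35 years older than a, salary differs by 500
c = np.array([26.0, 52000.0])  # 1 year older than a, salary differs by 2000

# Euclidean distance on the raw, unscaled features
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

# Share of the squared distance a-b that comes from the salary feature
salary_share_ab = (a[1] - b[1]) ** 2 / dist_ab ** 2

print(dist_ab, dist_ac, salary_share_ab)
```

On raw values, b counts as closer to a than c does, even though b differs by 35 years in age: the salary differences are numerically much larger, so they decide the result almost entirely.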

Methods for Scaling

  1. Standardization: Transforms data to have a mean of 0 and a standard deviation of 1 using StandardScaler.
  2. Normalization: Scales data to a range of 0 to 1 using MinMaxScaler.
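
Both methods are simple column-wise formulas: standardization computes (x - mean) / std, and min-max normalization computes (x - min) / (max - min). The sketch below, using a small illustrative array, computes them by hand with NumPy and checks that the results match Scikit-learn's scalers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative data: columns are (age, salary)
X = np.array([
    [25.0, 50000.0],
    [35.0, 60000.0],
    [40.0, 80000.0],
])

# Standardization: subtract the column mean, divide by the column standard deviation
# (NumPy's default std uses the population formula, as StandardScaler does)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: map the column minimum to 0 and the maximum to 1
mm_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(z_manual, StandardScaler().fit_transform(X)))
print(np.allclose(mm_manual, MinMaxScaler().fit_transform(X)))
```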

Scaling with Scikit-learn

Below is the code for scaling data using StandardScaler and MinMaxScaler.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

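# Selecting features (numerical data only).
# The scalers ignore NaN when fitting and pass it through unchanged,
# so fill any values that are still missing (here with the column median) first.
X = df[['Age', 'Salary']]
X = X.fillna(X.median())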

# Standardization
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print("Standardized Data:\n", X_standardized)

# Normalization
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
print("Normalized Data:\n", X_normalized)

  • StandardScaler(): Standardizes each feature to a mean of 0 and a standard deviation of 1, so that all features are on a comparable scale.
  • MinMaxScaler(): Rescales each feature to the range 0 to 1 based on its minimum and maximum. Because it depends on those two values, it is sensitive to outliers.
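
When the data is split into training and test sets, the scaler should learn its statistics from the training set only and then be applied unchanged to the test set; otherwise information from the test set leaks into training. A minimal sketch with randomly generated, purely illustrative data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative random data: 100 samples of (age, salary) and a binary label
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(20, 60, size=100),
    rng.uniform(30000, 100000, size=100),
])
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # close to 0 for each feature
print(X_test_scaled.mean(axis=0))   # generally not exactly 0, which is expected
```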

Encoding Categorical Variables

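Categorical variables (e.g., gender or location) need to be converted into numerical form. Below, LabelEncoder maps each category to an integer, and pd.get_dummies() performs one-hot encoding. Note that Scikit-learn documents LabelEncoder as intended for target labels; for input features, one-hot encoding is usually the safer choice, because integer codes imply an order that does not exist.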

from sklearn.preprocessing import LabelEncoder

# Label Encoding
le = LabelEncoder()
df['Gender_encoded'] = le.fit_transform(df['Gender'])
print(df)

# One-Hot Encoding (dtype=int yields 1/0 columns; recent pandas defaults to True/False)
df_encoded = pd.get_dummies(df, columns=['Gender'], dtype=int)
print(df_encoded)

  • LabelEncoder: Encodes categories as integers. Classes are sorted alphabetically, so here Female → 0 and Male → 1.
  • get_dummies(): Applies one-hot encoding, creating a new column for each category, represented with 1s and 0s.
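
Scikit-learn's OneHotEncoder does the same job as a fitted transformer, which helps when new data may contain categories that were not seen during fitting. This is a minimal sketch; the `train` and `new` frames and the 'Other' category are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})
new = pd.DataFrame({'Gender': ['Female', 'Other']})  # 'Other' was never seen in train

# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Gender']])

# The default output is a sparse matrix; toarray() converts it to a dense NumPy array
encoded_new = encoder.transform(new[['Gender']]).toarray()

print(encoder.categories_)  # categories learned from the training data
print(encoded_new)
```

The columns follow the order in `encoder.categories_`, so the 'Female' row gets a 1 in the first column and the unseen 'Other' row is all zeros.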

Summary

In this episode, we covered the basics of data preprocessing, focusing on handling missing values and feature scaling, with a brief look at encoding categorical variables. Proper preprocessing helps the model learn from the data more effectively, which in turn improves prediction accuracy. A solid grasp of preprocessing is essential for building more sophisticated models.

Next Episode Preview

Next time, we will implement a simple linear regression model and use it to make predictions from data!


Annotations

  • Standardization: A method that scales data to have a mean of 0 and a standard deviation of 1, ensuring consistent scales across features.
  • Normalization: A method that scales data to fit within a range of 0 to 1, making all features fall within the same range.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.