Lesson 126: Scaling Numerical Data

Recap: Preprocessing Text Data

In the previous lesson, we explored tokenization, stemming, and lemmatization, three techniques for preprocessing text data. These methods convert natural language into a form that machine learning models can work with, and careful preprocessing generally improves model performance.

Today, we will discuss scaling numerical data, a technique that adjusts the range of numerical features and can noticeably affect both the training efficiency and the accuracy of machine learning models. We will look closely at two common methods: Min-Max Scaling and Standardization.


What is Scaling Numerical Data?

Scaling numerical data refers to the process of adjusting the range and spread of numerical features so that they sit on a comparable scale. The purpose is to improve model performance by treating features measured in different ranges or units uniformly, rather than letting features with large numeric values dominate.

Example: Understanding Scaling Numerical Data

Scaling numerical data is like standardizing shoe sizes. Different brands may have slight variations in size measurements, but by unifying them, people can choose shoes from any brand with confidence. Similarly, scaling ensures that data with different ranges can be handled consistently.
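
To make the effect of differing ranges concrete, here is a minimal sketch using three hypothetical customers described by age and annual income. The data, the `euclidean` helper, and the `min_max_columns` helper are illustrative assumptions, not part of any library. Before scaling, the distance between customers is determined almost entirely by income, because its numeric range is far larger than that of age.

```python
import math

# Hypothetical customers: (age in years, annual income in dollars).
a = (25, 50_000)
b = (60, 51_000)  # 35 years older than a, income differs by 1,000
c = (26, 52_000)  # 1 year older than a, income differs by 2,000


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def min_max_columns(rows):
    """Rescale each column (feature) of `rows` to [0, 1] independently."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        scaled_columns.append([(v - lo) / (hi - lo) for v in col])
    return [tuple(row) for row in zip(*scaled_columns)]


# On the raw data, income dominates: b looks much closer to a than c does.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# After scaling, age and income contribute on a comparable footing.
sa, sb, sc = min_max_columns([a, b, c])
scaled_ab = euclidean(sa, sb)
scaled_ac = euclidean(sa, sc)

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

With only three points this is a toy illustration, but it shows why distance-based methods in particular tend to benefit from scaling.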


What is Min-Max Scaling?

Min-Max Scaling is a method that transforms data into a range between 0 and 1. It sets the minimum value of the data to 0 and the maximum value to 1, positioning all other data points proportionally within this range. Because the transformation is linear, it preserves the shape of the distribution (the relative spacing between points) while putting every feature on the same scale.

Formula for Min-Max Scaling

Min-Max scaling is calculated using the following formula:

\[
x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
\]

Where:

  • \( x \) is the original value,
  • \( x_{\text{min}} \) is the minimum value in the dataset,
  • \( x_{\text{max}} \) is the maximum value in the dataset,
  • \( x' \) is the scaled value.

Example: Min-Max Scaling in Practice

Consider the following dataset:

Value
10
20
30
40
50

Applying Min-Max scaling results in:

Value | Scaled Value
10 | 0.0
20 | 0.25
30 | 0.5
40 | 0.75
50 | 1.0

This transformation ensures all values fall within the 0 to 1 range, maintaining the original distribution.
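
The calculation above can be sketched in a few lines of plain Python. The function name `min_max_scale` is an illustrative choice, and the handling of a constant dataset (where the maximum equals the minimum) is an assumption, since the formula itself is undefined in that case.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        # Returning zeros is one possible convention (an assumption here).
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


data = [10, 20, 30, 40, 50]
scaled = min_max_scale(data)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```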

Advantages and Disadvantages of Min-Max Scaling

Advantages

  • Easy to Implement: Min-Max scaling is simple and computationally lightweight.
  • Preserves Distribution: It maintains the original distribution of the data while adjusting the range.

Disadvantages

  • Sensitive to Outliers: If the dataset contains outliers, they can significantly skew the range, causing other data points to be compressed.
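
A small sketch of this sensitivity, assuming the earlier example data with a single hypothetical outlier (1000) appended. The helper `min_max_scale` is illustrative and restated here so the snippet is self-contained.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


# The earlier example data plus one extreme value.
data_with_outlier = [10, 20, 30, 40, 50, 1000]
scaled = min_max_scale(data_with_outlier)

# The outlier maps to 1.0, while the five ordinary values
# are squeezed into roughly the bottom 4% of the range.
print([round(v, 3) for v in scaled])  # [0.0, 0.01, 0.02, 0.03, 0.04, 1.0]
```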

Example: Understanding Min-Max Scaling

Min-Max scaling is like stretching dough evenly. Just as dough is stretched while maintaining its shape, Min-Max scaling adjusts the range while keeping the distribution intact.


What is Standardization?

Standardization transforms data so that it has a mean of 0 and a standard deviation of 1. This method centers the data around zero and scales it based on variability, making it suitable for handling different units or ranges.

Formula for Standardization

Standardization is calculated using the following formula:

\[
x' = \frac{x - \mu}{\sigma}
\]

Where:

  • \( x \) is the original value,
  • \( \mu \) is the mean of the dataset,
  • \( \sigma \) is the standard deviation of the dataset,
  • \( x' \) is the standardized value.

Example: Standardization in Practice

Consider the following dataset:

Value
10
20
30
40
50

The mean is 30, and the population standard deviation (dividing the sum of squared deviations by n = 5) is √200 ≈ 14.14. Applying standardization yields:

Value | Standardized Value
10 | -1.41
20 | -0.71
30 | 0.0
40 | 0.71
50 | 1.41

The values are now centered around zero with a consistent spread.
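
The same calculation can be sketched with Python's standard `statistics` module. The function name `standardize` is an illustrative choice. Note the assumption that the population standard deviation (`statistics.pstdev`) is used, which matches the value 14.14 above; the sample standard deviation (`statistics.stdev`) would give about 15.81 and slightly different results.

```python
import statistics


def standardize(values):
    """Transform values using x' = (x - mean) / std (population std)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    # Note: sigma is 0 for a constant dataset, which this sketch does not handle.
    return [(x - mu) / sigma for x in values]


data = [10, 20, 30, 40, 50]
z = standardize(data)
print([round(v, 2) for v in z])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```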

Advantages and Disadvantages of Standardization

Advantages

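  • Less Dependent on Extreme Values: The mean and standard deviation are computed from all data points rather than from the minimum and maximum alone, so a single extreme value typically distorts the result less than it does in Min-Max scaling. Outliers still affect both statistics, however, so standardization is not immune to them. It also works well with data having different ranges.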
  • Suitable for Normally Distributed Data: It is particularly effective when the data follows a normal distribution.

Disadvantages

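  • No Fixed Output Range: Standardized values are not confined to a bounded interval such as 0 to 1, which makes the method unsuitable when an algorithm expects inputs in a fixed range. Like Min-Max scaling, it is a linear transformation, so it shifts and rescales the data but does not change the shape of its distribution.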
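
To see how an outlier affects standardization, the following minimal sketch standardizes the earlier example data with one extreme value (1000) appended. It uses Python's standard `statistics` module and the population standard deviation; the `standardize` helper and the data are illustrative assumptions. With so few points the effect is exaggerated: the outlier pulls the mean up and inflates the standard deviation, and the results are not confined to a fixed interval.

```python
import statistics


def standardize(values):
    """Transform values using x' = (x - mean) / std (population std)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


# The earlier example data plus one extreme value.
data_with_outlier = [10, 20, 30, 40, 50, 1000]
z = standardize(data_with_outlier)

# The ordinary values all land slightly below zero, close together,
# while the outlier sits more than two standard deviations above the mean.
print([round(v, 2) for v in z])
print("spread of ordinary values:", round(max(z[:5]) - min(z[:5]), 2))
```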

Example: Understanding Standardization

Standardization is like expressing test scores as the number of standard deviations above or below the class average. Regardless of the test, two scores that sit the same number of standard deviations from their respective averages represent equivalent relative performance. Similarly, standardization unifies the spread of data.


When to Use Min-Max Scaling vs. Standardization

Choosing between Min-Max scaling and standardization depends on the nature of the data:

When to Use Min-Max Scaling

  • When data has a fixed range.
  • When there are few outliers.
  • When you need values within a bounded interval such as 0 to 1.

When to Use Standardization

  • When the dataset contains outliers that would otherwise stretch the Min-Max range.
  • When the data is close to a normal distribution.
  • When the machine learning model requires standardized input (e.g., support vector machines, logistic regression).

Conclusion

In this lesson, we explored scaling numerical data. Min-Max Scaling rescales data into the 0 to 1 range while preserving the distribution, making it a straightforward and efficient method. Standardization, on the other hand, centers the data at zero with a unit standard deviation, making it effective for handling features with different ranges or units. Proper scaling can noticeably improve the performance of machine learning models that are sensitive to feature scale.


Next Topic: Feature Engineering

In the next lesson, we will discuss Feature Engineering: techniques for creating new features from existing data to improve model performance.


Notes

  1. Min-Max Scaling: A method that scales data to a range of 0 to 1.
  2. Standardization: A technique that transforms data to have a mean of 0 and a standard deviation of 1.
  3. Standard Deviation: A measure of data spread, calculated as the square root of variance.
  4. Normal Distribution: A symmetrical distribution centered around the mean, resembling a bell curve.
Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
