
Data Standardization and Normalization (Learning AI from scratch : Part 23)


Recap of Last Time and Today’s Topic

Hello! In the last session, we learned how to detect and handle outliers in datasets. Properly addressing outliers improved the accuracy and reliability of our models. Today, we will cover an important step in preparing data for model training: data standardization and normalization.

Standardization and normalization are techniques used to bring features with different units or scales onto a common scale. When datasets contain variables measured in different units or over very different ranges, models may struggle to learn effectively. Standardization and normalization prevent this by ensuring that all features are on a comparable scale. Let’s dive into these concepts and explore their applications.

What is Data Standardization?

Definition of Standardization

Standardization refers to transforming data so that it has a mean of 0 and a standard deviation of 1. This ensures that variables with different scales can be compared in a uniform way.

For example, consider a dataset containing both height (in centimeters) and weight (in kilograms). Since these two features have different units, the scales will differ significantly. By standardizing the data, these features can be evaluated equally in the model.

How to Calculate Standardization

Standardizing data is done using the following formula:

Standardized value = (Original value - Mean) / Standard deviation

For instance, if the average height in a dataset is 170 cm with a standard deviation of 10 cm, the standardized value of a height of 180 cm would be:

Standardized value = (180 - 170) / 10 = 1

This result indicates that a height of 180 cm is 1 standard deviation above the mean.
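To make the calculation concrete, here is a minimal Python sketch. The mean and standard deviation are simply the figures from the example above, not statistics computed from a real dataset:

    # Figures taken from the example: mean 170 cm, standard deviation 10 cm
    mean_height = 170.0
    std_height = 10.0

    height = 180.0
    standardized = (height - mean_height) / std_height
    print(standardized)  # 1.0 -> one standard deviation above the mean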

Benefits and Considerations of Standardization

One of the key benefits of standardization is that it balances the impact of different features, ensuring the model doesn’t give undue weight to features with larger numerical values. This is especially important in algorithms like linear regression or support vector machines (SVM), where the scale of the data directly affects performance.

However, there are some considerations. For example, outliers can skew the mean and standard deviation, affecting the standardization process. Therefore, it’s recommended to handle outliers before standardizing the data.
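To illustrate why this matters, the following sketch uses made-up height values and adds a single extreme entry. Notice how sharply the mean and standard deviation shift, which in turn distorts every standardized value:

    import numpy as np

    heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
    with_outlier = np.append(heights, 300.0)  # one erroneous, extreme entry

    print(heights.mean(), heights.std())            # 170.0, about 7.1
    print(with_outlier.mean(), with_outlier.std())  # about 191.7, about 48.9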

What is Data Normalization?

Definition of Normalization

Normalization involves rescaling data to fit within a specific range, typically between 0 and 1. This method ensures that all features have the same scale, allowing the model to treat each feature equally.

For example, in a dataset with income (in millions of yen) and age (in years), income values will be much larger than age values, leading the model to focus more on income. By normalizing the data, we can ensure that all features are on the same scale and the model can learn from them fairly.

How to Calculate Normalization

Normalization can be done using the following formula:

Normalized value = (Original value - Minimum value) / (Maximum value - Minimum value)

For instance, if income values in a dataset range from 2 million to 10 million yen, the normalized value of 6 million yen would be:

Normalized value = (6 - 2) / (10 - 2) = 0.5

This indicates that 6 million yen is in the middle of the data’s range.
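As before, a minimal Python sketch reproduces this calculation. The minimum and maximum are the figures from the example, not values computed from real data:

    # Figures taken from the example: incomes range from 2 to 10 (millions of yen)
    min_income = 2.0
    max_income = 10.0

    income = 6.0
    normalized = (income - min_income) / (max_income - min_income)
    print(normalized)  # 0.5 -> exactly halfway through the observed range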

Benefits and Considerations of Normalization

Normalization ensures that all data is processed on the same scale, which helps models such as neural networks perform more reliably, as large differences in scale can lead to instability during training.

However, one consideration is that normalization depends on the range of the data. If new data is introduced, the normalization process must be redone. Additionally, if the data distribution is skewed, normalization can result in values clustered near 0 or 1, which may not be ideal. In such cases, alternative techniques should be considered.
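The dependence on the observed range is easy to see in code: a value that arrives later and falls outside the original minimum and maximum is mapped outside the 0 to 1 range, so the range must be recomputed (or out-of-range values clipped). The numbers below are made up for illustration:

    # Range observed when the scaling was first set up
    min_income = 2.0
    max_income = 10.0

    new_income = 12.0  # a value seen only after the range was fixed
    normalized = (new_income - min_income) / (max_income - min_income)
    print(normalized)  # 1.25 -> outside the 0-1 range, so the scaling must be redone or the value clipped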

When to Use Standardization and Normalization

Standardization and normalization both align the scale of data but are suited to different scenarios:

  • Standardization is best used when the data roughly follows a normal distribution or when algorithms like linear regression or SVM are being applied. It rescales each feature to zero mean and unit variance while maintaining the relative relationships between values.
  • Normalization is most useful when the data spans a wide range or when algorithms like neural networks are used. By compressing data into the 0 to 1 range, normalization helps models train more effectively and efficiently.

These techniques are essential for optimizing model performance, and the choice between them depends on the nature of the data and the algorithms being used.
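In practice, both transformations are usually applied with a library rather than by hand. As one possible sketch, assuming scikit-learn is available, StandardScaler performs standardization and MinMaxScaler performs normalization; the feature values below are made up:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # Made-up feature matrix: height (cm) and weight (kg)
    X = np.array([[160.0, 50.0],
                  [170.0, 65.0],
                  [180.0, 80.0]])

    X_std = StandardScaler().fit_transform(X)   # each column: mean 0, standard deviation 1
    X_norm = MinMaxScaler().fit_transform(X)    # each column: rescaled to the 0-1 range

    print(X_std)
    print(X_norm)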

Practical Example and Case Study

Imagine building a model to predict housing prices. The dataset includes features like house size (in square meters), price (in millions of yen), and building age (in years). Since these features have different units, using them without scaling could lead the model to overly rely on price data.

By standardizing these features, we ensure that all data points are on a comparable scale, allowing the model to consider each feature equally. In cases where a neural network is used, normalization helps ensure stable training and improves the accuracy of the predictions.
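As a rough sketch of how this might look in code (the housing figures and column choices are hypothetical, and scikit-learn is assumed), the scaler is fitted on the training data and then reused for any new data so that both are scaled consistently:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Hypothetical housing data: size (m^2), price (millions of yen), building age (years)
    X_train = np.array([[70.0, 35.0, 10.0],
                        [95.0, 52.0,  3.0],
                        [60.0, 28.0, 25.0]])

    scaler = StandardScaler().fit(X_train)      # learn each column's mean and standard deviation
    X_train_scaled = scaler.transform(X_train)  # every column now has mean 0, standard deviation 1

    # New houses are scaled with the statistics learned from the training data
    X_new = np.array([[80.0, 40.0, 15.0]])
    print(scaler.transform(X_new))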

Coming Up Next

In this session, we discussed the important techniques of data standardization and normalization. By aligning the scale of the data, we allow models to learn more effectively from each feature. In the next session, we will cover categorical variable encoding, which involves converting text data into numerical form for model training. Let’s continue learning together!

Summary

In this session, we explored standardization and normalization, two essential techniques for aligning the scale of data. These methods ensure that models learn equally from all features, leading to more accurate predictions. In the next session, we will discuss encoding techniques for converting categorical variables into numerical data. Stay tuned!


Notes

  • Standardization: A method for transforming data so that the mean is 0 and the standard deviation is 1, putting features on a common scale while preserving the relative relationships between values.
  • Normalization: A method for rescaling data to fit between 0 and 1, ensuring that all features are on the same scale.
  • Support Vector Machine (SVM): A machine learning algorithm used for classification tasks.
  • Neural Networks: A type of machine learning model inspired by the neural circuits of the human brain.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
