MENU

Categorical Variable Encoding (Learning AI from scratch : Part 24)

TOC

Recap of Last Time and Today’s Topic

Hello! Last time, we learned about data standardization and normalization, which help ensure that a model can learn evenly from different features by aligning the data’s scale. Today, we will discuss categorical variable encoding, a technique used to convert text-based information in data into numerical form.

Categorical variables are variables that classify data into specific categories or classes. Examples include gender (male, female) or occupation (engineer, designer, marketer). These types of variables cannot be directly input into machine learning models, so they need to be transformed into numerical data. In this session, we’ll explore how to do that.

What Are Categorical Variables?

Definition of Categorical Variables

Categorical variables are data points that fall into distinct categories or classes. Some common examples include:

  • Gender: Male, Female
  • Occupation: Engineer, Designer, Marketer
  • Location: Tokyo, Osaka, Fukuoka

Since these data points are represented by text, they cannot be directly understood by machine learning models, which require numerical input. Therefore, categorical variables must be converted into numerical data for models to process them effectively.

Types of Categorical Variables

Categorical variables can be divided into two main types:

  • Nominal Variables: Categories that have no inherent order. Example: Colors (Red, Blue, Green)
  • Ordinal Variables: Categories that follow a specific order. Example: Education level (Elementary, Middle, High School, University)

Each type requires a different encoding method to convert it into numerical form.

Methods for Encoding Categorical Variables

Let’s look at some popular methods for converting categorical variables into numerical values:

One-Hot Encoding

One-Hot Encoding transforms categorical variables into binary vectors (0s and 1s). This method is particularly suited for nominal variables.

For example, if you have data on gender (Male, Female), one-hot encoding would convert the data as follows:

  • Male: [1, 0]
  • Female: [0, 1]

In this way, each category is represented as a separate column, and each row contains binary data indicating whether the category is present.

Label Encoding

Label Encoding assigns integers to categorical variables. This approach works best for ordinal variables where the order between categories matters.

For instance, for education levels (Elementary, Middle, High School, University), label encoding would convert the data as follows:

  • Elementary: 0
  • Middle: 1
  • High School: 2
  • University: 3

This method preserves the order of the categories, making it suitable for ordinal data. However, caution is needed when applying it to nominal variables, as it may incorrectly imply a relationship between categories.

Target Encoding

Target Encoding converts categorical variables by replacing them with the mean or median value of the target variable for each category. This method is useful when working with large datasets with many categories.

For example, if you have housing price data for different regions, target encoding could replace each region with its average house price. This technique allows for more compact representation of categorical data, especially when the number of categories is high.

Frequency Encoding

Frequency Encoding uses the frequency of each category in the dataset to replace categorical values. This method works well when certain categories appear more frequently in the data.

For instance, if the dataset includes job titles and “Engineer” appears 50 times, “Designer” 30 times, and “Marketer” 20 times, the data would be encoded as:

  • Engineer: 0.5
  • Designer: 0.3
  • Marketer: 0.2

This method captures the prevalence of each category, which can be useful when frequency is an important factor.

Choosing the Right Encoding Method

When choosing an encoding method for categorical variables, several factors should be considered:

  • Type of Data: Determine whether the data is nominal or ordinal and select an appropriate encoding method accordingly.
  • Number of Categories: For large numbers of categories, one-hot encoding may create too many columns, so alternative methods like target encoding may be more efficient.
  • Compatibility with the Model: The optimal encoding method can vary depending on the machine learning model used. For example, decision trees often work well with label encoding, while linear models like one-hot encoding.

Choosing the right encoding method can directly impact model performance and computational efficiency.

Practical Example and Case Study

For instance, when building a model to predict customer purchase behavior from marketing data, you might need to encode variables like customer occupation and location. Since occupation is nominal, one-hot encoding would effectively represent each job category. If the location data includes a large number of categories, target encoding could be used to optimize model performance by summarizing each location’s contribution to the target variable.

Coming Up Next

Today, we covered how to transform text data into numerical form using categorical variable encoding. Proper encoding allows the model to learn effectively from the data, resulting in more accurate predictions. In the next session, we’ll explore feature selection—the process of selecting the most useful features for improving model performance. Stay tuned!

Summary

In this session, we explored the different techniques for categorical variable encoding. By choosing the right encoding method, models can effectively learn from the data’s characteristics, leading to higher prediction accuracy. Next time, we will discuss feature selection, so look forward to it!


Notes

  • Categorical Variable: Variables that classify data into categories or classes, such as gender or occupation.
  • Nominal Variables: Categories with no inherent order (e.g., colors).
  • Ordinal Variables: Categories with a specific order (e.g., education levels).
  • One-Hot Encoding: A method that converts categorical variables into binary vectors (0s and 1s).
  • Label Encoding: A method that assigns integers to categorical variables, preserving the order.
  • Target Encoding: A method that replaces categories with the average or median value of the target variable for each category.
  • Frequency Encoding: A method that encodes categories based on their frequency in the dataset.
Let's share this post !

Author of this article

株式会社PROMPTは生成AIに関する様々な情報を発信しています。
記事にしてほしいテーマや調べてほしいテーマがあればお問合せフォームからご連絡ください。
---
PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.

Comments

To comment

TOC