Detecting and Handling Outliers (Learning AI from Scratch: Part 22)

Recap of Last Time and Today’s Topic

Hello! In the last session, we learned how to handle missing data in datasets. Missing values are unavoidable in many cases, but by handling them appropriately, we can improve the accuracy of our models. Today, we will explore outliers—how to detect them and how to manage them effectively.

Outliers are values in a dataset that deviate significantly from the rest. These values may result from data collection errors or reflect rare, unusual circumstances. Properly detecting and handling outliers is crucial for maintaining model accuracy and stability.

What Are Outliers?

Definition of Outliers

Outliers refer to data points that differ significantly from other values in a dataset. They can distort the distribution of data and hinder the model’s ability to learn accurately. For example, in income data, a value like “100 million yen” would be considered an outlier if most other values are much lower.

Outliers can result from incorrect data entry, measurement errors, or genuine abnormal events. Because the cause determines the right treatment, they should be investigated rather than ignored.

Causes of Outliers

There are several reasons why outliers may appear:

  • Data collection errors: Malfunctioning sensors or faulty measurement devices may capture inaccurate data.
  • Human input errors: Manual data entry may introduce unusual values.
  • Measurements taken under abnormal conditions: Data recorded during rare or extreme conditions can result in outliers.

The presence of outliers can compromise the reliability of the dataset, making it important to detect and handle them appropriately.

Methods for Detecting Outliers

There are various methods for detecting outliers. Below are some commonly used techniques:

Statistical Methods

Statistical techniques flag values that lie unusually far from the center of the data. The two most common rules are based on the standard deviation and the interquartile range (IQR).

  • Standard Deviation: This method measures how many standard deviations each data point lies from the mean. Points that fall outside a chosen range (typically ±3 standard deviations) are considered outliers. For example, in a dataset with a standard deviation of 2, any point more than 6 units from the mean is flagged as an outlier. The rule works best when the data is roughly normally distributed, and because the mean and standard deviation are themselves pulled by extreme values, it can miss outliers in small samples.
  • Interquartile Range (IQR): This method splits the sorted data into quartiles and calculates the IQR, which is the difference between the third quartile (Q3) and the first quartile (Q1). Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers. For example, if the IQR is 10, any value more than 15 units below Q1 or more than 15 units above Q3 is treated as an outlier.
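Both rules can be sketched in a few lines of Python using only the standard library. The function names, the sample readings, and the choice of the population standard deviation and the "inclusive" quartile method are assumptions made for illustration. Other quartile conventions give slightly different fences.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings: 20 ordinary values and one extreme value.
readings = [48, 49, 50, 51, 52] * 4 + [120]
print(zscore_outliers(readings))  # [120]
print(iqr_outliers(readings))     # [120]

# With only five points, the extreme value inflates the standard deviation
# so much that its own z-score is only 2, and the 3-sigma rule misses it.
small_sample = [10, 11, 12, 13, 500]
print(zscore_outliers(small_sample))  # []
print(iqr_outliers(small_sample))     # [500]
```

The second example shows why the IQR rule is often preferred for small or skewed datasets: quartiles barely move when a single extreme value is added, while the mean and standard deviation shift a lot.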

Visual Methods

Outliers can also be detected visually by plotting the data. Common visualization methods include box plots and scatter plots.

  • Box Plot: A visual representation of data distribution, where outliers appear as points outside the “whiskers.” This method allows you to quickly assess the overall distribution and spot any anomalies.
  • Scatter Plot: In a two-dimensional dataset, scatter plots help identify data points that fall far from the rest, making it easier to spot outliers that deviate from expected relationships or patterns.
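As a rough illustration of both plots, the sketch below uses matplotlib, which is an assumption, since the article does not name a plotting library. The sample values and the height/weight pairs are made up. The box plot uses whiskers of 1.5 × IQR, so the points drawn beyond them match the IQR rule.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up one-dimensional data with one extreme value.
values = [10, 10, 11, 11, 12, 12, 12, 13, 13, 50]

# Made-up height (cm) / weight (kg) pairs; the last pair breaks the trend.
heights = [150, 155, 160, 165, 170, 175, 180, 185, 172]
weights = [50, 54, 58, 63, 67, 72, 76, 81, 140]

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Box plot: points beyond the whiskers are drawn individually as "fliers".
box = ax_box.boxplot(values, whis=1.5)
ax_box.set_title("Box plot")
fliers = box["fliers"][0].get_ydata().tolist()
print("Points outside the whiskers:", fliers)

# Scatter plot: the point far from the height/weight trend stands out.
ax_scatter.scatter(heights, weights)
ax_scatter.set_xlabel("Height (cm)")
ax_scatter.set_ylabel("Weight (kg)")
ax_scatter.set_title("Scatter plot")

fig.tight_layout()
```

In a notebook, drop the `matplotlib.use("Agg")` line to see the figure inline. In a script, call `fig.savefig(...)` with a file name of your choice.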

Machine Learning-Based Methods

More advanced methods involve using machine learning techniques to detect outliers, such as clustering methods and Isolation Forest.

  • Clustering Methods: Data is grouped into clusters, and values that fall far outside their cluster are flagged as outliers. For example, using k-means clustering, data points that are far from the cluster center can be identified as outliers.
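  • Isolation Forest: This is a machine learning model specifically designed to detect outliers. It builds many random decision trees that repeatedly split the data on randomly chosen features and split values. Because outliers are few and different from other data points, they tend to be isolated after fewer splits than ordinary points, and this short path through the trees is what identifies them.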
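Both approaches can be tried with scikit-learn, as in the sketch below. The library choice, the synthetic two-cluster data, and the "mean + 3 standard deviations" distance cutoff are assumptions made for illustration. The article does not prescribe a specific threshold.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

# Synthetic data: two tight clusters plus one point far from both (index 100).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
cluster_b = rng.normal(loc=[20.0, 0.0], scale=1.0, size=(50, 2))
outlier = np.array([[10.0, 15.0]])
X = np.vstack([cluster_a, cluster_b, outlier])

# Clustering approach: distance from each point to its own cluster center.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centers = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(X - centers, axis=1)
cutoff = distances.mean() + 3 * distances.std()  # illustrative threshold
kmeans_flagged = np.flatnonzero(distances > cutoff)
print("Flagged by k-means distance:", kmeans_flagged.tolist())

# Isolation Forest: lower scores mean a point is easier to isolate.
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.score_samples(X)
most_anomalous = int(np.argmin(scores))
print("Most anomalous index according to Isolation Forest:", most_anomalous)
```

In practice, `IsolationForest.predict` returns -1 for points it considers outliers, and the share of points flagged is controlled by the `contamination` parameter.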

Methods for Handling Outliers

Once outliers have been detected, the next step is to decide how to handle them. Here are several common approaches:

Deleting Outliers

The simplest way to handle outliers is to delete them from the dataset. This method is appropriate when the outliers are few and removing them won’t significantly affect the representativeness of the data.

However, caution is necessary. If the dataset contains many outliers, or if the outliers reflect important characteristics of the data, deletion might not be the best option.
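A minimal sketch of deletion, assuming the 1.5 × IQR rule is used to decide which values count as outliers. The function name and the sample readings are hypothetical. Returning the removed values alongside the kept ones makes it easy to check how much data is being discarded before committing to deletion.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Split values into (kept, removed) using the k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lower <= v <= upper]
    removed = [v for v in values if not (lower <= v <= upper)]
    return kept, removed


# Hypothetical readings: 20 ordinary values and one extreme value.
readings = [48, 49, 50, 51, 52] * 4 + [120]
kept, removed = remove_outliers_iqr(readings)
share_removed = len(removed) / len(readings)

print("Removed:", removed)  # Removed: [120]
print(f"Share of data removed: {share_removed:.1%}")
```

If the removed share is large, that is a sign the "outliers" may be a real feature of the data rather than noise.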

Replacing Outliers

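Instead of deleting outliers, you can replace them with more reasonable values, such as the median or mean of the data. The median is usually the safer choice, because the mean is itself pulled toward the outliers being replaced.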

For example, if an unusually high income value is detected, replacing it with the median income helps maintain the overall balance of the dataset while reducing the impact of the outlier.
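The income example might look like the sketch below. The income figures (in millions of yen) are invented, the function name is hypothetical, and the 1.5 × IQR rule is assumed as the way outliers are identified.

```python
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside the k * IQR fences with the overall median."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [v if lower <= v <= upper else median for v in values]


# Invented annual incomes in millions of yen; the last entry is 100 million yen.
incomes = [3.2, 4.1, 4.5, 5.0, 5.3, 5.8, 6.1, 6.4, 7.0, 100.0]
cleaned = replace_outliers_with_median(incomes)

print("Mean before:", statistics.fmean(incomes))
print("Mean after: ", statistics.fmean(cleaned))
print("Replaced value:", cleaned[-1])
```

Unlike deletion, this keeps the number of rows unchanged, which matters when other columns in the same row still carry useful information.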

Using Robust Statistical Methods

Robust statistical methods are designed to build models that are less sensitive to outliers. This allows the model to learn accurately even when outliers are present.

For instance, instead of ordinary least-squares linear regression, which squares every error and therefore lets a few extreme points dominate the fit, you can use robust regression techniques that are less affected by outliers. These methods enable the model to maintain its accuracy without needing to remove or adjust the outliers.
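As one possible illustration, the sketch below compares ordinary least squares with Huber regression from scikit-learn. Huber regression is chosen here only as an example of a robust technique, since the article does not name a specific one, and the data is synthetic: a line with true slope 2 whose last two observations are corrupted.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic data following y = 2x + 1 with small noise.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[-2:] += 100.0  # corrupt the last two observations

X = x.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix

ols_slope = float(LinearRegression().fit(X, y).coef_[0])
huber_slope = float(HuberRegressor().fit(X, y).coef_[0])

print("True slope:  2.0")
print("OLS slope:  ", round(ols_slope, 2))
print("Huber slope:", round(huber_slope, 2))
```

The Huber loss grows only linearly for large errors, so the two corrupted points have limited influence and the fitted slope should stay much closer to the true value than the least-squares slope.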

The Importance of Handling Outliers

Handling outliers is crucial for improving the accuracy and reliability of models. Failing to address outliers can lead to models drawing incorrect conclusions, which can have serious consequences, especially in fields like business and healthcare. Properly detecting and handling outliers optimizes model performance and enhances trust in the results.

Coming Up Next

In this session, we learned how to detect and handle outliers in datasets. By properly addressing outliers, we can maintain the accuracy and reliability of our models. Next time, we will explore data standardization and normalization, methods that align the scale of data values. Let’s continue learning together!

Summary

In this session, we covered the concept of outliers, values that deviate significantly from the rest of the data. We also discussed methods for detecting and handling these outliers to improve model performance. In the next session, we will delve into data standardization and normalization, so stay tuned!


Notes

  • Outliers: Data points that significantly differ from others in the dataset and may negatively affect model learning.
  • Standard Deviation: A measure that indicates how far data points are from the mean. The larger the standard deviation, the greater the data variability.
  • Interquartile Range (IQR): A measure of data spread equal to the range covered by the middle 50% of the data (Q3 minus Q1), commonly used to set cutoffs for detecting outliers.
  • Box Plot: A visual tool used to represent the distribution of data, with outliers appearing as points outside the whiskers.
  • Isolation Forest: A machine learning algorithm specifically designed to detect outliers.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
