[AI from Scratch] Episode 221: Implementing a Simple Regression Model — Linear Regression

Recap and Today’s Theme

Hello! In the previous episode, we explored the basics of data preprocessing, focusing on handling missing values and feature scaling. Proper preprocessing is essential as it prepares the data to be more suitable for the model, improving prediction accuracy. Understanding these preprocessing techniques is crucial for building effective machine learning models.

Today, we will use the preprocessed data to implement a linear regression model. Linear regression is one of the most fundamental models in machine learning, widely used for predicting trends in data. Using Scikit-learn, we’ll build a simple yet practical model. Let’s get started!

What Is Linear Regression?

Linear regression is a technique that models the relationship between input variables (features) and an output variable (the target) as a straight line. For example, if you want to predict house prices based on their size, linear regression uses a formula to express how much the price changes as the size of the house increases.

The formula for linear regression is as follows:

\[
y = w_0 + w_1 \cdot x
\]

  • \( y \): The predicted output (target variable)
  • \( w_0 \): Intercept (bias term)
  • \( w_1 \): Slope (coefficient)
  • \( x \): Input variable (feature)

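This formula is the equation of a straight line, and the goal of linear regression is to find the line (the regression line) that best fits the data. "Best" usually means the line that minimizes the sum of squared differences between the predicted and actual values (ordinary least squares), which is what Scikit-learn's LinearRegression does.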
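To make the idea of a best-fitting line concrete, here is a minimal sketch that computes the intercept and slope by hand with the closed-form least-squares formulas for a single feature. The house sizes and prices are invented toy values, not taken from any dataset used in this episode.

```python
import numpy as np

# Hypothetical toy data, invented for illustration: house sizes (x) and prices (y).
x = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
y = np.array([150.0, 180.0, 240.0, 300.0, 360.0])  # lies exactly on y = 3 * x

# Closed-form least-squares estimates for one feature:
#   w_1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   w_0 = mean(y) - w_1 * mean(x)
x_mean, y_mean = x.mean(), y.mean()
w_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
w_0 = y_mean - w_1 * x_mean

print(f"w_0 = {w_0:.3f}, w_1 = {w_1:.3f}")
```

Because the toy points lie exactly on a line, the recovered slope and intercept match that line. With real, noisy data the same formulas give the line with the smallest sum of squared errors.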

Implementing a Linear Regression Model

Now, let’s implement a linear regression model using Scikit-learn. We’ll proceed through the following steps:

  1. Data Preparation
  2. Data Preprocessing
  3. Splitting the Data
  4. Building and Training the Model
  5. Evaluating and Visualizing the Model

1. Data Preparation

First, we’ll use Pandas to load a sample dataset. Here, we use the Boston housing dataset, loaded through Scikit-learn. This dataset contains information about house prices in the Boston area, with 13 features (e.g., crime rate, average number of rooms, property tax rate).

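import pandas as pd
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import only works with older releases.
from sklearn.datasets import load_boston

# Loading the data
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE'] = boston.target  # Median home value, in thousands of dollars

# Checking the first five rows
print(df.head())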

2. Data Preprocessing

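Next, we preprocess the data. Ordinary least squares linear regression does not strictly require scaling: standardizing a feature changes the size of its coefficient but not the predictions. Scaling matters more for models trained by gradient descent or with regularization, so we apply it here as a good habit. Our model uses RM (average number of rooms per dwelling) as the feature and PRICE (median house price, in thousands of dollars) as the target.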

from sklearn.preprocessing import StandardScaler

# Selecting features and target
X = df[['RM']]  # Number of rooms
y = df['PRICE']  # Price

# Applying scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
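As a quick illustration of what the scaling step computes, the sketch below checks that StandardScaler gives the same result as subtracting the mean and dividing by the standard deviation by hand. The room counts are made-up values, and the names `rooms` and `demo_scaler` are used only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical room counts, invented for illustration.
rooms = np.array([[4.0], [5.0], [6.0], [7.0], [8.0]])

demo_scaler = StandardScaler()
rooms_scaled = demo_scaler.fit_transform(rooms)

# The same computation by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (rooms - rooms.mean(axis=0)) / rooms.std(axis=0)

print(rooms_scaled.ravel())
print(np.allclose(rooms_scaled, manual))
```

After standardization the column has mean 0 and standard deviation 1, which is why the x-axis of the later plot is labeled as standardized.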

3. Splitting the Data

We split the data into training and test sets so that we can evaluate the model on data it has not seen during training.

from sklearn.model_selection import train_test_split

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
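One caveat about the order of the steps: fitting the scaler on the full dataset before splitting lets statistics from the test rows influence preprocessing. A common alternative, sketched below, is to split first and put the scaler and the model in a pipeline so the scaler is fitted on the training rows only. This is a variant, not the code used in this episode, and the data is a synthetic stand-in for the RM and PRICE columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RM/PRICE columns (not the real dataset).
rng = np.random.default_rng(42)
X_raw = rng.uniform(4.0, 9.0, size=(200, 1))
y_raw = 9.0 * X_raw[:, 0] - 35.0 + rng.normal(0.0, 2.0, size=200)

# Split the unscaled data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics when transforming the test rows.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(f"Test R²: {pipeline.score(X_te, y_te):.3f}")
```

For a single feature and plain least squares the difference is small, but the habit pays off with more complex preprocessing and with cross-validation.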

4. Building and Training the Model

Using Scikit-learn’s LinearRegression class, we create a model and train it using the training data.

from sklearn.linear_model import LinearRegression

# Creating and training the model
model = LinearRegression()
model.fit(X_train, y_train)

  • LinearRegression(): Creates a linear regression model.
  • fit(): Trains the model using the training data and target values.
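After fitting, the learned parameters can be read from the model. The sketch below uses synthetic data generated from a known line, so it is easy to see that `intercept_` plays the role of w_0 and `coef_` holds w_1. The names `x_demo`, `y_demo`, and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known line, y = 2 + 5x, plus a little noise.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 1))
y_demo = 2.0 + 5.0 * x_demo[:, 0] + rng.normal(0.0, 0.1, size=100)

demo_model = LinearRegression().fit(x_demo, y_demo)

# intercept_ corresponds to w_0; coef_ has one entry per feature, so coef_[0] is w_1.
print(f"w_0 (intercept): {demo_model.intercept_:.2f}")
print(f"w_1 (slope):     {demo_model.coef_[0]:.2f}")
```

The same two attributes are available on the `model` trained above. Note that its slope is expressed per standard deviation of RM, because the feature was standardized.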

5. Evaluating and Visualizing the Model

To evaluate the model’s performance, we use the test data to make predictions and compare them with the actual values. We also visualize the results with a graph.

import matplotlib.pyplot as plt

# Making predictions with the test data
y_pred = model.predict(X_test)

# Evaluating the model (R² score on the test set)
r2 = model.score(X_test, y_test)
print(f"R² Score: {r2:.3f}")

# Plotting the graph
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.title('Linear Regression: Room Count vs. Price')
plt.xlabel('Number of Rooms (standardized)')
plt.ylabel('Price')
plt.legend()
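plt.show()

  • predict(): Makes predictions for the test data.
  • score(): Returns the R² score (coefficient of determination), the share of the variance in the target that the model explains. A score close to 1 means the predictions track the actual values closely; on test data the score can even be negative when the model does worse than always predicting the mean.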

In the graph, blue dots represent the actual house prices, while red dots represent the predicted prices. This visualization helps us understand how closely the model’s predictions align with the actual data.
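To show what the R² score measures, here is a small sketch that computes it by hand and compares it with Scikit-learn's `r2_score` function. It also computes the root mean squared error (RMSE), an additional metric not used elsewhere in this episode, which reports the typical error in the same units as the target. The actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score as sk_r2_score

# Hypothetical actual and predicted prices, invented for illustration.
y_true_demo = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y_pred_demo = np.array([22.0, 24.0, 29.0, 37.0, 38.0])

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE: square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(f"R² (manual):  {r2_manual:.3f}")
print(f"R² (sklearn): {sk_r2_score(y_true_demo, y_pred_demo):.3f}")
print(f"RMSE:         {rmse:.3f}")
```

Reporting an error metric such as RMSE alongside R² is often helpful, because R² alone does not say how large the errors are in price terms.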

Summary

In this episode, we covered the basic implementation of a linear regression model. Using Scikit-learn, we can build, train, and evaluate a model with just a few lines of code. Linear regression is a foundational model in machine learning and is very effective for understanding relationships in data. Try applying the steps covered today to other datasets and features!

Next Episode Preview

Next time, we will explore logistic regression, a basic model for classification tasks. Learning the basics of classification will open the door to various applications!


Annotations

  • Linear Regression: A method that models the linear relationship between input and output variables, suitable for predicting continuous values.
  • R² Score: The coefficient of determination, a metric that measures how much of the variance in the target the model’s predictions explain; values closer to 1 indicate a better fit.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
