[AI from Scratch] Episode 219: Introduction to Scikit-learn — Basics of a Machine Learning Library

Recap and Today’s Theme

Hello! In the previous episode, we discussed how to create polished, advanced graphs with the data visualization library Seaborn, which makes data distributions and correlations easier to grasp visually.

Today, we will learn the basics of Scikit-learn, a widely used machine learning library for Python. Scikit-learn handles data preprocessing, model building, and evaluation through one consistent interface, and it is used by data scientists and AI engineers from beginners to professionals. Let’s explore its basic operations!

What Is Scikit-learn?

Scikit-learn is a Python library designed for implementing machine learning. It offers several features:

  1. Rich Set of Machine Learning Algorithms: Includes algorithms for regression, classification, clustering, and dimensionality reduction.
  2. Comprehensive Data Preprocessing Tools: Provides tools for handling missing values, scaling, encoding, and splitting data, all essential for building models.
  3. Simple API: Its consistent API design makes it easy to build and train models and to make predictions with them.
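
The consistent API from point 3 means that most estimators expose the same fit and predict methods. The following is a minimal sketch of that idea. It assumes the bundled iris dataset as stand-in data (it is not used elsewhere in this episode), and the hyperparameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: 150 iris flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Three different algorithms, one identical calling pattern
estimators = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

predictions = {}
for est in estimators:
    est.fit(X, y)                            # train on features and labels
    name = type(est).__name__
    predictions[name] = est.predict(X[:5])   # predict the first five samples
    print(name, predictions[name])
```

Swapping one algorithm for another usually means changing only the line that creates the estimator.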

Installing Scikit-learn

Scikit-learn is usually included in Anaconda environments, but if you need to install it manually, use the following command:

pip install scikit-learn
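
One quick way to confirm that the installation worked is to import the package and print its version. Note that the package is installed as scikit-learn but imported as sklearn.

```python
import sklearn

# Print the installed version to confirm the import works
print(sklearn.__version__)
```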

Importing Scikit-learn

First, import the necessary modules.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

With this setup, you are ready to start using Scikit-learn for machine learning.

Basic Machine Learning Workflow with Scikit-learn

The basic steps for using Scikit-learn in machine learning are as follows:

  1. Data Preparation: Load the dataset.
  2. Data Preprocessing: Handle missing values, scale data, and encode variables as necessary.
  3. Data Splitting: Split the data into training and testing sets.
  4. Model Selection and Training: Choose an algorithm and train the model.
  5. Model Evaluation: Evaluate the model’s performance on the test data.

1. Data Preparation

Let’s start by loading a dataset. We will use the diabetes dataset bundled with Scikit-learn as an example and place it in a Pandas DataFrame for easy inspection.

import pandas as pd
from sklearn.datasets import load_diabetes

# Loading the dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

# Displaying the first few rows
print(df.head())

The load_diabetes function loads a dataset of 442 patients with 10 features and one target variable. The target is a quantitative measure of disease progression one year after baseline, so this is a regression problem.
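
As a possible shortcut, load_diabetes can also return Pandas objects directly through its as_frame parameter. The sketch below uses that option to check the shape and column names described above.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of NumPy arrays
diabetes = load_diabetes(as_frame=True)
X = diabetes.data      # DataFrame of features
y = diabetes.target    # Series holding the target

print(X.shape)            # (rows, columns)
print(list(X.columns))    # feature names
print(y.describe())       # summary statistics of the target
```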

2. Data Preprocessing

Prepare the data for the machine learning model by scaling the features.

# Separating features and target variable
X = df.drop('target', axis=1)
y = df['target']

# Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
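  • StandardScaler(): Standardizes each feature to mean 0 and variance 1. This matters most for scale-sensitive models such as k-NN or regularized regression; the predictions of plain linear regression are not affected by it. Note that fit_transform runs here on the full dataset before splitting, so statistics from the test rows leak into the scaling. In practice, fit the scaler on the training split only.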
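
To keep test data out of the scaling statistics, one option is to split first and fit the scaler on the training rows only. The sketch below assumes the same split settings used later in this episode; the variable names (X_tr, X_te, and so on) are chosen here only to avoid clashing with the main walkthrough.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

# The training columns now have mean 0 and standard deviation 1
print(np.allclose(X_tr_scaled.mean(axis=0), 0))
print(np.allclose(X_tr_scaled.std(axis=0), 1))
```

The test columns will be close to, but not exactly, mean 0 and variance 1, because they are scaled with the training statistics.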

3. Splitting the Data

Split the data into training and testing sets.

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
  • train_test_split(): Divides the data into training and test sets. test_size specifies the proportion of test data, and random_state ensures reproducibility.
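
The effect of test_size and random_state is easiest to see on tiny data. The arrays below are made up for illustration (10 samples, 2 features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features (made up for illustration)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same seed
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)      # 8 training rows, 2 test rows
print(np.array_equal(a_test, b_test))   # same seed, same split
```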

4. Model Selection and Training

Here, we use a linear regression model. In Scikit-learn, you create a model instance and train it using the fit method.

# Creating and training the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
  • LinearRegression(): Creates a linear regression model.
  • fit(): Trains the model using the training data and target values.
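
After fit, a LinearRegression model stores what it learned in the coef_ and intercept_ attributes. The sketch below fits on the full, unscaled diabetes data purely to show these attributes; it is not the train/test workflow from the main walkthrough.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
reg = LinearRegression().fit(data.data, data.target)

# One coefficient per feature, plus a single intercept
for name, coef in zip(data.feature_names, reg.coef_):
    print(f"{name}: {coef:.1f}")
print(f"intercept: {reg.intercept_:.1f}")
```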

5. Model Evaluation

After training, evaluate how well the model predicts on the test data, which it did not see during training.

# Making predictions
y_pred = model.predict(X_test)

# Calculating the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
  • predict(): Makes predictions for the test features.
  • mean_squared_error(): Computes the mean squared error, the average of the squared differences between actual and predicted values. It measures error, so lower is better.
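
MSE is expressed in squared units of the target, so it is often reported together with its square root (RMSE) and with the R² score from r2_score. The sketch below uses made-up values so the arithmetic can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

mse = mean_squared_error(y_true, y_hat)   # mean of the squared errors
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_hat)              # 1.0 would be a perfect fit

print(mse, rmse, r2)
```

Here the errors are 1, 0, -1, and 0, so the MSE is (1 + 0 + 1 + 0) / 4 = 0.5, and R² is 1 - 2/20 = 0.9.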

Models in Scikit-learn

Scikit-learn offers a variety of machine learning algorithms. Here are some commonly used models:

1. Linear Regression

Linear regression is a basic algorithm for predicting continuous values. It can be easily implemented using the LinearRegression class, as shown earlier.

2. Logistic Regression

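Logistic regression is used for classification tasks, such as binary classification (0 or 1). It needs discrete class labels, and the diabetes target is continuous, so the example below first derives binary labels (1 if the target is above the training median, otherwise 0) purely for illustration.

from sklearn.linear_model import LogisticRegression

# Illustrative binary labels: 1 if above the training median, else 0
y_train_cls = (y_train > y_train.median()).astype(int)

# Creating and training the logistic regression model
clf = LogisticRegression()
clf.fit(X_train, y_train_cls)

# Making predictions (0 or 1) for the test features
y_pred = clf.predict(X_test)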
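
For a classification example with real class labels, the sketch below assumes the bundled breast cancer dataset (not part of the walkthrough above) and evaluates with accuracy_score. The variable names are chosen here to keep it separate from the diabetes example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Binary labels: 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler().fit(Xc_train)
Xc_test_scaled = scaler.transform(Xc_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(Xc_train), yc_train)

yc_pred = clf.predict(Xc_test_scaled)         # predicted class labels
proba = clf.predict_proba(Xc_test_scaled)     # one probability per class
acc = accuracy_score(yc_test, yc_pred)        # share of correct predictions

print(f"Accuracy: {acc:.3f}")
print(proba[:3])
```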

3. Decision Tree

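Decision trees split data based on conditions to classify or predict outcomes. DecisionTreeClassifier needs discrete class labels, so the example derives illustrative binary labels from the continuous diabetes target (1 if above the training median, otherwise 0).

from sklearn.tree import DecisionTreeClassifier

# Illustrative binary labels: 1 if above the training median, else 0
y_train_cls = (y_train > y_train.median()).astype(int)

# Creating and training a decision tree model
tree_clf = DecisionTreeClassifier(max_depth=3)
tree_clf.fit(X_train, y_train_cls)

# Making predictions (0 or 1) for the test features
y_pred = tree_clf.predict(X_test)
  • DecisionTreeClassifier(): A model for classification using decision trees. max_depth limits the depth of the tree, which helps keep it from overfitting the training data.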
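
A fitted tree can be printed as text to see the conditions it learned. The sketch below assumes the bundled iris dataset as stand-in data and uses export_text from sklearn.tree; the name iris_tree is chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# random_state fixes tie-breaking so the tree is reproducible
iris_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
iris_tree.fit(iris.data, iris.target)

print(iris_tree.get_depth())   # never more than max_depth
print(export_text(iris_tree, feature_names=list(iris.feature_names)))
```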

4. k-Nearest Neighbors (k-NN)

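k-NN predicts the class of a data point from the classes of its k nearest neighbors in the training data. KNeighborsClassifier needs discrete class labels, so the example derives illustrative binary labels from the continuous diabetes target (1 if above the training median, otherwise 0).

from sklearn.neighbors import KNeighborsClassifier

# Illustrative binary labels: 1 if above the training median, else 0
y_train_cls = (y_train > y_train.median()).astype(int)

# Creating and training the k-NN model
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train_cls)

# Making predictions (0 or 1) for the test features
y_pred = knn_clf.predict(X_test)
  • KNeighborsClassifier(): Implements the k-NN algorithm. n_neighbors specifies the value of k.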
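
The choice of k affects the result, so it can be worth trying a few values. The sketch below assumes the bundled iris dataset as stand-in data, and the values of k are arbitrary. The score method of a classifier returns its accuracy on the given data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train one model per k and record its test accuracy
accuracy_by_k = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    accuracy_by_k[k] = knn.score(X_te, y_te)
    print(k, round(accuracy_by_k[k], 3))
```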

Summary

In this episode, we covered the basics of using Scikit-learn. With Scikit-learn, you can efficiently implement a consistent workflow for reading data, preprocessing it, training models, and evaluating their performance. Additionally, it provides various machine learning algorithms, making it easier to build models effectively.
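
As a compact recap, the whole workflow can also be written with make_pipeline from sklearn.pipeline, which this episode did not cover. The sketch below is one possible arrangement, not the only one: the pipeline chains the scaler and the model, so calling fit on it scales with statistics from the training rows only.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1 and 3: load the data and split it
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 2 and 4: scaling and the model, chained into one estimator
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

# Step 5: evaluate on the held-out test data
y_pred = pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.1f}")
```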

Next Episode Preview

Next time, we will focus on data preprocessing, including handling missing values and feature scaling. Data preprocessing significantly impacts the accuracy of machine learning models, so don’t miss it!


Annotations

  • Linear Regression: A regression technique that models the linear relationship between inputs and outputs.
  • Logistic Regression: A classification model that predicts classes based on probabilities.
Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
