Recap and Today’s Theme
Hello! In the previous episode, we learned how to solve classification problems using a logistic regression model. We covered the fundamental steps, from data preprocessing to model training, prediction, and evaluation.
Today, we will focus on evaluating machine learning models. Model evaluation is crucial for understanding how well a constructed model performs. In addition to accuracy, we will explore metrics such as precision, recall, F1 score, and AUC score to perform a deeper analysis. Let’s dive into these evaluation methods!
Basics of Model Evaluation
To accurately evaluate model performance, it is essential to split the data into training and test sets. The model is trained on the training data and evaluated on the test data to measure its performance on unseen data. This approach helps determine how well the model can make predictions on unknown inputs.
Using Scikit-learn, we will follow these steps to implement model evaluation:
- Model Prediction and Evaluation Metrics
- Evaluating with a Confusion Matrix
- Precision, Recall, and F1 Score
- ROC Curve and AUC Score
1. Model Prediction and Evaluation Metrics
Using the logistic regression model from the previous episode, we will first evaluate the model using basic accuracy.
Model Preparation
First, we train a logistic regression model on Scikit-learn’s breast cancer dataset (load_breast_cancer) and prepare it for evaluation on the test data.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Loading and preparing the data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
# Separating features and target
X = df.drop('target', axis=1)
y = df['target']
# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Creating and training the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
Calculating Basic Accuracy
# Predicting using the test data
y_pred = model.predict(X_test)
# Calculating model accuracy
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
- Accuracy indicates the proportion of correctly predicted instances out of all test instances. However, accuracy alone can be misleading, particularly when the classes are imbalanced, so other evaluation metrics are used alongside it.
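For reference, the same accuracy can be computed directly from the predictions with accuracy_score. This is a small optional sketch, not part of the original walkthrough, and it reuses y_pred from the code above.
from sklearn.metrics import accuracy_score
# Equivalent to model.score(X_test, y_test): compare predicted labels with the true labels
print(f"Accuracy (from predictions): {accuracy_score(y_test, y_pred):.2f}")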
2. Evaluating with a Confusion Matrix
The confusion matrix is used for detailed analysis of a model’s performance in classification tasks. It includes the following components:
- TP (True Positive): The model predicts the positive class and is correct. In this dataset, the positive class (label 1) is benign.
- TN (True Negative): The model predicts the negative class and is correct (label 0, malignant).
- FP (False Positive): The model predicts positive when the actual class is negative.
- FN (False Negative): The model predicts negative when the actual class is positive.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Calculating the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Visualizing the confusion matrix
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
By examining the confusion matrix, you can identify which classes the model predicts correctly and which ones it misclassifies.
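If you prefer the raw counts to a heatmap, the four components can be read directly from the matrix. A minimal sketch, assuming the standard layout Scikit-learn’s confusion_matrix returns for binary labels 0 and 1:
# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels 0 and 1
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")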
3. Precision, Recall, and F1 Score
From the confusion matrix, we can calculate the following evaluation metrics:
- Precision: The proportion of true positives among all positive predictions.
- Recall: The proportion of true positives among all actual positives.
- F1 Score: The harmonic mean of precision and recall.
from sklearn.metrics import precision_score, recall_score, f1_score
# Calculating precision
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")
# Calculating recall
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")
# Calculating F1 score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.2f}")
- Precision is particularly important when it is necessary to reduce false positives.
- Recall is crucial when minimizing false negatives is important.
- F1 Score balances precision and recall, providing an overall measure of model performance.
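As a complement to computing each metric separately, Scikit-learn’s classification_report summarizes precision, recall, F1 score, and support for every class in a single table. A short optional sketch:
from sklearn.metrics import classification_report
# Per-class precision, recall, F1 score, and support
print(classification_report(y_test, y_pred, target_names=cancer.target_names))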
4. ROC Curve and AUC Score
To visualize model performance, we use the ROC curve and AUC score.
- ROC Curve: A plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) as the classification threshold varies.
- AUC Score: The area under the ROC curve. The closer the score is to 1, the better the model performs.
from sklearn.metrics import roc_curve, roc_auc_score
# Calculating prediction probabilities
y_prob = model.predict_proba(X_test)[:, 1]
# Plotting the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, marker='.')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
# Calculating the AUC score
auc_score = roc_auc_score(y_test, y_prob)
print(f"AUC Score: {auc_score:.2f}")
- An AUC score close to 1 indicates strong model performance. If the score is around 0.5, the model performs no better than random guessing.
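To make the 0.5 baseline concrete, you can redraw the ROC curve alongside the diagonal of a random classifier. This is an optional illustration, reusing fpr, tpr, and auc_score from the code above:
# Plot the model's ROC curve next to the random-guessing diagonal (AUC = 0.5)
plt.plot(fpr, tpr, marker='.', label=f'Logistic Regression (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guessing (AUC = 0.50)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve vs. Random Baseline')
plt.legend()
plt.show()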
Summary
In this episode, we explored various methods for evaluating machine learning models. Beyond accuracy, metrics like the confusion matrix, precision, recall, F1 score, and the ROC curve with AUC score provide deeper insights into model performance. By combining these metrics, you can identify areas for improvement and build more accurate models.
Next Episode Preview
Next time, we will cover hyperparameter tuning using Grid Search to find the optimal parameters for models. Learn how to maximize model performance through parameter adjustments!
Annotations
- Precision: The proportion of correct positive predictions out of all positive predictions made by the model.
- Recall: The proportion of actual positives correctly predicted by the model.
- ROC Curve: A graphical plot that shows the diagnostic ability of a binary classifier system, with the AUC score indicating performance quality.