Recap and Today’s Theme
Hello! In the previous episode, we implemented a linear regression model, one of the most fundamental models in machine learning, to predict trends based on data. Linear regression is a simple yet powerful method for predicting continuous numerical values.
This time, we will focus on logistic regression, a model used for classification tasks. Logistic regression classifies data into categories, most commonly two (e.g., 0 and 1, or True and False), which is known as binary classification. We will implement a logistic regression model using Scikit-learn and learn how it works and how to apply it effectively.
What Is Logistic Regression?
Logistic regression is a statistical model for solving classification problems. Despite its name, it is used for classification rather than regression. Whereas linear regression fits a straight line to predict continuous values, logistic regression passes a weighted sum of the inputs through an S-shaped curve called the sigmoid function and uses the resulting probability to classify data into two or more classes.
The formula for logistic regression is as follows:
\[
P(y=1|x) = \frac{1}{1 + e^{-(w_0 + w_1 \cdot x)}}
\]
- \( P(y=1|x) \): The probability that the input data \( x \) belongs to class 1 (e.g., True, 1)
- \( w_0 \): Intercept (bias term)
- \( w_1 \): Weight (coefficient) associated with the feature
- \( e \): The base of the natural logarithm (approximately 2.718)
The linear combination \( w_0 + w_1 \cdot x \) is passed through the sigmoid function, which maps any real number into the range 0 to 1, so the output can be read as the probability that the sample belongs to class 1. A sample is then typically assigned to class 1 when this probability exceeds 0.5. With several features, the term \( w_1 \cdot x \) simply becomes a weighted sum over all features.
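To make the formula concrete, here is a minimal sketch of the sigmoid function in Python with NumPy. The weight values are made-up numbers for illustration, not parameters learned from data.
import numpy as np
# Sigmoid: squashes any real number into the range (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Hypothetical weights and input, for illustration only
w0, w1 = -1.0, 2.0
x = 1.5
print(sigmoid(w0 + w1 * x))  # P(y=1|x), roughly 0.88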
Implementing a Logistic Regression Model
Let’s implement a logistic regression model using Scikit-learn by following these steps:
- Data Preparation
- Data Preprocessing
- Splitting the Data
- Model Building and Training
- Model Evaluation and Prediction
1. Data Preparation
We will use the breast_cancer dataset provided by Scikit-learn. This dataset contains breast cancer diagnostic measurements and represents the diagnosis result in binary form (0: malignant, 1: benign).
import pandas as pd
from sklearn.datasets import load_breast_cancer
# Loading the dataset
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
# Checking the data
print(df.head())
This dataset contains 30 numeric features describing the cell nuclei of each tumor (e.g., mean radius, texture) and a target variable (target) indicating whether the diagnosis is malignant (0) or benign (1).
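Before modeling, it can also help to check how the two classes are distributed. The snippet below is a quick way to do that with pandas, assuming the df created above.
# Checking the class balance (0: malignant, 1: benign)
print(df['target'].value_counts())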
2. Data Preprocessing
We scale the features because the optimization used by logistic regression is sensitive to the scale of the input data. Standardizing the features (zero mean, unit variance) helps the solver converge and generally improves the resulting model.
from sklearn.preprocessing import StandardScaler
# Selecting features and target
X = df.drop('target', axis=1)
y = df['target']
# Applying scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
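As a quick sanity check, each scaled feature should now have a mean close to 0 and a standard deviation close to 1. A minimal way to verify this, assuming the X_scaled array from above:
import numpy as np
# After standardization, each column should have mean ~0 and std ~1
print(np.round(X_scaled.mean(axis=0), 3))
print(np.round(X_scaled.std(axis=0), 3))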
3. Splitting the Data
Next, we split the data into training and testing sets.
from sklearn.model_selection import train_test_split
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
- train_test_split(): Divides the data into training and testing sets. Setting test_size=0.2 reserves 20% of the data for testing, and random_state=42 makes the split reproducible.
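As an optional variation (not required for this tutorial), train_test_split also accepts a stratify argument, which keeps the malignant/benign ratio the same in both splits. This can matter when the classes are imbalanced.
# Optional: preserve the class proportions in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)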
4. Model Building and Training
We create a logistic regression model and train it with the training data.
from sklearn.linear_model import LogisticRegression
# Creating and training the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
- LogisticRegression(): Instantiates a logistic regression model.
- fit(): Trains the model using the training data and target labels.
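After training, the learned parameters correspond to the intercept \( w_0 \) and the per-feature weights in the formula above. They can be inspected through scikit-learn's standard attributes:
# Intercept (w_0) and one coefficient per feature
print(model.intercept_)
print(model.coef_.shape)  # (1, 30): one weight for each of the 30 features
print(model.coef_)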
5. Model Evaluation and Prediction
To evaluate the model’s performance, we use the test data to make predictions and calculate the accuracy.
# Making predictions with the test data
y_pred = model.predict(X_test)
# Evaluating the model (accuracy)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy}")
- predict(): Uses the test data to predict the classes.
- score(): Evaluates the model’s performance by calculating the accuracy, which is the percentage of correctly classified samples.
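For reference, score() on a classifier returns the same value you would get by computing accuracy from the predicted labels directly:
from sklearn.metrics import accuracy_score
# Same accuracy, computed explicitly from the predictions
print(accuracy_score(y_test, y_pred))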
Visualizing and Analyzing the Model
To gain a deeper understanding of the model’s performance, we use a confusion matrix and ROC curve.
Confusion Matrix
A confusion matrix provides detailed insight into prediction results, showing how well the model classified malignant (class 0) and benign (class 1) cases.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Calculating the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Visualizing the confusion matrix
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
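The four cells of the matrix can also be unpacked into individual counts, which makes them easier to discuss. With labels 0 and 1, scikit-learn orders the raveled matrix as shown in the comments:
# cm.ravel() returns: [actual 0 predicted 0, actual 0 predicted 1,
#                      actual 1 predicted 0, actual 1 predicted 1]
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")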
ROC Curve and AUC Score
We use the ROC curve and AUC score to further evaluate the model’s performance.
from sklearn.metrics import roc_curve, roc_auc_score
# Calculating prediction probabilities
y_prob = model.predict_proba(X_test)[:, 1]
# Plotting the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, marker='.')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
# Calculating the AUC score
auc_score = roc_auc_score(y_test, y_prob)
print(f"AUC Score: {auc_score}")
- roc_curve(): Computes the False Positive Rate (FPR) and True Positive Rate (TPR) at a series of classification thresholds; plotting TPR against FPR gives the ROC curve.
- roc_auc_score(): Calculates the AUC (Area Under the Curve) score, which summarizes model performance across all thresholds. The closer the AUC score is to 1, the better the model; a score of 0.5 corresponds to random guessing.
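Because the ROC curve is built by sweeping the classification threshold, you can also apply a threshold other than the default 0.5 yourself. The value 0.3 below is purely illustrative:
# Classify as class 1 whenever the predicted probability exceeds a chosen threshold
threshold = 0.3  # hypothetical value for illustration
y_pred_custom = (y_prob >= threshold).astype(int)
print((y_pred_custom == y_test).mean())  # accuracy at this threshold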
Summary
In this episode, we implemented a basic logistic regression model. Logistic regression is a fundamental model for classification tasks and is widely used for binary and multi-class classification. Scikit-learn simplifies the entire process from data preprocessing to model training and evaluation. Try experimenting with other datasets to deepen your understanding of classification tasks.
Next Episode Preview
Next time, we will explore model evaluation methods, learning how to evaluate model performance using various metrics such as precision, recall, and F1-score for more in-depth analysis!
Annotations
- Accuracy: The proportion of correctly predicted samples out of all samples.
- ROC Curve: A curve used to visually evaluate model classification performance by showing the relationship between FPR and TPR. A higher AUC score indicates better performance.