[AI from Scratch] Episode 250: Implementing Document Classification


Recap and Today’s Theme

Hello! In the previous episode, we learned about FastText, a word embedding method that uses subword information to handle unknown words and morphological variations, making it particularly useful for tasks like document classification and machine translation.

Today, we will build a news article category classifier through a hands-on implementation in Python. Document classification is the task of predicting which predefined category a given text belongs to, with applications ranging from categorizing news articles to filtering spam emails.

Overview of Document Classification

1. What is Document Classification?

Document classification is the task of assigning text data to predefined categories. It is a widely used technique in machine learning and natural language processing (NLP), applied in real-world scenarios such as news categorization, sentiment analysis, and spam detection.

2. Steps in Document Classification

The typical steps for performing document classification are as follows:

  1. Data Collection: Gather text data for classification.
  2. Data Preprocessing: Clean and process the text data by tokenizing and removing stopwords.
  3. Feature Extraction: Convert the text into numerical vectors using techniques such as Bag-of-Words (BoW), TF-IDF, or word embeddings (e.g., Word2Vec, FastText).
  4. Model Construction: Train a classifier (e.g., logistic regression, support vector machines, or neural networks).
  5. Model Evaluation: Evaluate the model’s performance using metrics such as accuracy, precision, recall, and F1 score.
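
As a quick illustration of how these steps fit together, here is a minimal sketch that chains feature extraction and classification with scikit-learn's Pipeline (the vectorizer settings and classifier choice are placeholders; the sections below walk through each step explicitly):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Steps 1-2: load raw text data (preprocessing omitted in this sketch)
data = fetch_20newsgroups(subset='train')

# Steps 3-4: chain feature extraction and the classifier into one estimator
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipeline.fit(data.data, data.target)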

Implementing a News Article Category Classification Model

We will implement a news article classification model using Python, following these steps:

1. Preparing the Dataset

First, we prepare the dataset. We will use the widely available “20 Newsgroups” dataset, which contains 20 different categories of news articles.

from sklearn.datasets import fetch_20newsgroups

# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='train', categories=['rec.sport.baseball', 'rec.sport.hockey', 'sci.space', 'comp.graphics'])

# Retrieve the data and target labels
texts = newsgroups.data
labels = newsgroups.target

In this code, we load a four-category subset of the dataset: two sports categories (baseball, hockey), one science category (space), and computer graphics.

2. Data Preprocessing

Next, we preprocess the data by cleaning the text, tokenizing it, and removing stopwords.

import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess_text(text):
    # Convert text to lowercase and remove special characters
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize, remove stopwords, and apply stemming
    words = [stemmer.stem(word) for word in text.split() if word not in ENGLISH_STOP_WORDS]
    return ' '.join(words)

# Apply preprocessing to all text data
processed_texts = [preprocess_text(text) for text in texts]

Here, the code converts text to lowercase, removes special characters, eliminates stopwords, and applies stemming to convert words to their root forms.
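
For example, running this function on a short sentence (the exact output depends on the Snowball stemmer and the stopword list, but should look roughly like the comment below):

print(preprocess_text("The rockets are launching into space!"))
# => rocket launch space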

3. Feature Extraction

Next, we convert the processed text into numerical vectors using TF-IDF.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000)

# Transform the text data into TF-IDF vectors
X = vectorizer.fit_transform(processed_texts)

The code above generates TF-IDF vectors with a maximum of 1000 features, converting the text into numerical form based on term frequency and inverse document frequency.
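
To sanity-check the result, you can inspect the matrix shape and a few of the learned vocabulary terms (get_feature_names_out is the scikit-learn 1.0+ API; older versions use get_feature_names):

# X is a sparse matrix with one row per document and up to 1000 columns
print(X.shape)
print(vectorizer.get_feature_names_out()[:10])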

4. Model Construction and Training

We then build and train a logistic regression model for document classification.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

The dataset is split into 80% training data and 20% test data. We then train a logistic regression model to classify the news articles.

5. Model Evaluation

Finally, we evaluate the trained model using the test data.

from sklearn.metrics import classification_report, accuracy_score

# Make predictions using the test data
y_pred = model.predict(X_test)

# Display accuracy and classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=newsgroups.target_names)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)

The code evaluates the model by measuring its accuracy and generating a classification report that includes precision, recall, and F1 scores for each category.
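
Beyond these aggregate scores, a confusion matrix shows which categories are mistaken for which. A minimal addition:

from sklearn.metrics import confusion_matrix

# Rows correspond to true labels, columns to predicted labels
cm = confusion_matrix(y_test, y_pred)
print(cm)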

Improving the Model

1. Improving the Feature Representation

Replacing TF-IDF with word embeddings such as Word2Vec or FastText could better capture the semantic meaning of the text, as sketched below.
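
A minimal sketch of this idea using gensim (gensim and its FastText class are not used elsewhere in this article, so treat them as an assumption about your environment), where each document is represented by the average of its word vectors:

import numpy as np
from gensim.models import FastText

# Train a small FastText model on the preprocessed, tokenized documents
tokenized = [text.split() for text in processed_texts]
ft_model = FastText(sentences=tokenized, vector_size=100, window=5, min_count=2, epochs=5)

def document_vector(tokens, model):
    # Average the word vectors; FastText can embed even out-of-vocabulary tokens
    vectors = [model.wv[token] for token in tokens]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

X_embed = np.vstack([document_vector(tokens, ft_model) for tokens in tokenized])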

2. Changing the Model

Switching to a more advanced model, such as Support Vector Machines (SVM) or deep learning models, may further improve classification accuracy.
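
For example, a linear SVM is a common drop-in replacement for logistic regression on TF-IDF features (a minimal sketch with scikit-learn defaults):

from sklearn.svm import LinearSVC

# Train a linear SVM on the same TF-IDF features
svm_model = LinearSVC()
svm_model.fit(X_train, y_train)
print(f"SVM accuracy: {svm_model.score(X_test, y_test):.2f}")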

3. Hyperparameter Tuning

Adjusting the model’s hyperparameters (e.g., regularization parameters, learning rates) can also enhance model performance.
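
As an example, here is a small grid search over logistic regression's inverse regularization strength C (the parameter grid is purely illustrative):

from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated search over the regularization strength C
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)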

Summary

In this episode, we implemented a news article classification model using Python, covering the entire process from data preprocessing to feature extraction, model construction, and evaluation. Document classification is a fundamental NLP task with numerous applications. Next time, we will explore sentiment analysis, detailing methods to determine sentiment from text.

Next Episode Preview

Next time, we will delve into the basics of sentiment analysis, explaining how to implement techniques to identify emotions in text. Stay tuned!


Notes

  1. Stopwords: Words considered less important in text analysis (e.g., “the”, “is”).
  2. Stemming: A method that reduces words to their base form (e.g., converting “running” to “run”).
  3. Hyperparameter Tuning: Adjusting parameters to optimize a machine learning model’s performance.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
