
[AI from Scratch] Episode 264: N-gram Models


Recap and Today’s Theme

Hello! In the previous episode, we discussed evaluation methods for language models, focusing on metrics like Perplexity, BLEU, and ROUGE to measure performance.

Today, we will explore N-gram models, a fundamental probabilistic language model in natural language processing (NLP). N-gram models are used for tasks such as text generation and spell correction. This episode covers the basic structure of N-gram models, how to build them, and provides an implementation example.

What is an N-gram Model?

1. Basic Concept of N-grams

An N-gram refers to a sequence of N words or characters. The value of N is an integer representing the “window” size of the model:

  • Unigram (1-gram): Considers the probability of individual words.
  • Bigram (2-gram): Predicts the probability of a word based on the previous word.
  • Trigram (3-gram): Predicts the probability of a word based on the two preceding words.

As N increases, the model captures more context, but it also requires far more training data, because the number of possible N-grams grows rapidly with N.

2. How N-gram Models Work

N-gram models predict the probability of the next word based on the previous N-1 words, represented by the following conditional probability:

\[
P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N+1})
\]

Here, \( w_n \) is the word being predicted, and \( w_{n-1}, w_{n-2}, \ldots, w_{n-N+1} \) are the preceding N-1 words.

For example, in a bigram model, the probability is expressed as:

\[
P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n)}{\text{Count}(w_{n-1})}
\]

This formula divides the number of times \( w_n \) follows \( w_{n-1} \) by the total number of occurrences of \( w_{n-1} \).
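
As a concrete illustration with made-up counts: if “machine” occurs 100 times in a corpus and is followed by “learning” in 40 of those occurrences, the bigram estimate is:

\[
P(\text{learning} \mid \text{machine}) = \frac{\text{Count}(\text{machine}, \text{learning})}{\text{Count}(\text{machine})} = \frac{40}{100} = 0.4
\]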

Building an N-gram Model

1. Data Collection and Preprocessing

To build an N-gram model, a large amount of text data is required. The following preprocessing steps are performed:

  • Tokenization: Splitting sentences into words or characters.
  • Stopword Handling: Unlike in many other NLP tasks, common words such as “the,” “is,” and “in” are usually kept when building language models, because they carry contextual information the model relies on.
  • Lowercasing: Converting all text to lowercase to ignore case differences.

2. Generating N-grams

After preprocessing, N-grams are generated from the text. For example, given the sentence:

“I love machine learning”

The generated bigrams (2-grams) would be:

  • (“I”, “love”)
  • (“love”, “machine”)
  • (“machine”, “learning”)

Similarly, trigrams (3-grams) would be:

  • (“I”, “love”, “machine”)
  • (“love”, “machine”, “learning”)
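
The short sketch below shows one way to generate these N-grams in Python. It uses only the standard library and assumes a simple whitespace split, which is enough for this example; the implementation section later in this episode uses nltk instead.

# A minimal sketch of N-gram generation using only the standard library.
def generate_ngrams(tokens, n):
    """Return a list of N-gram tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I love machine learning".split()  # simple whitespace tokenization
print(generate_ngrams(tokens, 2))  # [('I', 'love'), ('love', 'machine'), ('machine', 'learning')]
print(generate_ngrams(tokens, 3))  # [('I', 'love', 'machine'), ('love', 'machine', 'learning')]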

3. Calculating N-gram Probabilities

Next, the frequency of the generated N-grams is calculated, and conditional probabilities are derived. For a bigram model, the frequency of each word pair is used to determine the probability of the next word based on the preceding word.

Implementation Example of an N-gram Model

Here, we demonstrate a simple implementation of a bigram model using Python and the nltk library.

1. Installing Necessary Libraries

First, install the nltk library:

pip install nltk

2. Implementing a Bigram Model

Using the nltk library, we build a bigram model:

import nltk
from nltk.util import ngrams
from collections import Counter

# Download the tokenizer models needed by word_tokenize
# (newer NLTK versions may require 'punkt_tab' instead)
nltk.download('punkt')

# Sample text
text = "I love machine learning. Machine learning is fascinating."

# Preprocessing (Tokenization and Lowercasing)
tokens = nltk.word_tokenize(text.lower())

# Generating bigrams
bigrams = list(ngrams(tokens, 2))

# Counting bigram frequencies
bigram_freq = Counter(bigrams)

# Counting word frequencies
word_freq = Counter(tokens)

# Calculating conditional probabilities
def bigram_probability(bigram):
    word1, word2 = bigram
    return bigram_freq[bigram] / word_freq[word1]

# Displaying probabilities
for bigram in bigram_freq:
    print(f"P({bigram[1]} | {bigram[0]}) = {bigram_probability(bigram):.4f}")

This code tokenizes the text, generates bigrams, and calculates conditional probabilities based on their frequencies.

Challenges and Improvements in N-gram Models

1. Data Sparsity

As N increases, the model runs into data sparsity: many valid word combinations never appear in the training data, so unseen sequences receive a probability of zero.

Solution: Smoothing Techniques

Smoothing techniques (e.g., Laplace smoothing) mitigate this problem by adding a small value to every frequency count. Laplace smoothing, for instance, adds 1 to each count (and the vocabulary size V to the denominator) so that unseen sequences receive a small but non-zero probability.
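
As a sketch of what add-one smoothing could look like in code, the function below extends the bigram probability from the implementation example above. It assumes the bigram_freq and word_freq counters from that example have already been built, and uses the number of distinct tokens as the vocabulary size V.

# A minimal sketch of add-one (Laplace) smoothing, reusing the bigram_freq
# and word_freq counters from the earlier bigram example.
def laplace_bigram_probability(bigram, vocab_size):
    """P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V)."""
    word1, _word2 = bigram
    return (bigram_freq[bigram] + 1) / (word_freq[word1] + vocab_size)

V = len(word_freq)  # vocabulary size: number of distinct tokens
# An unseen pair now receives a small but non-zero probability:
print(laplace_bigram_probability(("learning", "models"), V))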

2. Lack of Long-Distance Dependency

N-gram models only consider a fixed window of N words, limiting their ability to capture long-term dependencies in text. This makes it challenging for the model to understand the overall context of a sentence.

Solution: Using Recurrent Neural Networks (RNNs) or Transformers

More advanced models like RNNs and Transformers address this limitation by capturing long-distance dependencies and overall context, leading to more accurate language modeling.

Applications of N-gram Models

1. Text Generation

N-gram models can predict the next word in a sequence, enabling text generation. For example, a bigram model can generate sentences by randomly selecting the next word based on the current word.
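
One simple way to do this with the bigram counts from the implementation example is sketched below: starting from a seed word, the next word is repeatedly sampled in proportion to the observed bigram frequencies. The seed word ("machine") and the output length are arbitrary choices for illustration, and the code assumes bigram_freq from the earlier example is available.

import random

# A minimal sketch of bigram-based text generation, reusing bigram_freq
# from the earlier implementation example.
def generate_text(seed, length=10):
    """Generate up to `length` words by sampling successors from bigram counts."""
    current, output = seed, [seed]
    for _ in range(length - 1):
        # Collect all observed successors of the current word and their counts
        successors = {w2: count for (w1, w2), count in bigram_freq.items() if w1 == current}
        if not successors:
            break  # no observed successor; stop early
        words, counts = zip(*successors.items())
        current = random.choices(words, weights=counts, k=1)[0]
        output.append(current)
    return " ".join(output)

print(generate_text("machine"))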

2. Spell Correction

N-gram models can also be used for spell correction by predicting the correct word based on the surrounding context. This is useful in applications where slight typographical errors need to be corrected automatically.
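
A bigram model offers one simple scoring rule for this: among a set of candidate corrections, pick the one that is most probable given the previous word, as sketched below. The candidate list is hypothetical, the code reuses bigram_probability from the implementation example, and a real spell checker would also model character-level edits rather than context alone.

# A minimal sketch: choose the candidate correction that is most probable
# after the previous word, reusing bigram_probability from the earlier example.
def best_correction(previous_word, candidates):
    """Return the candidate with the highest bigram probability given previous_word."""
    return max(candidates, key=lambda c: bigram_probability((previous_word, c)))

# For a typo such as "machne" appearing after "love", with hypothetical candidates:
print(best_correction("love", ["machine", "matched"]))  # expected: "machine"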

3. Language Detection

By analyzing the frequency of specific N-grams in each language, N-gram models can be used to identify the language of a given text.
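
A minimal sketch of this idea appears below: build a character-trigram profile (a frequency counter) from a small reference sample of each language, then label new text with the language whose profile it overlaps most. The reference sentences and the overlap score are simplified, illustrative choices; practical detectors use much larger profiles and more robust similarity measures.

from collections import Counter

# A minimal sketch of character-trigram language detection.
def char_trigrams(text):
    """Return a Counter of character trigrams for lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Tiny, illustrative reference samples (far too small for real use)
profiles = {
    "english": char_trigrams("the quick brown fox jumps over the lazy dog"),
    "spanish": char_trigrams("el rapido zorro marron salta sobre el perro perezoso"),
}

def detect_language(text):
    """Pick the language whose trigram profile shares the most counts with the text."""
    grams = char_trigrams(text)
    return max(profiles, key=lambda lang: sum((grams & profiles[lang]).values()))

print(detect_language("the dog jumps over the fox"))  # expected: "english"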

Summary

This episode covered the basics of N-gram models, their structure, how they are built, and how to implement them. N-gram models are simple and easy to understand, making them useful for tasks such as text generation and spell correction.

Next Episode Preview

Next time, we will discuss spell correction, learning how to automatically fix typos and errors in text.


Notes

  1. N-gram: A sequence of N words or characters.
  2. Conditional Probability: The probability of an event occurring given that another event has already occurred.
  3. Laplace Smoothing: A technique used to prevent zero probabilities in N-gram models.