
[AI from Scratch] Episode 258: Text Comparison Using Cosine Similarity


Recap and Today’s Theme

Hello! In the previous episode, we covered Named Entity Recognition (NER), a technique for extracting and categorizing proper nouns within text. NER is widely applied in tasks like information extraction and improving search engine performance.

Today, we will discuss Cosine Similarity, a method for measuring the similarity between texts by converting them into numerical vectors. Cosine similarity is commonly used in document retrieval, duplicate detection, and document clustering. In this episode, we will explain the basic concept of cosine similarity, how it is calculated, and provide a practical implementation example.

What is Cosine Similarity?

1. Basic Concept of Cosine Similarity

Cosine Similarity measures the similarity between two vectors using the cosine of the angle between them: the smaller the angle, the closer the cosine is to 1 and the more similar the vectors are considered, while vectors forming a 90-degree angle have a cosine of 0 and are regarded as unrelated.

Cosine similarity is calculated using the following formula:

\[
\text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}
\]

Where:

  • \( \mathbf{A} \cdot \mathbf{B} \) represents the dot product of vectors A and B.
  • \( \|\mathbf{A}\| \) and \( \|\mathbf{B}\| \) denote the norms (lengths) of vectors A and B.

The value of cosine similarity ranges from -1 to 1:

  • Values close to 1 indicate high similarity.
  • Values near 0 imply no relationship.
  • Values around -1 suggest opposite directions.
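
To make the formula concrete, here is a minimal worked sketch using NumPy (the two vectors are made-up values for illustration only):

import numpy as np

# Two toy vectors (made-up values for illustration)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 0.0])

# Dot product divided by the product of the norms
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 4 / (sqrt(5) * sqrt(5)) = 0.8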

2. Applying Cosine Similarity in Text Comparison

To measure the similarity between texts, the texts are first converted into numerical vectors, and cosine similarity is then computed between those vectors. Text data can be transformed into vectors using methods such as TF-IDF or word embeddings.

Methods for Vectorizing Text

To compute the similarity between texts using cosine similarity, the text must first be converted into numerical vectors. Common methods include:

1. Bag-of-Words (BoW)

Bag-of-Words (BoW) is a simple method that represents text by counting the occurrence of words. Each dimension in the vector represents the frequency of a specific word. However, BoW does not consider the meaning or order of words, making it challenging to accurately capture semantic similarity.
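
As a minimal sketch, scikit-learn's CountVectorizer produces exactly this kind of representation (the two sentences are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# Two short example sentences (illustrative)
corpus = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # one vocabulary entry per dimension
print(bow_matrix.toarray())                # word counts per sentence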

2. TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF combines term frequency (TF) and inverse document frequency (IDF) to evaluate the importance of words. Words that appear in many documents receive low weights, while words that appear in only a few documents receive higher weights, which highlights more distinctive features.
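
One common formulation of the weight of term t in document d is:

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \frac{N}{\text{df}(t)}
\]

where \( \text{tf}(t, d) \) is the frequency of t in d, N is the total number of documents, and \( \text{df}(t) \) is the number of documents containing t. (Implementations vary; scikit-learn, for example, applies a smoothed variant by default.)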

3. Word Embeddings

Word Embeddings use models like Word2Vec, GloVe, or BERT to convert words into vectors that reflect semantic relationships. The vector for the entire sentence or document can be obtained by averaging or summing the vectors of each word.
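
As one possible sketch of the averaging approach (assuming the gensim library, version 4.x, is installed; the toy corpus and vector size are arbitrary illustrative choices):

import numpy as np
from gensim.models import Word2Vec

# Tiny tokenized corpus (illustrative; real models need far more data)
sentences = [
    ["machine", "learning", "is", "fun"],
    ["natural", "language", "processing", "is", "exciting"],
]

# Train a small Word2Vec model
model = Word2Vec(sentences, vector_size=50, min_count=1, seed=42)

# Represent a sentence as the average of its word vectors
def sentence_vector(tokens, model):
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0)

vec = sentence_vector(["machine", "learning", "is", "exciting"], model)
print(vec.shape)  # (50,)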

Implementation of Text Comparison Using Cosine Similarity

Here, we implement cosine similarity for text comparison using the scikit-learn library and the TF-IDF method.

1. Installing Required Libraries

First, install the necessary libraries:

pip install scikit-learn

2. Calculating Cosine Similarity

The following code converts text into TF-IDF vectors and calculates the cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "I love machine learning. It's amazing.",
    "Machine learning is a fascinating field.",
    "I enjoy doing natural language processing.",
    "Natural language processing is a challenging and exciting field."
]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Vectorize the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Display the result
print("Cosine Similarity Matrix:")
print(cosine_sim)

This code converts four sample documents into TF-IDF vectors and calculates the cosine similarity between them. The cosine_sim matrix contains the similarity values between each pair of documents.

3. Calculating Similarity with a Specific Document

You can also calculate the similarity between a specific document and all others. For example, the following code displays the similarity between the first document and all others:

# Similarity with the first document
similarities = cosine_sim[0]

# Display results
for idx, score in enumerate(similarities):
    print(f"Document 0 vs Document {idx}: Similarity = {score:.2f}")

This code calculates and displays the cosine similarity between the first document (index 0) and all other documents.
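
To turn these scores into a ranking, the indices can be sorted by similarity, highest first (a small extension of the code above):

import numpy as np

# Rank the other documents by similarity to document 0
ranked = np.argsort(similarities)[::-1]
for idx in ranked:
    if idx != 0:  # skip document 0 itself
        print(f"Document {idx}: Similarity = {similarities[idx]:.2f}")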

Applications of Cosine Similarity

1. Document Retrieval

Cosine similarity is used to rank search results by calculating the similarity between a search query and documents. It helps identify the most relevant documents based on similarity to the query.
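
A minimal retrieval sketch, reusing the fitted vectorizer and tfidf_matrix from the implementation above (the query string is illustrative):

# Vectorize a search query with the already-fitted vectorizer
query = "machine learning field"
query_vec = vectorizer.transform([query])

# Similarity between the query and every document
scores = cosine_similarity(query_vec, tfidf_matrix)[0]

# Index of the most relevant document
best = scores.argmax()
print(f"Most relevant: Document {best} (score = {scores[best]:.2f})")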

2. Duplicate Detection

Cosine similarity can detect duplicate documents or copied content. For example, it can determine the degree of overlap between news articles from different sources.
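
A simple sketch is to flag pairs whose similarity exceeds a threshold, reusing the cosine_sim matrix from above (the 0.8 threshold is an arbitrary illustrative value that should be tuned on real data):

# Flag document pairs with similarity above a threshold
THRESHOLD = 0.8  # illustrative value; tune for your data

n_docs = cosine_sim.shape[0]
for i in range(n_docs):
    for j in range(i + 1, n_docs):
        if cosine_sim[i, j] > THRESHOLD:
            print(f"Possible duplicates: Document {i} and Document {j}")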

3. Clustering

Cosine similarity can be used to cluster documents with similar content. This technique organizes large text datasets efficiently by grouping similar documents.
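
One way to sketch this is k-means on L2-normalized TF-IDF vectors, which approximates clustering by cosine similarity (the number of clusters is an arbitrary illustrative choice):

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# L2-normalization makes Euclidean distance correspond to cosine distance
normalized = normalize(tfidf_matrix)

# Cluster the four sample documents into 2 groups (illustrative choice)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(normalized)
print(labels)  # cluster assignment for each document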

Challenges and Improvements in Cosine Similarity

1. Ignoring Word Semantics

When using methods like TF-IDF or BoW, cosine similarity does not account for semantic relationships between words. As a result, texts with the same meaning but different wording may receive low similarity scores. Incorporating word embeddings or contextual models such as BERT can mitigate this.
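
For instance, a sketch using the sentence-transformers library (assuming it is installed; all-MiniLM-L6-v2 is one commonly used model, chosen here for illustration):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Two sentences with similar meaning but little word overlap
sentences = ["I adore canines.", "Dogs are my favorite animals."]

# Encode the sentences into dense semantic vectors
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

print(cosine_similarity([embeddings[0]], [embeddings[1]]))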

2. Influence of Long Texts

In long texts, important words may be obscured: TF-IDF tends to emphasize words that occur with high frequency, potentially diluting meaningful information. Weighting adjustments or context-aware models can help address this, as in the sketch below.
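
One simple mitigation available in scikit-learn itself is sublinear term-frequency scaling, which replaces raw counts with 1 + log(tf) so that heavily repeated words dominate less (reusing the documents list from the implementation above):

from sklearn.feature_extraction.text import TfidfVectorizer

# sublinear_tf=True uses 1 + log(tf) instead of raw term counts,
# dampening words repeated many times in long documents
vectorizer = TfidfVectorizer(sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(documents)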

Summary

This episode explained cosine similarity for text comparison, covering its basic concept and implementation. Cosine similarity is a simple and effective technique used in document retrieval, duplicate detection, and clustering. It plays a vital role in many NLP tasks.

Next Episode Preview

Next time, we will discuss Thesaurus and WordNet, resources that capture word relationships and aid in understanding semantic connections.


Notes

  1. TF-IDF: A method combining term frequency and inverse document frequency to evaluate word importance within a text.
  2. BoW (Bag-of-Words): A method for vectorizing text based on word frequency, without considering word order or context.
  3. Word Embedding: A method for converting words into vectors that capture semantic relationships.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
