
[AI from Scratch] Episode 257: Named Entity Recognition (NER)


Recap and Today’s Theme

Hello! In the previous episode, we covered the basics of Topic Modeling (LDA), explaining how to extract latent topics from documents. LDA is a powerful technique for summarizing and organizing text data.

Today, we’ll explore Named Entity Recognition (NER), a crucial technique in NLP that identifies and classifies entities like names, locations, and dates within text. NER is fundamental in various NLP applications, such as information extraction and document classification. This episode covers the basic concepts, common methods, and a practical implementation of NER using BERT.

What is Named Entity Recognition (NER)?

1. Basic Concept of NER

Named Entity Recognition (NER) is a task that identifies specific entities within a text and categorizes them into predefined classes. For example, NER can extract entities such as Person (PERSON), Location (LOCATION), Organization (ORGANIZATION), Date (DATE), and Money (MONEY) from a document. NER is widely used in applications like information retrieval, question answering, and document classification.
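For example, for a sentence like the one used later in this episode, an NER system typically produces token-level labels in the common BIO scheme (B- marks the beginning of an entity, I- a continuation, and O a token outside any entity):

# Illustrative token-level NER output in the BIO tagging scheme
# (label names vary by dataset; PERSON/LOCATION are used here to match the text)
tagged = [
    ("Barack", "B-PERSON"), ("Obama", "I-PERSON"),
    ("was", "O"), ("born", "O"), ("in", "O"),
    ("Hawaii", "B-LOCATION"), (".", "O"),
]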

2. Applications of NER

NER is applied across various fields, such as:

  • Information Retrieval: Useful in searching for specific information about people or places.
  • Customer Support: Identifies product names or locations from user queries, improving response efficiency.
  • Healthcare: Extracts medication or disease names from medical records, aiding diagnosis and data analysis.

Common Methods for NER

NER can be performed using two main approaches: Rule-Based and Machine Learning-Based methods.

1. Rule-Based Approach

The rule-based method uses manually defined rules, such as regular expressions or keyword lists, to identify entities that match certain patterns; a minimal sketch follows the list below.

  • Advantages: Simple to understand and can yield high accuracy in specific domains.
  • Disadvantages: Creating and maintaining rules is labor-intensive, and the approach struggles with generalization across different domains and languages.
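As a concrete illustration, here is a minimal rule-based sketch; the patterns and the tiny location list are illustrative assumptions, not a production rule set:

import re

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")          # e.g. 2024-05-01
MONEY_PATTERN = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")  # e.g. $1,250.00
LOCATIONS = {"Hawaii", "Tokyo", "Paris"}                     # tiny keyword list (gazetteer)

def rule_based_ner(text):
    entities = [(m.group(), "DATE") for m in DATE_PATTERN.finditer(text)]
    entities += [(m.group(), "MONEY") for m in MONEY_PATTERN.finditer(text)]
    entities += [(w.strip(".,"), "LOCATION") for w in text.split()
                 if w.strip(".,") in LOCATIONS]
    return entities

print(rule_based_ner("On 2024-05-01, flights to Hawaii cost $1,250.00."))
# [('2024-05-01', 'DATE'), ('$1,250.00', 'MONEY'), ('Hawaii', 'LOCATION')]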

2. Machine Learning-Based Approach

Machine learning-based NER uses labeled datasets to train models that automatically identify entities. Common methods include:

  • Conditional Random Fields (CRF): Considers sequential dependencies between words for label prediction.
  • Hidden Markov Model (HMM): Uses the probability of word occurrence and label transitions.
  • Deep Learning: Models like LSTM or BERT capture contextual information, allowing for sophisticated entity recognition.

Among these, deep learning models, and BERT in particular, are now the most widely used because of their strong ability to capture context; for comparison, a minimal CRF example is sketched below.
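Here is a minimal CRF sketch using the sklearn-crfsuite package (an assumption: install it with pip install sklearn-crfsuite). Each token is described by hand-crafted features, and the CRF learns label transitions on top of them:

import sklearn_crfsuite

def word2features(sent, i):
    word = sent[i][0]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "prev": sent[i - 1][0].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1][0].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data: (token, BIO tag) pairs; real training uses thousands of sentences
train_sents = [
    [("Barack", "B-PER"), ("Obama", "I-PER"), ("visited", "O"),
     ("Hawaii", "B-LOC"), (".", "O")],
]
X_train = [[word2features(s, i) for i in range(len(s))] for s in train_sents]
y_train = [[tag for _, tag in s] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))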

Implementing NER with BERT

This section demonstrates how to implement NER using BERT and Hugging Face’s transformers library.

1. Preparation

First, install the required libraries. The example below uses the TensorFlow model class (TFBertForTokenClassification), so tensorflow is needed; if you prefer the PyTorch classes, install torch instead:

pip install transformers
pip install tensorflow

2. Model Preparation

Rather than training a model from scratch, we will use a BERT model that has already been fine-tuned on the CoNLL-2003 dataset, a standard NER benchmark in which each word is labeled with its entity type.

from transformers import BertTokenizer, TFBertForTokenClassification
from transformers import pipeline

# Load the BERT tokenizer and model
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = TFBertForTokenClassification.from_pretrained(model_name)

# Create the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

This code loads a BERT-based NER model (dbmdz/bert-large-cased-finetuned-conll03-english) and initializes an NER pipeline around it.

3. Running NER

Now, let’s use the pipeline to extract entities from a sample text:

# Sample text
text = "Barack Obama was born in Hawaii and was the president of the United States."

# Perform NER
entities = ner_pipeline(text)

# Display results
for entity in entities:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.2f}")

This code extracts entities from the text, outputting the entity, its type (e.g., PER for person), and the confidence score.

4. Handling Subword Tokens

BERT’s tokenizer splits words into subword pieces (continuation pieces are prefixed with "##"), so these pieces must be merged back together to obtain readable entity names.

# Function to merge subword tokens back into full words
def merge_subwords(entities):
    merged_entities = []
    current_entity = None

    for entity in entities:
        if entity['word'].startswith("##") and current_entity is not None:
            # Continuation subword: append it to the current word
            # (the merged entity keeps the first subword's label and score)
            current_entity['word'] += entity['word'][2:]
        else:
            if current_entity:
                merged_entities.append(current_entity)
            # Copy the dict so the pipeline's original output is not mutated
            current_entity = dict(entity)

    if current_entity:
        merged_entities.append(current_entity)

    return merged_entities

# Merge subwords and display merged entities
merged_entities = merge_subwords(entities)
for entity in merged_entities:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.2f}")

This code combines subwords to form complete entities, ensuring the output is readable.
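Note that recent versions of transformers can perform this grouping for you through the pipeline's aggregation_strategy argument (older releases used grouped_entities=True instead); a quick sketch:

# Let the pipeline group subwords itself; grouped results use the
# 'entity_group' key instead of 'entity'
ner_grouped = pipeline("ner", model=model, tokenizer=tokenizer,
                       aggregation_strategy="simple")
for entity in ner_grouped(text):
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")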

Evaluation Metrics for NER

Common metrics used to evaluate NER performance, illustrated in the short sketch after this list, include:

  • Precision: the fraction of predicted entities that are actually correct.
  • Recall: the fraction of true entities that the model successfully finds.
  • F1 Score: the harmonic mean of precision and recall, giving a single balanced measure.
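These are usually computed at the entity level rather than per token. A small sketch using the seqeval package (an assumption: install it with pip install seqeval):

from seqeval.metrics import precision_score, recall_score, f1_score

# Gold labels contain two entities (a PER span and a LOC span);
# the prediction finds only the PER span
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 1.00 (1 of 1 predicted entity is correct)
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.50 (1 of 2 gold entities found)
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # 0.67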

Challenges and Solutions in NER

1. Domain Adaptation

NER models trained on general datasets may perform poorly in specialized domains (e.g., medical or legal texts). Transfer learning and domain-specific fine-tuning can enhance model accuracy in these cases.
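As a rough outline of what such fine-tuning looks like, here is a minimal sketch using Hugging Face's Trainer API with the CoNLL-2003 dataset (assumptions: the datasets package is installed, and the PyTorch model classes are used, since Trainer is PyTorch-based; the label alignment is simplified for brevity):

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer,
                          DataCollatorForTokenClassification)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list))

def tokenize_and_align(examples):
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, ids = None, []
        for wid in word_ids:
            # Label only the first subword of each word; -100 is ignored by the loss
            ids.append(-100 if wid is None or wid == previous else tags[wid])
            previous = wid
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_ds = dataset.map(tokenize_and_align, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()

For a specialized domain, the same recipe applies with a domain-specific labeled dataset (and, often, a domain-adapted base model) in place of CoNLL-2003.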

2. Ambiguity

Words with multiple meanings (polysemy) can make entity recognition challenging. Using context-aware models like BERT helps mitigate this problem by understanding the context of each word.
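As a quick illustration, reusing the pipeline built earlier, the same surface form can be tagged differently depending on context:

# "Apple" in a business context is typically tagged as an organization...
print(ner_pipeline("Apple announced a new iPhone in California."))
# ...while the lowercase fruit sense yields no entity ("California" is still a location)
print(ner_pipeline("She ate an apple in California."))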

Summary

This episode introduced Named Entity Recognition (NER), discussing its fundamental concepts, common methods, and an implementation using BERT. NER is essential in information extraction and various NLP applications, and improving its accuracy enhances the effectiveness of many systems.

Next Episode Preview

Next time, we will cover Cosine Similarity for Text Comparison, explaining how to calculate text similarity for applications like document clustering and information retrieval.


Notes

  1. Conditional Random Fields (CRF): A probabilistic model used for labeling sequence data.
  2. Subword Tokenization: A method used by BERT’s tokenizer that splits words into smaller units (subwords) for processing.
  3. Transfer Learning: Applying a pre-trained model to a new task to improve performance.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
