Recap and Today’s Theme
Hello! In the previous episode, we explored Word Embeddings, where words are represented as low-dimensional vectors that capture semantic relationships, which improves performance on NLP tasks. Unlike BoW and TF-IDF, word embeddings take the semantic similarity between words into account.
Today, we will discuss Word2Vec, a fundamental method for learning word embeddings. Word2Vec learns the semantic similarities between words and represents them as vectors in a vector space. This episode covers the basic concepts of Word2Vec, its main methods (CBOW and Skip-gram), and the mechanism behind the model.
What is Word2Vec?
1. The Basic Concept of Word2Vec
Word2Vec is an algorithm developed by Google that converts words into low-dimensional continuous vectors that capture their semantic features. It uses the surrounding words (context) to learn the meaning of each target word, aligning with the Distributional Hypothesis, which states that “words with similar meanings appear in similar contexts.”
Once the training is complete, words are positioned in the vector space such that semantically similar words are located close to each other. For instance, “king” and “queen” or “dog” and “cat” are positioned near one another.
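To make this concrete, here is a minimal sketch of training Word2Vec with the gensim library; the toy corpus and hyperparameters are illustrative assumptions, not something taken from the text above.

```python
# A minimal sketch using the gensim library (install with `pip install gensim`).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sits", "on", "the", "rug"],
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=0,             # 0 = CBOW, 1 = Skip-gram
)

# Words that appear in similar contexts end up close together in the vector space.
print(model.wv.most_similar("cat", topn=3))
```

With a realistic corpus, the nearest neighbours of “cat” would include words like “dog” that occur in similar contexts.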
2. Learning Based on the Distributional Hypothesis
As noted above, the Distributional Hypothesis holds that words with similar meanings appear in similar contexts. Word2Vec exploits this principle: by learning to relate each word to the words that surround it, it produces vectors that reflect semantic proximity.
Main Methods of Word2Vec: CBOW and Skip-gram
Word2Vec uses two main learning approaches:
1. CBOW (Continuous Bag of Words)
The CBOW model predicts the target (center) word based on the surrounding context words. Specifically, the words within a window of N positions on either side of the center word are used to predict it. This method trains efficiently on large datasets and works well for frequently occurring words.
For example, in the sentence “The cat sits on the mat,” if “cat” is the center word and the window size is 2, the context words “The,” “sits,” and “on” are used to predict “cat.”
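To illustrate how CBOW training examples are formed, the following sketch builds (context, center) pairs from this sentence, assuming a window size of 2 and simple lowercase tokenization:

```python
# Build (context words, center word) training pairs for CBOW with window size 2.
tokens = ["the", "cat", "sits", "on", "the", "mat"]
window = 2

for i, center in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    print(f"context {context} -> predict center '{center}'")
# e.g. context ['the', 'sits', 'on'] -> predict center 'cat'
```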
2. Skip-gram
Skip-gram works in the opposite direction to CBOW: it uses the center word to predict the surrounding context words. This model excels at learning representations for rare words, since every occurrence of a word produces a training pair for each of its context words.
In the previous example, Skip-gram would take “cat” as the center word and predict the surrounding words “The,” “sits,” and “on.”
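The pairing is simply reversed; here is a sketch under the same assumptions as the CBOW example:

```python
# Build (center word, context word) training pairs for Skip-gram with window size 2.
tokens = ["the", "cat", "sits", "on", "the", "mat"]
window = 2

for i, center in enumerate(tokens):
    for context in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
        print(f"center '{center}' -> predict context '{context}'")
# e.g. center 'cat' -> predict context 'the', 'sits', 'on'
```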
The Mechanism of Word2Vec
1. Neural Network Structure
Word2Vec uses a shallow neural network with a single hidden layer to learn word vectors. The model consists of the following three layers (a code sketch follows the list):
- Input Layer: The input word (the context words for CBOW, or the center word for Skip-gram) is fed in as a One-Hot encoded vector.
- Hidden Layer: A fully connected linear layer whose number of nodes equals the dimensionality of the word vectors. The input-to-hidden weight matrix is what gets learned; its rows become the embedding vectors of the words.
- Output Layer: The network outputs a probability distribution over the vocabulary (via softmax), which is compared against the One-Hot vector of the true target word.
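Below is a minimal NumPy sketch of this three-layer structure for the Skip-gram case; the tiny vocabulary, embedding dimension, and random initialization are assumptions made purely for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sits", "on", "mat"]
V, D = len(vocab), 10          # vocabulary size and embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # input -> hidden weights (the embeddings)
W_out = rng.normal(scale=0.1, size=(D, V))   # hidden -> output weights

# Input layer: one-hot vector for the center word "cat".
x = np.zeros(V)
x[vocab.index("cat")] = 1.0

# Hidden layer: a linear projection; multiplying by a one-hot vector
# simply selects one row of W_in, i.e. the embedding of "cat".
h = x @ W_in

# Output layer: scores for every word, turned into probabilities with softmax.
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(dict(zip(vocab, probs.round(3))))
```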
2. Training Process
The training process in Word2Vec proceeds as follows (a code sketch follows the list):
- Generating Word Pairs: From the training data, pairs of target and context words are generated. In CBOW, context words are used to predict the target, while in Skip-gram, the target is used to predict the context.
- Training the Neural Network: The generated pairs are used to train the network. A One-Hot encoded vector is fed to the input layer, and the network computes a probability for each candidate word at the output layer.
- Backpropagation of Errors: The error between the predicted distribution and the actual target is computed and backpropagated to update the weights. After training, the input-layer weight matrix provides the word embedding vectors.
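Here is a compact NumPy sketch of one such training step for Skip-gram pairs using a full softmax; the vocabulary, dimensions, and learning rate are illustrative assumptions, and real implementations rely on the efficiency techniques described next.

```python
import numpy as np

vocab = ["the", "cat", "sits", "on", "mat"]
word2id = {w: i for i, w in enumerate(vocab)}
V, D, lr = len(vocab), 10, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(D, V))

def train_pair(center, context):
    """One gradient step for a (center word -> context word) pair."""
    global W_in, W_out
    c, o = word2id[center], word2id[context]
    h = W_in[c]                          # hidden layer = embedding of the center word
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    err = probs.copy()
    err[o] -= 1.0                        # predicted distribution minus one-hot target
    grad_h = W_out @ err                 # gradient with respect to the hidden layer
    W_out -= lr * np.outer(h, err)       # update hidden -> output weights
    W_in[c] -= lr * grad_h               # update the embedding of the center word
    return -np.log(probs[o])             # cross-entropy loss for monitoring

for epoch in range(200):
    loss = train_pair("cat", "sits") + train_pair("cat", "on")
print("final loss:", round(float(loss), 4))
```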
3. Negative Sampling and Hierarchical Softmax
To keep the output-layer computation tractable, Word2Vec uses one of two techniques:
a. Negative Sampling
Negative Sampling is a method, applicable to both CBOW and Skip-gram, that reduces computation costs: instead of computing a softmax over the entire vocabulary, the model only has to distinguish the observed “positive example” (a real context word) from a handful of randomly sampled “negative examples” (non-context words). This significantly reduces computation time and allows efficient learning.
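As a sketch of the objective, the following computes the negative-sampling loss for a single training pair; the vectors here are random placeholders standing in for learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 10, 5                       # embedding dimension, number of negative samples

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_center = rng.normal(scale=0.1, size=D)          # embedding of the center word
u_context = rng.normal(scale=0.1, size=D)         # output vector of the true context word
u_negatives = rng.normal(scale=0.1, size=(K, D))  # output vectors of K sampled "noise" words

# Negative-sampling objective: push the true pair's score up,
# push the K randomly sampled negative pairs' scores down.
loss = -np.log(sigmoid(u_context @ v_center)) \
       - np.sum(np.log(sigmoid(-u_negatives @ v_center)))
print("negative-sampling loss:", round(float(loss), 4))
```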
b. Hierarchical Softmax
Hierarchical Softmax models the probability distribution over words with a binary tree, reducing the cost of computing a word’s probability from linear to logarithmic in the vocabulary size. This makes it particularly effective for large vocabularies.
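In practice, the choice between the two techniques is usually a configuration flag. For example, in gensim (the corpus and parameter values here are illustrative assumptions):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"]]

# Skip-gram with negative sampling (5 noise words per positive pair).
model_ns = Word2Vec(sentences, sg=1, negative=5, hs=0, min_count=1)

# Skip-gram with hierarchical softmax (negative sampling disabled).
model_hs = Word2Vec(sentences, sg=1, negative=0, hs=1, min_count=1)
```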
Applications of Word2Vec
1. Word Similarity Calculations
Word vectors learned by Word2Vec can be used to measure the similarity between words based on their proximity in the vector space (typically cosine similarity). Vector arithmetic also works: operations such as “king” – “man” + “woman” ≈ “queen” demonstrate the semantic relationships captured in the vectors.
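Here is a sketch of both operations using gensim’s KeyedVectors; it assumes a pretrained word2vec-format model file is available locally (the filename below is just an example).

```python
from gensim.models import KeyedVectors

# Assumes a pretrained word2vec-format file is present locally; the path is an example.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Cosine similarity between two words.
print(wv.similarity("dog", "cat"))

# Vector arithmetic: king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```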
2. Text Classification
Word2Vec vectors can be used to build feature vectors for entire documents (for example, by averaging the vectors of the words they contain), which can then be applied to text classification tasks such as categorizing news articles or sentiment analysis.
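One simple and common approach is to average a document’s word vectors and feed the result to a classifier. Below is a sketch under assumed toy documents and labels, using scikit-learn for the classifier.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy corpus and sentiment labels are illustrative assumptions.
docs = [["great", "movie", "loved", "it"], ["terrible", "movie", "hated", "it"],
        ["loved", "the", "acting"], ["hated", "the", "plot"]]
labels = [1, 0, 1, 0]

w2v = Word2Vec(docs, vector_size=20, window=2, min_count=1, sg=1)

def doc_vector(tokens):
    """Average the vectors of the words in the document (ignoring unknown words)."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([doc_vector(["loved", "the", "movie"])]))
```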
3. Machine Translation
Word2Vec is also applied in machine translation. By mapping word vectors from different languages into the same vector space, it becomes possible to learn semantic correspondences between languages.
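A rough sketch of the idea: learn a linear map between two embedding spaces from a small bilingual dictionary. The vectors below are random placeholders standing in for Word2Vec embeddings trained separately on each language.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50
src_vectors = rng.normal(size=(100, D))   # placeholder: source-language word vectors
true_map = rng.normal(size=(D, D))
tgt_vectors = src_vectors @ true_map      # placeholder: vectors of their translations

# Learn a matrix W that maps source vectors onto target vectors via least squares.
W, *_ = np.linalg.lstsq(src_vectors, tgt_vectors, rcond=None)

# A new source-language vector can now be projected into the target space
# and matched against its nearest target-language neighbours.
projected = src_vectors[0] @ W
print(projected[:5])
```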
Limitations of Word2Vec and Improvement Methods
1. Context-Dependent Issues
In Word2Vec, the same word is always represented by the same vector, making it difficult to handle words with multiple meanings (polysemy). Models such as BERT, which generate context-dependent vectors, address this issue.
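As an illustration of context-dependent vectors, the sketch below uses the Hugging Face transformers library (assumed installed, along with PyTorch) to show that BERT assigns different vectors to the same word in different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return BERT's contextual vector for `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = embedding_of("i deposited money at the bank", "bank")
v2 = embedding_of("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```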
2. Out-of-Vocabulary Words
Word2Vec cannot handle words not present in the training data (out-of-vocabulary words). To overcome this, methods like FastText, which incorporates subword information, have been developed.
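A short sketch with gensim’s FastText implementation, using the same kind of toy corpus as before, shows how an unseen word still receives a vector.

```python
from gensim.models import FastText

sentences = [["the", "cat", "sits", "on", "the", "mat"],
             ["the", "dog", "sits", "on", "the", "rug"]]

# FastText builds word vectors from character n-grams (subwords).
model = FastText(sentences, vector_size=50, window=2, min_count=1)

# Even a word never seen during training gets a vector,
# composed from the subwords it shares with known words.
print(model.wv["catz"][:5])
```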
Summary
This episode explained the basics of Word2Vec, its training methods (CBOW and Skip-gram), and efficiency techniques such as Negative Sampling and Hierarchical Softmax. Word2Vec is a powerful technique for representing semantic relationships in vector space and is widely used in various NLP applications.
Next Episode Preview
Next time, we will explore the GloVe model, another approach to word embedding that uses co-occurrence matrices. We will also compare it with Word2Vec to highlight their differences.
Notes
- Distributional Hypothesis: The idea that words with similar meanings appear in similar contexts.
- One-Hot Encoding: A method of representing words as vectors where the vector length is equal to the vocabulary size, with a single position set to 1 and others set to 0.
- Negative Sampling: A technique to reduce computation costs by using only a few randomly sampled negative examples during training.