Glossary ~Deep Learning~

Convolutional Neural Network (CNN)

A convolutional neural network (CNN) is a type of neural network that is particularly suited to processing grid-structured data such as images and video frames. CNNs automatically learn spatial features of images (such as edges and textures), and are typically made up of convolutional layers, pooling layers, and fully connected layers. In a convolutional layer, a filter (kernel) slides over the image, processing one local region at a time to generate a feature map. In a pooling layer, the feature map is downsampled, which reduces computational cost and helps limit overfitting. CNNs perform very well in a variety of computer vision tasks, including image classification, object detection, face recognition, and medical image analysis. In addition to image data, CNNs are also sometimes applied to fields such as speech processing and natural language processing.
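The convolution and pooling steps described above can be sketched in plain Python. This is a minimal illustration, not a practical implementation: the function names, the 5x5 toy image, and the hand-written edge kernel are all assumptions made for the example, and a real CNN learns its kernels from data.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 sliding-window filter, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 5x5 image whose right side is bright, and a vertical-edge kernel.
image = [[0, 0, 0, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

feature_map = conv2d(image, edge_kernel)  # 3x3 map, non-zero where the edge is
pooled = max_pool2d(feature_map, size=2)  # downsampled to 1x1
```

The feature map responds only where the window straddles the dark-to-bright edge, and pooling keeps the strongest response while shrinking the map.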

Recurrent Neural Network (RNN)

A recurrent neural network (RNN) is a type of neural network designed to process time series or other sequential data. RNNs have a “loop structure” within the network: at each step they generate an output from the new input while retaining past information as an internal (hidden) state. This makes them suitable for tasks that depend on the temporal structure of data, such as speech recognition, natural language processing, and time series prediction. However, conventional RNNs are prone to the “vanishing gradient problem” when learning long-term dependencies, which makes long sequences difficult to handle. To overcome this, improved RNNs such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) have been developed, making it possible to learn longer-term dependencies.
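As a rough sketch of the loop structure, the following uses a single scalar hidden state. The weight values and the function name `rnn_forward` are illustrative assumptions; a real RNN uses learned weight matrices and vector states.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar RNN over a sequence and return the hidden state at each step."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Only the first input is non-zero, yet later states are still non-zero:
# the hidden state carries information forward through time.
states = rnn_forward([1.0, 0.0, 0.0])
```

The influence of the first input fades at each step, which hints at why plain RNNs struggle with long-term dependencies.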

Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) designed to learn long-term dependencies in sequential data. An LSTM has three gates: an input gate, a forget gate, and an output gate. It uses these to control the flow of information through a cell state, retaining important information and discarding what is no longer needed. This mitigates the “vanishing gradient problem” of conventional RNNs and allows information to be retained over long spans. LSTM has achieved excellent results in many application fields, including speech recognition, translation, time series prediction, and natural language processing. It is particularly effective in tasks with long contexts, such as text generation and machine translation.
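The three gates can be illustrated with a scalar LSTM step. This is a sketch under simplifying assumptions: every quantity is a single number rather than a vector, and the parameter dictionary and the helper `zero_params` are hypothetical names introduced only for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def zero_params():
    """Hypothetical helper: input weight (w), recurrent weight (u), bias (b) per gate."""
    return {name + gate: 0.0 for gate in ("f", "i", "o", "g") for name in ("w", "u", "b")}


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step returning the new hidden state and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as output
    return h, c


params = zero_params()
params["bf"] = 10.0   # forget gate close to 1: keep the old cell state
params["bi"] = -10.0  # input gate close to 0: ignore the new candidate
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=params)  # c stays close to 2.0
```

With the forget gate open and the input gate closed, the cell state passes through almost unchanged, which is how an LSTM can hold information over many steps.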

Gated Recurrent Unit (GRU)

The Gated Recurrent Unit (GRU) is a variant of the recurrent neural network (RNN) and, like the LSTM, is designed to process sequential data. Like the LSTM, the GRU has a “gating mechanism”, but its structure is simpler: whereas the LSTM has three gates (input gate, forget gate, output gate), the GRU has only two, an update gate and a reset gate. This simplicity means GRUs have fewer parameters, and they tend to have shorter training times. GRUs often perform as well as LSTMs, and they are widely used in the processing of various sequential data, such as time series prediction, natural language processing, and speech recognition. The choice between LSTMs and GRUs depends on the specific task and data.
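A scalar GRU step, sketched below, shows how the two gates interact. The parameter names are assumptions for illustration, and note that references differ on whether the update gate weights the old state or the candidate; this sketch uses the convention in which a larger update gate favors the candidate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds hypothetical weights for the two gates and the candidate."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # The reset gate decides how much of the old state feeds the candidate.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate interpolates between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand


params = {key: 0.0 for key in
          ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
params["bz"] = -10.0  # update gate close to 0: the old state is carried over
h_new = gru_step(x=1.0, h_prev=0.4, p=params)
```

Unlike the LSTM sketch, there is no separate cell state: the hidden state itself is the memory.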

Autoencoder

An autoencoder is a type of neural network for unsupervised learning that compresses input data and extracts features. An autoencoder usually consists of two parts: an encoder and a decoder. The encoder compresses the input data into a low-dimensional feature space (latent space), and the decoder reconstructs the original data from the compressed representation. Training the network to minimize the reconstruction error makes it learn the characteristic structure of the data, which can then be used for dimensionality reduction and noise removal. Because the input itself serves as the training target, autoencoders can learn patterns from unlabeled data. There are also extended models such as variational autoencoders (VAEs) and sparse autoencoders, which are used in various applications such as data generation and anomaly detection.

Variational Autoencoder (VAE)

A variational autoencoder (VAE) is a generative model that extends the autoencoder: it encodes input data into a latent space and can generate data from that space. The main feature of a VAE is that, unlike a normal autoencoder, it treats the latent variable as a probability distribution (usually a normal distribution). This makes it possible to not only reproduce data, but also generate new data. In a VAE, the encoder maps the input to the parameters of a distribution over the latent space (typically a mean and a variance), and a latent sample is drawn from that distribution. The decoder then reconstructs the data from the sample. VAEs are used in image generation, anomaly detection, and noise removal, and are attracting attention as generative models alongside GANs (Generative Adversarial Networks). The strength of VAEs is that they learn a latent space with a meaningful, well-defined distribution, so new data can be generated by sampling from it.
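The sampling step can be sketched as follows. The glossary does not say how the sample is drawn; this example assumes the commonly used formulation in which the encoder outputs a mean and a log-variance per latent dimension and the sample is written as mean plus scaled noise. The function name `sample_latent` is hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps, with eps ~ N(0, 1), for each latent dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)  # exp(0.5 * log_var) = sigma
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)

# Pretend these came from an encoder for one input (illustrative values only).
mu = [0.5, -1.0]
log_var = [0.0, -2.0]
z = sample_latent(mu, log_var, rng)  # a decoder would reconstruct data from z

# Generating new data: sample from the standard normal prior instead.
z_new = sample_latent([0.0, 0.0], [0.0, 0.0], rng)
```

Writing the sample this way keeps the randomness in `eps`, so the mean and variance remain ordinary values that gradients can flow through during training.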

Generative Adversarial Network (GAN)

A Generative Adversarial Network (GAN) is a model that generates realistic data by having two neural networks, a generator and a discriminator, learn while competing against each other. The generator generates data (e.g. images) from random noise, and the discriminator determines whether the generated data is real or fake. By repeating this adversarial process, the generator becomes able to generate data that is closer to the real thing. GANs have been very successful in fields such as image generation, video generation, and style transfer, and have been applied to tasks such as painting-style image conversion and high-resolution image generation. However, GAN training is prone to instability and sometimes does not converge, so adjustments are required during training. GANs are notable for the fact that the data they generate is very realistic, and they are attracting attention as a type of generative model.

Transformers

Transformers are a neural network architecture, originally developed for natural language processing (NLP), that is widely used as an alternative to conventional recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks. Transformers are built around a self-attention mechanism and can efficiently process sequential data. Traditional RNNs and LSTMs must process a sequence one step at a time, which limits parallelism, whereas Transformers compute the dependencies between all input tokens in parallel and so make much better use of parallel hardware. Transformers have been successful in a wide range of NLP tasks, including machine translation, text generation, and document classification, and large-scale models such as Google’s BERT and OpenAI’s GPT are also based on Transformers. Transformers are characterized by their ability to learn long-range context and to perform parallel computation.

Attention Mechanism

The attention mechanism is a method that enables a neural network to focus its “attention” on the particularly important parts of the input data. It is particularly effective in natural language processing and machine translation tasks. Rather than processing all the input at once, the model weights each input based on its importance and focuses on the more relevant information. This allows it to extract and process the important parts of long sequences and contextualized data. The attention mechanism is a core component of transformer models, and it efficiently learns the dependencies between input tokens, especially through self-attention. The attention mechanism has been successful in various NLP tasks such as machine translation, summarization, and question answering, and is extremely important for understanding context.

Self-Attention

Self-Attention is a type of attention mechanism that learns how important each element in the input sequence is to other elements. Self-attention plays a central role in Transformers, and it calculates how much each token depends on all other tokens. This allows us to more accurately capture the meaning and role of each token, taking into account the entire context. Unlike conventional RNNs and LSTMs, self-attention can process all inputs in parallel, so it can operate efficiently even on long sequences. In particular, in tasks such as machine translation and text generation, it is possible to perform translation and generation while looking at the entire context, which leads to high-precision results. Self-attention is an essential technology for models such as Transformers, BERT, and GPT, and has greatly improved model performance in NLP.
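A minimal sketch of self-attention in plain Python follows. It assumes the scaled dot-product form used in Transformers and, to stay short, uses the token vectors directly as queries, keys, and values; real models first apply learned projection matrices. The function names are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Return (outputs, weights) for a list of equal-length token vectors."""
    d = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        w = softmax(scores)  # how much this token attends to each token
        weights.append(w)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([
            sum(w[j] * tokens[j][c] for j in range(len(tokens)))
            for c in range(d)
        ])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
# weights[0] gives more weight to tokens 0 and 2 (same direction) than to token 1.
```

Each token's row of weights is computed independently of the others, which is why all positions can be processed in parallel. The cost of comparing every token with every other token grows quadratically with sequence length.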

Multi-Head Attention

Multi-Head Attention is an extension of the attention mechanism in Transformers, a method that allows multiple different types of attention to be applied to input data in parallel, enabling the learning of dependencies between data from different perspectives. Specifically, the attention is divided into multiple heads, and the attention is calculated independently for each head, after which the results are combined to obtain the final output. This allows the model to simultaneously capture relationships in different feature spaces, and to learn richer representations. For example, it is possible to consider the relationships between different words in a sentence from multiple perspectives. Multi-head attention is very effective for natural language processing (NLP) tasks, and it demonstrates high performance in machine translation, text generation, and document classification. It is also an important element of transformer models, and it is used in models such as BERT and GPT.

Positional Encoding

Positional Encoding is a method used in transformer models to convey positional information in sequential data (e.g. sentences or time series data) to the model. Transformers process all tokens in parallel and have no built-in notion of input order, so positional encoding is used to encode the position of each element. Specifically, a vector of values computed from sine and cosine waves of different frequencies is added to each input embedding, so that the position of each element is retained. This enables the transformer to learn while taking into account the positional relationships of words in the sentence. Positional encoding plays an important role in natural language processing tasks such as contextual understanding, machine translation, and text generation, and is essential for accurately capturing the dependency relationships between words in a sentence.
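The sine and cosine scheme can be sketched as below. The glossary does not give the exact formula; this example assumes the sinusoidal form from the original Transformer paper, in which even dimensions use sine, odd dimensions use cosine, and the base constant is 10000. The toy embeddings are made up for illustration.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position values."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=3, d_model=4)

# The encoding is added element-wise to the token embeddings.
embeddings = [[0.1, 0.2, 0.3, 0.4]] * 3  # three identical toy embeddings
inputs = [
    [e + p for e, p in zip(emb, pos_row)]
    for emb, pos_row in zip(embeddings, pe)
]
```

After the addition, the three identical embeddings become distinguishable, so the model can tell which token came first.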

Residual Network (ResNet)

Residual Networks (ResNet) are models designed to make very deep neural networks trainable, and since their introduction in 2015 they have achieved groundbreaking results in the field of image recognition. The defining feature of ResNet is a structure called the “residual block”, which introduces a skip connection (shortcut connection) that adds the input of the block directly to its output. Skip connections mitigate the vanishing gradient problem even in deep models, allowing training to proceed more efficiently. This structure makes it possible to train very deep models (several hundred layers), allowing them to learn more complex data patterns. ResNet has been widely used since its success in the ImageNet competition, and is still used today for many visual tasks such as image classification, object detection, and image segmentation.
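The skip connection can be sketched with a tiny block. As a simplifying assumption, this example uses small fully connected layers on a vector, whereas the actual ResNet blocks use convolutions and batch normalization; the function names are illustrative.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(v, weights):
    """Multiply vector v by a weight matrix given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two small linear layers."""
    out = relu(linear(x, w1))
    out = linear(out, w2)
    # Skip connection: add the block input to the block output.
    return relu([o + xi for o, xi in zip(out, x)])


x = [1.0, 2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]

# Even if the layers contribute nothing, the input passes through unchanged.
y = residual_block(x, zero_weights, zero_weights)
```

Because the block only has to learn a correction to its input, stacking many blocks does not block the path along which signals and gradients travel.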

DenseNet

DenseNet is a type of neural network model that has a structure in which the output of each layer is used as the input for all subsequent layers. This “dense connection” allows the model to transmit information more efficiently and mitigates the vanishing gradient problem. In DenseNet, each layer can refer to the output of all previous layers, so it is possible to reuse features and achieve high expressiveness while keeping the number of parameters low. In addition, because information flows between layers more directly than in conventional models, even deep networks can be trained efficiently. DenseNet is widely used in fields such as image recognition, object detection, and image segmentation, and like ResNet, it is a model that has brought about important developments in deep learning. Furthermore, DenseNet is characterized by its parameter efficiency: it can reach high accuracy with relatively few parameters.

MobileNet

MobileNet is a lightweight and highly efficient neural network model designed for use in resource-constrained environments, such as mobile devices and embedded systems. The main feature of MobileNet is that it uses a method called depthwise separable convolution, which splits the normal convolution operation into two stages and greatly reduces the amount of computation. Specifically, a depthwise convolution first filters each input channel separately, and a pointwise (1×1) convolution then combines the channels, so that useful features can be extracted with far fewer computations. MobileNet is particularly suited to tasks such as image recognition and object detection, and is widely used in AI applications that run in real time on smartphones and IoT devices. In addition, it is highly computationally efficient and has a small model size, so it is possible to provide advanced AI functionality even in environments with limited resources.
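The saving from splitting the convolution into two stages can be checked by counting multiplications. This sketch assumes stride 1 and padding that keeps the output the same size as the input; the layer dimensions are arbitrary example values.

```python
def standard_conv_mults(k, c_in, c_out, h, w):
    """Multiplications for a k x k convolution producing an h x w output."""
    return k * k * c_in * c_out * h * w


def separable_conv_mults(k, c_in, c_out, h, w):
    """Multiplications for a depthwise k x k stage plus a pointwise 1 x 1 stage."""
    depthwise = k * k * c_in * h * w  # one k x k filter per input channel
    pointwise = c_in * c_out * h * w  # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Example layer: 3x3 kernel, 32 -> 64 channels, 56x56 feature map.
k, c_in, c_out, h, w = 3, 32, 64, 56, 56
ratio = (separable_conv_mults(k, c_in, c_out, h, w)
         / standard_conv_mults(k, c_in, c_out, h, w))
# The ratio simplifies to 1/c_out + 1/k**2, roughly 0.13 for this layer.
```

For a 3x3 kernel the separable form needs roughly one eighth to one ninth of the multiplications, and the saving grows with the number of output channels.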

EfficientNet

EfficientNet is a family of models designed to scale neural networks efficiently, with the aim of achieving high accuracy with fewer computing resources. EfficientNet’s distinguishing feature is that it uses a method called “compound scaling” to scale the depth, width, and input resolution of the model together in a unified manner. While previous approaches typically scaled these dimensions independently, EfficientNet achieves a significant improvement in computational efficiency by balancing the expansion of all three. EfficientNet achieves high accuracy with fewer parameters than other models in benchmarks such as ImageNet, and it demonstrates excellent performance, especially in tasks such as image classification and object detection. It can also operate efficiently on mobile and edge devices with limited resources, and is therefore being applied to a wide range of practical applications.
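Compound scaling can be sketched as a single coefficient that drives all three dimensions. The glossary gives no numbers; the base values below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the ones reported in the EfficientNet paper, and the function names are illustrative.

```python
# Base multipliers as reported in the EfficientNet paper (not from this glossary).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Multipliers for depth, width and resolution at compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi):
    """Approximate growth in compute: linear in depth, quadratic in width and resolution."""
    s = compound_scale(phi)
    return s["depth"] * s["width"] ** 2 * s["resolution"] ** 2


for phi in range(4):
    print(phi, compound_scale(phi), round(flops_factor(phi), 2))
```

The base values are chosen so that each increment of `phi` roughly doubles the compute budget, which keeps the three dimensions growing in balance.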

Semantic Segmentation

Semantic Segmentation is the task of assigning each pixel in an image to a class to which it belongs. For example, it identifies what objects in an image are, such as cars, roads, and trees, on a pixel-by-pixel basis, and groups pixels that belong to the same class together. This method provides more detailed information than object detection because it requires the recognition of the fine contours of objects. Semantic segmentation is widely used in fields that require detailed analysis of visual information, such as autonomous driving, medical image analysis (segmentation of organs and lesions), and robot vision. In general, architectures based on convolutional neural networks (CNNs) are used, and U-Net and Fully Convolutional Networks (FCN) are known as typical models. However, in semantic segmentation, objects that belong to the same class are not distinguished, but are grouped together as a single class.

Instance Segmentation

Instance segmentation is an extension of semantic segmentation, and is a task that not only classifies each object in an image at the pixel level, but also identifies different objects (instances) within the same class individually. For example, if there are multiple cars in an image, semantic segmentation will label all of them as the same class, but instance segmentation will recognize each car as a separate instance. This makes it possible to analyze each object in the image in more detail. Instance segmentation is used in many application fields, such as automated driving, surveillance systems, and medical image analysis, and algorithms such as Mask R-CNN are often used. This method is an approach that combines object detection and semantic segmentation to achieve more precise object recognition.

Batch Normalization

Batch Normalization is a method for stabilizing the training of neural networks and speeding up convergence, and it is particularly effective for deep networks. For each mini-batch, batch normalization normalizes the inputs to a layer to zero mean and unit variance and then applies a learnable scale and shift, which keeps the distribution of activations stable during training. This mitigates the vanishing and exploding gradient problems, and speeds up learning. Batch normalization also makes the model less sensitive to the learning rate, so stable training is possible even at high learning rates. In addition, batch normalization has a regularization effect, which helps to prevent overfitting. Batch normalization is widely used in various models, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), and often improves model accuracy and training efficiency.
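The per-mini-batch normalization can be sketched as follows. This shows training-time behavior only and uses a single scalar `gamma` and `beta` for brevity; in practice these are learned per feature, and running averages of the statistics are used at inference time.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature across the samples of a mini-batch.

    batch is a list of samples, each a list of feature values.
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for f in range(num_features):
        column = [sample[f] for sample in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            normalized = (column[i] - mean) / math.sqrt(var + eps)
            out[i][f] = gamma * normalized + beta  # scale and shift
    return out


# Two features on very different scales end up on the same scale.
batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(batch)
```

Because the statistics come from the batch, the result for one sample depends on the other samples in the same mini-batch.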

Layer Normalization

Layer Normalization is a normalization method that stabilizes the training of neural networks. Unlike batch normalization, which normalizes each feature across the samples in a mini-batch, it normalizes across the features of a layer for each sample individually. Specifically, it computes the mean and variance over all neurons in the layer for a single input and uses them to rescale that layer's outputs, so that the neurons operate on a common scale. This technique is particularly effective for recurrent neural networks (RNNs) and transformer models. While batch normalization depends on the batch size, layer normalization has no batch size constraint and can stabilize learning in situations such as small batches and real-time (one sample at a time) processing. Layer normalization is widely used in the fields of natural language processing (NLP) and reinforcement learning, and it plays a role in stabilizing training and improving convergence speed, especially in models that handle sequence data.
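A sketch of layer normalization, for contrast with batch normalization: the statistics are computed over the features of one sample, not over the batch. The scalar `gamma` and `beta` are a simplification; they are normally learned per feature.

```python
import math


def layer_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each sample across its own features, independently of the batch."""
    out = []
    for sample in batch:
        mean = sum(sample) / len(sample)
        var = sum((v - mean) ** 2 for v in sample) / len(sample)
        out.append([
            gamma * (v - mean) / math.sqrt(var + eps) + beta
            for v in sample
        ])
    return out


a = [1.0, 2.0, 3.0]
b = [10.0, 0.0, -10.0]

# The result for sample a is the same whether or not b is in the batch.
alone = layer_norm([a])[0]
together = layer_norm([a, b])[0]
```

This independence from the other samples is what makes the method usable with a batch size of one.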

Dropout

Dropout is a regularization method used to prevent overfitting in neural networks, and is achieved by randomly disabling (setting the output to zero) some of the neurons during training. Specifically, at each training step, a random subset of nodes in the network is chosen and their outputs are set to zero. This prevents the network from becoming overly dependent on specific neurons, and improves the generalization ability of the model. Dropout is particularly effective with deep neural networks, reducing the complexity of the model and preventing it from becoming overly adapted to the training data. The dropout inactivation rate (usually around 0.2 to 0.5) is set as a hyperparameter, and all nodes are used for prediction during inference. Dropout is used in a wide range of fields, including image recognition and natural language processing.
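A minimal sketch of dropout follows. It assumes the common "inverted dropout" variant, in which the surviving activations are scaled up during training so that nothing needs to change at inference time; the glossary does not specify this detail, and the function signature is illustrative.

```python
import random


def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)  # inference: all nodes are used as-is
    keep = 1.0 - rate
    # Survivors are divided by `keep` so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)   # some zeros, rest doubled
infer_out = dropout(hidden, rate=0.5, training=False, rng=rng)  # unchanged
```

A different random subset is dropped on every call, so no single neuron can be relied on throughout training.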

Activation Function

An activation function is a non-linear function used in neural networks to determine the output of each neuron. Activation functions play a role in introducing non-linearity into the model when processing input signals and calculating the output of neurons. This allows the network to learn complex patterns and features that go beyond simple linear relationships. Typical activation functions include the sigmoid function, ReLU (Rectified Linear Unit), and tanh (hyperbolic tangent function). Choosing the right activation function has a significant impact on the learning efficiency and performance of the neural network, so it is important to select the right one for the task. For example, ReLU is widely used in image processing, and the sigmoid function and softmax function are used in classification problems.

ReLU (Rectified Linear Unit)

ReLU (Rectified Linear Unit) is one of the most commonly used activation functions in neural networks, and is particularly effective in deep learning. The mathematical formula for ReLU is very simple: if the input is positive, it is output as is, and if it is negative, 0 is output. Specifically, the function is defined as follows:
\[ f(x) = \max(0, x) \]
This nonlinearity allows ReLU to make the model learn complex patterns. The main advantage of ReLU is that it is less prone to gradient vanishing problems than other activation functions (such as the sigmoid function or tanh). It is also computationally very light, and has the advantage of fast training convergence. However, it always returns 0 for negative inputs, so you need to be careful about the “dead ReLU” problem (a situation where a specific neuron always outputs 0 during training). To solve this problem, variants of ReLU such as Leaky ReLU and Parametric ReLU have been proposed.
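As a small sketch, ReLU and its gradient can be written directly from the definition. The gradient at exactly zero is taken to be 0 here, which is a common convention rather than something the glossary specifies.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


values = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(v) for v in values]        # negatives become 0, positives pass through
gradients = [relu_grad(v) for v in values]  # no gradient flows for negative inputs
```

The zero gradient for negative inputs is the source of the “dead ReLU” problem: a neuron whose input is always negative receives no updates.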

Leaky ReLU

Leaky ReLU is a variant of ReLU, an activation function developed to alleviate the “dead ReLU” problem of ReLU. While a normal ReLU always outputs 0 for negative inputs, Leaky ReLU allows for small gradients even for negative inputs. Specifically, by introducing a small negative slope for negative inputs, it prevents the neuron from becoming completely “dead”. The formula for Leaky ReLU is as follows:
\[ f(x) = \max(0.01x, x) \]
Here, 0.01 is the slope applied to negative inputs, which can be adjusted as a hyperparameter. Because its gradient is non-zero for negative inputs, Leaky ReLU can keep learning where a conventional ReLU unit would stop updating, which tends to make training more stable. In particular, it helps keep gradients from vanishing in deep learning models. Leaky ReLU is used in a wide range of tasks, including image recognition and natural language processing.
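A short sketch of Leaky ReLU with the slope exposed as a parameter. The parameter name `alpha` is an assumption for illustration, and the `max` form is valid for slopes between 0 and 1.

```python
def leaky_relu(x, alpha=0.01):
    """max(alpha * x, x): identity for positive inputs, small slope for negative ones."""
    return max(alpha * x, x)


def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for positive inputs, alpha otherwise (never exactly zero)."""
    return 1.0 if x > 0 else alpha


negative_output = leaky_relu(-2.0)        # small negative value instead of 0
negative_gradient = leaky_relu_grad(-2.0)  # gradient is alpha, so learning continues
```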

Sigmoid Function

The sigmoid function is a commonly used activation function in neural networks, and is a nonlinear function that maps any real input to a value between 0 and 1. The sigmoid function is defined as follows:
\[ f(x) = \frac{1}{1 + e^{-x}} \]
This function has the property that the output approaches 1 as the input becomes a large positive value, and approaches 0 as the input becomes a large negative value. For this reason, the sigmoid function is often used in binary classification problems, and because the output can be interpreted as a probability, it is also used in applications such as logistic regression. However, the sigmoid function has some drawbacks. In particular, when the input is very large in magnitude (strongly positive or strongly negative), the gradient becomes very small, and the “vanishing gradient problem” occurs, which makes learning difficult. Currently, activation functions such as ReLU are mainstream for hidden layers in deep learning, but the sigmoid function is still often used in the output layer of binary classification tasks.
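The saturation behavior can be seen numerically in a short sketch. The branch on the sign of the input is an implementation choice made here to avoid overflow for large negative inputs; it is not part of the definition.

```python
import math


def sigmoid(x):
    """Numerically stable sigmoid: avoids exp() overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative s(x) * (1 - s(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient is largest at zero and shrinks rapidly toward both ends, which is the saturation that leads to vanishing gradients.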

Tanh Function

The tanh function is a type of activation function used in neural networks, and is a nonlinear function similar to the sigmoid function. However, while the sigmoid function limits the output range to between 0 and 1, the tanh function scales the output to the range between -1 and 1. In mathematical terms, the tanh function is defined as follows:
\[ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \]
This function approaches 1 when the input is a large positive value, and approaches -1 when the input is a large negative value. The characteristic of the tanh function is that its output is centered around zero. This improves the gradient flow compared to the sigmoid function, making learning easier. However, just like the sigmoid function, the tanh function can also cause the vanishing gradient problem, because its gradient becomes small for inputs of large magnitude. At present, other activation functions such as ReLU are the mainstream in deep learning, but the tanh function is still used in some tasks.
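A brief sketch comparing the definition with the standard library function, together with the derivative. The helper names are illustrative.

```python
import math


def tanh_from_definition(x):
    """tanh written out from its exponential definition (fine for moderate |x|)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def tanh_grad(x):
    """Derivative 1 - tanh(x)^2; its maximum is 1 at x = 0."""
    return 1.0 - math.tanh(x) ** 2


center_gradient = tanh_grad(0.0)  # 1.0, four times the sigmoid's peak of 0.25
tail_gradient = tanh_grad(5.0)    # very small: the function has saturated
```

The larger peak gradient and the zero-centered output are the reasons tanh often trains more easily than the sigmoid, while the small tail gradient shows it still saturates.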

Softmax Function

The softmax function is an activation function that is mainly used in the output layer of classification problems, where it outputs the probability of each class in multi-class classification. The softmax function converts the input real values into probability values in the range 0 to 1, normalized so that they sum to 1. Expressed in mathematical terms, the softmax function is defined as follows:
\[ \mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \]
where x_i is the output value (score) for class i and n is the total number of classes. The probability assigned to each class depends on how high its score is compared to the scores of the other classes. The softmax function is often used in multi-class classification problems together with the cross-entropy loss function, and serves as the final layer of the classifier. For example, in image recognition, if an image is to be classified as a cat, dog or bird, the softmax function is used to calculate the probability of each class, and the class with the highest probability is selected.
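The cat, dog, and bird example can be sketched as follows. The scores are made-up values, and subtracting the maximum score before exponentiating is a common numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # shifting by the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog", "bird"]
scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the final layer

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]  # class with the highest probability
```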

Vanishing Gradient Problem

The vanishing gradient problem tends to occur in deep learning, particularly in recurrent neural networks (RNNs) and deep neural networks. It is a phenomenon in which the gradient (which determines the size of each parameter update) becomes smaller and smaller as it is propagated back through the layers during backpropagation. If the gradient is too small, the parameters of the early layers are barely updated and learning stalls. This problem is particularly noticeable with saturating activation functions such as the sigmoid function or the tanh function, whose derivatives are small over much of their input range. The vanishing gradient problem is particularly serious when learning data with long-term dependencies (e.g. time series data or long sequences). To address it, activation functions whose gradients do not shrink as easily, such as ReLU, have become widely used. In addition, special RNN architectures such as LSTM and GRU have been developed to address the vanishing gradient problem.
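The shrinking effect can be sketched with a chain of scalar sigmoid layers. This is a deliberately simplified model: it assumes one neuron per layer, the same weight in every layer, and every pre-activation at 0, where the sigmoid derivative takes its largest value of 0.25.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_scale(num_layers, weight=1.0, pre_activation=0.0):
    """Factor by which a gradient is scaled after passing back through the chain."""
    scale = 1.0
    for _ in range(num_layers):
        # The chain rule multiplies by weight * activation derivative per layer.
        scale *= weight * sigmoid_grad(pre_activation)
    return scale


for depth in (1, 5, 10, 20):
    print(depth, backprop_scale(depth))
```

Even in this best case for the sigmoid, the factor falls below one millionth after ten layers, so the earliest layers receive almost no learning signal.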

Exploding Gradient Problem

The exploding gradient problem is the opposite of the vanishing gradient problem, and refers to the phenomenon that occurs during training using the backpropagation method in deep learning, where the gradient becomes extremely large and the amount of parameter update increases excessively. This phenomenon is particularly likely to occur in RNNs and very deep neural networks, and can cause the model training to become unstable due to excessively large gradients, and learning to progress incorrectly. Gradient explosion is also likely to occur when the learning rate is too large or when the network is not initialized properly. To prevent gradient explosion, the gradient clipping method is often used. In this method, when the gradient exceeds a certain threshold, the value is limited to prevent the gradient from becoming excessively large. Gradient explosion has a negative impact on the accuracy and stability of the model, so it is an important issue, especially for deep neural networks and models that process long sequences.
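Gradient clipping can be sketched as below. The glossary describes limiting the gradient when it exceeds a threshold; this example assumes the common variant that rescales the whole gradient vector by its norm, which preserves its direction. Clipping each value separately is another option.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)  # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grads]


exploding = [300.0, -400.0]  # norm 500
clipped = clip_by_norm(exploding, max_norm=1.0)  # same direction, norm 1

small = [0.3, 0.4]  # norm 0.5
unchanged = clip_by_norm(small, max_norm=1.0)  # left as it is
```

The threshold itself is a hyperparameter, chosen so that ordinary updates pass through untouched and only rare spikes are limited.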

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
