Introduction
In my previous article, I discussed one of the earliest Deep Learning approaches for image captioning. If you’re interested in reading it, you can find the link to that article at the end of this one.
Today, I would like to talk about Image Captioning again, but this time with a more advanced neural network architecture. The deep learning model I’m going to talk about is the one proposed in the paper titled “CPTR: Full Transformer Network for Image Captioning,” written by Liu et al. back in 2021 [1]. Specifically, here I will reproduce the model proposed in the paper and explain the underlying theory behind the architecture. However, keep in mind that I won’t actually demonstrate the training process since I only want to focus on the model architecture.
The idea behind CPTR
In fact, the main idea of the CPTR architecture is exactly the same as that of the earlier image captioning model, as both use the encoder-decoder structure. Previously, in the paper titled “Show and Tell: A Neural Image Caption Generator” [2], the models used were GoogLeNet (a.k.a. Inception V1) and LSTM for the two components, respectively. The illustration of the model proposed in the Show and Tell paper is shown in the following figure.

Despite having the same encoder-decoder structure, what makes CPTR different from the previous approach is the basis of the encoder and the decoder themselves. In CPTR, we combine the encoder part of the ViT (Vision Transformer) model with the decoder part of the original Transformer model. The use of a transformer-based architecture for both components is essentially where the name CPTR comes from: CaPtion TransformeR.
Note that the discussions in this article are going to be highly related to ViT and Transformer, so I highly recommend you read my previous articles about these two topics if you’re not yet familiar with them. You can find the links at the end of this article.
Figure 2 shows what the original ViT architecture looks like. Everything inside the green box is the encoder part of the architecture to be adopted as the CPTR encoder.

Next, Figure 3 displays the original Transformer architecture. The components enclosed in the blue box are the layers that we’re going to implement in the CPTR decoder.

If we combine the components inside the green and blue boxes above, we obtain the architecture shown in Figure 4 below. This is exactly what the CPTR model we’re going to implement looks like. The idea here is that the ViT Encoder (green) encodes the input image into a tensor representation, which is then used by the Transformer Decoder (blue) as the basis for generating the corresponding caption.

That’s pretty much everything you need to know for now. I’ll explain more about the details as we go through the implementation.
Module imports & parameter configuration
As always, the first thing we need to do in the code is to import the required modules. In this case, we only import torch and torch.nn since we’re about to implement the model from scratch.
# Codeblock 1
import torch
import torch.nn as nn
Next, we’re going to initialize some parameters in Codeblock 2. If you have read my previous article about image captioning with GoogLeNet and LSTM, you’ll notice that here we have a lot more parameters to initialize. In this article, I want to reproduce the CPTR model as closely as possible to the original one, so the parameters mentioned in the paper will be used in this implementation.
# Codeblock 2
BATCH_SIZE = 1 #(1)
IMAGE_SIZE = 384 #(2)
IN_CHANNELS = 3 #(3)
SEQ_LENGTH = 30 #(4)
VOCAB_SIZE = 10000 #(5)
EMBED_DIM = 768 #(6)
PATCH_SIZE = 16 #(7)
NUM_PATCHES = (IMAGE_SIZE//PATCH_SIZE) ** 2 #(8)
NUM_ENCODER_BLOCKS = 12 #(9)
NUM_DECODER_BLOCKS = 4 #(10)
NUM_HEADS = 12 #(11)
HIDDEN_DIM = EMBED_DIM * 4 #(12)
DROP_PROB = 0.1 #(13)
The first parameter I want to explain is BATCH_SIZE, which is written at the line marked with #(1). The number assigned to this variable is not really important in our case since we are not actually going to train this model. This parameter is set to 1 because, by default, PyTorch treats input tensors as a batch of samples. Here I assume that we only have a single sample in a batch.
Next, remember that in the case of image captioning we are dealing with images and texts simultaneously. This essentially means that we need to set the parameters for the two. It is mentioned in the paper that the model accepts an RGB image of size 384×384 for the encoder input. Hence, we assign the values for the IMAGE_SIZE and IN_CHANNELS variables based on this information (#(2) and #(3)). On the other hand, the paper does not mention the parameters for the captions. So, here I assume that the length of the caption is no more than 30 words (#(4)), with the vocabulary size estimated at 10000 unique words (#(5)).
The remaining parameters are related to the model configuration. Here we set the EMBED_DIM variable to 768 (#(6)). On the encoder side, this number indicates the length of the feature vector that represents each 16×16 image patch (#(7)). The same concept also applies to the decoder side, but in that case the feature vector represents a single word in the caption. Talking more specifically about the PATCH_SIZE parameter, we are going to use the value to compute the total number of patches in the input image. Since the image has the size of 384×384, there will be (384/16)² = 576 patches in total (#(8)).
When it comes to using an encoder-decoder architecture, it is possible to specify the number of encoder and decoder blocks to be used. Using more blocks typically allows the model to perform better in terms of accuracy, yet in return it requires more computational power. The authors of this paper decided to stack 12 encoder blocks (#(9)) and 4 decoder blocks (#(10)). Next, since CPTR is a transformer-based model, it is necessary to specify the number of attention heads within the attention blocks inside the encoders and the decoders, for which the authors use 12 attention heads (#(11)). The value for the HIDDEN_DIM parameter is not mentioned anywhere in the paper. However, according to the ViT and the Transformer papers, this parameter is configured to be 4 times larger than EMBED_DIM (#(12)). The dropout rate is not mentioned in the paper either. Hence, I arbitrarily set DROP_PROB to 0.1 (#(13)).
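As a quick sanity check on the numbers above, the derived quantities can be recomputed in plain Python. This is only a sketch: the constants are copied from Codeblock 2, while the names patches_per_side, flat_patch_dim and head_dim are introduced here for illustration and are not part of the original code.

```python
# Constants copied from Codeblock 2.
IMAGE_SIZE = 384
IN_CHANNELS = 3
PATCH_SIZE = 16
EMBED_DIM = 768
NUM_HEADS = 12

# Hypothetical helper names, used only in this sketch.
patches_per_side = IMAGE_SIZE // PATCH_SIZE           # patches along one image edge
num_patches = patches_per_side ** 2                   # total patches per image
flat_patch_dim = IN_CHANNELS * PATCH_SIZE ** 2        # length of one flattened patch
head_dim = EMBED_DIM // NUM_HEADS                     # feature size handled by each head
hidden_dim = EMBED_DIM * 4                            # FFN hidden size

# The embedding size must split evenly across the attention heads.
assert EMBED_DIM % NUM_HEADS == 0

print(patches_per_side, num_patches, flat_patch_dim, head_dim, hidden_dim)
```

Note that a flattened 3×16×16 patch happens to have exactly 768 values, the same as EMBED_DIM, which is why the linear projection later in the article does not change the tensor shape.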
Encoder
Now that the modules and parameters have been set up, we can get into the encoder part of the network. In this section we’re going to implement and explain every single component inside the green box in Figure 4 one by one.
Patch embedding

You can see in Figure 5 above that the first step is dividing the input image into patches. This is essentially done because, instead of focusing on local patterns like CNNs, ViT captures global context by learning the relationships between these patches. We can model this process with the Patcher class shown in Codeblock 3 below. For the sake of simplicity, here I also include the process inside the patch embedding block within the same class.
# Codeblock 3
class Patcher(nn.Module):
    def __init__(self):
        super().__init__()
        #(1)
        self.unfold = nn.Unfold(kernel_size=PATCH_SIZE, stride=PATCH_SIZE)
        #(2)
        self.linear_projection = nn.Linear(in_features=IN_CHANNELS*PATCH_SIZE*PATCH_SIZE,
                                           out_features=EMBED_DIM)

    def forward(self, images):
        print(f'images\t\t: {images.size()}')
        images = self.unfold(images) #(3)
        print(f'after unfold\t: {images.size()}')
        images = images.permute(0, 2, 1) #(4)
        print(f'after permute\t: {images.size()}')
        features = self.linear_projection(images) #(5)
        print(f'after lin proj\t: {features.size()}')
        return features
The patching itself is done using the nn.Unfold layer (#(1)). Here we need to set both the kernel_size and stride parameters to PATCH_SIZE (16) so that the resulting patches do not overlap with each other. This layer also automatically flattens these patches once it is applied to the input image. Meanwhile, the nn.Linear layer (#(2)) is employed to perform linear projection, i.e., the process done by the patch embedding block. By setting the out_features parameter to EMBED_DIM, this layer maps every single flattened patch into a feature vector of length 768.
The whole process should make more sense once you read the forward() method. You can see at line #(3) in the same codeblock that the input image is directly processed by the unfold layer. Next, we need to process the resulting tensor with the permute() method (#(4)) to swap the 1st and 2nd axes (counting the batch axis as the 0th) before feeding it to the linear_projection layer (#(5)). Additionally, here I also print out the tensor dimension after each layer so that you can better understand the transformation made at each step.
In order to check if our Patcher class works properly, we can just pass a dummy tensor through the network. Look at Codeblock 4 below to see how I do it.
# Codeblock 4
patcher = Patcher()
images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = patcher(images)
# Codeblock 4 Output
images          : torch.Size([1, 3, 384, 384])
after unfold    : torch.Size([1, 768, 576]) #(1)
after permute   : torch.Size([1, 576, 768]) #(2)
after lin proj  : torch.Size([1, 576, 768]) #(3)
The tensor I passed above represents an RGB image of size 384×384. Here we can see that after the unfold operation is performed, the tensor dimension changes to 1×768×576 (#(1)), denoting the flattened 3×16×16 patch for each of the 576 patches. Unfortunately, this output shape does not match what we need. Remember that in ViT we treat image patches as a sequence, so we need to swap the 1st and 2nd axes because typically the 1st dimension of a tensor represents the temporal axis, while the 2nd one represents the feature vector of each timestep. After the permute() operation is performed, our tensor has the dimension of 1×576×768 (#(2)). Lastly, we pass this tensor through the linear projection layer, after which the tensor shape remains the same since we set the EMBED_DIM parameter to the same size (768) (#(3)). Despite having the same dimension, the information contained in the final tensor should be richer thanks to the transformation applied by the trainable weights of the linear projection layer.
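To make the patching step less abstract, here is a minimal pure-Python sketch of what non-overlapping patch extraction does on a toy image. The function unfold_patches is a hypothetical helper written for this illustration (it is not a PyTorch API); it flattens each patch channel by channel, which to my understanding is the same ordering that nn.Unfold followed by permute(0, 2, 1) produces.

```python
def unfold_patches(image, patch_size):
    """Split a [C][H][W] nested list into flattened, non-overlapping patches.

    Returns a list of patches in row-major order, each of length
    C * patch_size * patch_size (channel first, then rows, then columns).
    """
    channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    patches = []
    for top in range(0, height, patch_size):          # stride == patch size
        for left in range(0, width, patch_size):
            flat = []
            for c in range(channels):
                for dy in range(patch_size):
                    for dx in range(patch_size):
                        flat.append(image[c][top + dy][left + dx])
            patches.append(flat)
    return patches


# Toy input: 2 channels, 4x4 pixels; each value encodes its channel and position.
toy_image = [[[c * 100 + row * 4 + col for col in range(4)]
              for row in range(4)]
             for c in range(2)]

toy_patches = unfold_patches(toy_image, patch_size=2)
print(len(toy_patches), len(toy_patches[0]))   # number of patches, flattened length
print(toy_patches[0])                          # top-left patch, both channels
```

With the real configuration (3 channels, 384×384 pixels, patch size 16) the same procedure yields 576 patches of length 768, which is the 576×768 layout that the linear projection layer expects.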
Learnable positional embedding

After the input image has successfully been converted into a sequence of patches, the next thing to do is to inject the so-called positional embedding tensor. This is essentially done because a transformer without positional embedding is permutation-invariant, meaning that it treats the input sequence as if the order of its elements does not matter. Interestingly, since an image is not a literal sequence, we should set the positional embedding to be learnable so that it can, in a sense, reorder the patch sequence in the way that it thinks works best for representing the spatial information. However, keep in mind that the term “reordering” here does not mean that we physically rearrange the sequence. Rather, it does so by adjusting the embedding weights.
The implementation is pretty simple. All we need to do is to initialize a tensor using nn.Parameter whose dimension is set to match the output of the Patcher model, i.e., 576×768. Also, don’t forget to write requires_grad=True just to make it explicit that the tensor is trainable (this is already the default for nn.Parameter). Look at Codeblock 5 below for the details.
# Codeblock 5
class LearnableEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.learnable_embedding = nn.Parameter(torch.randn(size=(NUM_PATCHES, EMBED_DIM)),
                                                requires_grad=True)

    def forward(self):
        pos_embed = self.learnable_embedding
        print(f'learnable embedding\t: {pos_embed.size()}')
        return pos_embed
Now let’s run the following codeblock to see whether our LearnableEmbedding class works properly. You can see in the printed output that it successfully creates the positional embedding tensor as expected.
# Codeblock 6
learnable_embedding = LearnableEmbedding()
pos_embed = learnable_embedding()
# Codeblock 6 Output
learnable embedding : torch.Size([576, 768])
The main encoder block

The next thing we’re going to do is to construct the main encoder block displayed in Figure 7 above. Here you can see that this block consists of several sub-components, namely self-attention, layer norm, FFN (Feed-Forward Network), and another layer norm. Codeblock 7a below shows how I initialize these layers inside the __init__() method of the EncoderBlock class.
# Codeblock 7a
class EncoderBlock(nn.Module):
    def __init__(self):
        super().__init__()
        #(1)
        self.self_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,
                                                    num_heads=NUM_HEADS,
                                                    batch_first=True, #(2)
                                                    dropout=DROP_PROB)
        self.layer_norm_0 = nn.LayerNorm(EMBED_DIM) #(3)
        self.ffn = nn.Sequential( #(4)
            nn.Linear(in_features=EMBED_DIM, out_features=HIDDEN_DIM),
            nn.GELU(),
            nn.Dropout(p=DROP_PROB),
            nn.Linear(in_features=HIDDEN_DIM, out_features=EMBED_DIM),
        )
        self.layer_norm_1 = nn.LayerNorm(EMBED_DIM) #(5)
I’ve previously mentioned that the idea of ViT is to capture the relationships between patches within an image. This process is done by the multihead attention layer I initialize at line #(1) in the above codeblock. One thing to keep in mind here is that we need to set the batch_first parameter to True (#(2)). This is essentially done so that the attention layer is compatible with our tensor shape, in which the batch dimension (batch_size) is on the 0th axis of the tensor. Next, the two layer normalization layers need to be initialized separately, as shown at lines #(3) and #(5). Lastly, we initialize the FFN block at line #(4), where the layers stacked using nn.Sequential follow the structure defined in the following equation.

Now that the __init__() method is complete, we can proceed with the forward() method. Let’s take a look at Codeblock 7b below.
# Codeblock 7b
    def forward(self, features): #(1)
        residual = features #(2)
        print(f'features & residual\t: {residual.size()}')
        #(3)
        features, self_attn_weights = self.self_attention(query=features,
                                                          key=features,
                                                          value=features)
        print(f'after self attention\t: {features.size()}')
        print(f"self attn weights\t: {self_attn_weights.shape}")
        features = self.layer_norm_0(features + residual) #(4)
        print(f'after norm\t\t: {features.size()}')
        residual = features
        print(f'\nfeatures & residual\t: {residual.size()}')
        features = self.ffn(features) #(5)
        print(f'after ffn\t\t: {features.size()}')
        features = self.layer_norm_1(features + residual)
        print(f'after norm\t\t: {features.size()}')
        return features
Here you can see that the input tensor is named features (#(1)). I name it this way because the input of the EncoderBlock is the image that has already been processed with Patcher and LearnableEmbedding, instead of a raw image. Before doing anything, notice in the encoder block that there is a branch separated from the main flow which then returns back to the normalization layer. This branch is commonly known as a residual connection. To implement this, we need to store the original input tensor in the residual variable, as I demonstrate at line #(2). Now that the input tensor has been saved, we are ready to process the original input with the multihead attention layer (#(3)). Since this is a self-attention (not a cross-attention), the query, key, and value inputs for this layer are all derived from the features tensor. Next, the layer normalization operation is performed at line #(4), where the input for this layer already contains information from the attention block as well as the residual connection. The remaining steps are basically the same as what I just explained, except that here we replace the self-attention block with FFN (#(5)).
In the following codeblock, I’ll test the EncoderBlock class by passing a dummy tensor of size 1×576×768, simulating an output tensor from the previous operations.
# Codeblock 8
encoder_block = EncoderBlock()
features = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)
features = encoder_block(features)
Below is what the tensor dimension looks like throughout the entire process inside the model.
# Codeblock 8 Output
features & residual  : torch.Size([1, 576, 768]) #(1)
after self attention : torch.Size([1, 576, 768])
self attn weights    : torch.Size([1, 576, 576]) #(2)
after norm           : torch.Size([1, 576, 768])

features & residual  : torch.Size([1, 576, 768])
after ffn            : torch.Size([1, 576, 768]) #(3)
after norm           : torch.Size([1, 576, 768]) #(4)
Here you can see that the final output tensor (#(4)) has the same size as the input (#(1)), allowing us to stack multiple encoder blocks without having to worry about messing up the tensor dimensions. Not only that, the size of the tensor also appears to be unchanged from the beginning all the way to the last layer. In fact, there are actually lots of transformations performed inside the attention block, but we just can’t see them since the entire process is done internally by the nn.MultiheadAttention layer. One of the tensors produced in the layer that we can observe is the attention weight (#(2)). This weight matrix, which has the size of 576×576, is responsible for storing information regarding the relationships between one patch and every other patch in the image. Furthermore, changes in tensor dimension actually also happen inside the FFN layer. The feature vector of each patch, which has the initial length of 768, is expanded to 3072 and immediately shrunk back to 768 again (#(3)). However, this transformation is not printed since the process is wrapped with nn.Sequential back at line #(4) in Codeblock 7a.
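To see where an N×N attention weight matrix comes from, the following pure-Python sketch computes scaled dot-product self-attention on three toy tokens. It is a deliberate simplification and not what nn.MultiheadAttention does internally in full: it uses a single head and no learned query/key/value projections, and the names softmax and self_attention are helpers defined only for this illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention without learned projections.

    x is a list of N token vectors of length d. Returns the N x d output
    and the N x N attention weight matrix.
    """
    d = len(x[0])
    # Similarity of every token (query) with every token (key).
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
              for q in x]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all token vectors (values).
    output = [[sum(w * v[j] for w, v in zip(w_row, x)) for j in range(d)]
              for w_row in weights]
    return output, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3 tokens, d = 2 features
out, attn = self_attention(tokens)
print(len(attn), len(attn[0]))   # N x N weight matrix
print(len(out), len(out[0]))     # output keeps the N x d input shape
```

With 576 patches the weight matrix becomes 576×576, as in the output above. As far as I know, the printed weights have no separate head axis because nn.MultiheadAttention averages the per-head weights by default.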
ViT encoder

Now that we’ve finished implementing all encoder components, we can assemble them to construct the actual ViT Encoder. We’re going to do it in the Encoder class in Codeblock 9.
# Codeblock 9
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.patcher = Patcher() #(1)
        self.learnable_embedding = LearnableEmbedding() #(2)
        #(3)
        self.encoder_blocks = nn.ModuleList(EncoderBlock() for _ in range(NUM_ENCODER_BLOCKS))

    def forward(self, images): #(4)
        print(f'images\t\t\t: {images.size()}')
        features = self.patcher(images) #(5)
        print(f'after patcher\t\t: {features.size()}')
        features = features + self.learnable_embedding() #(6)
        print(f'after learn embed\t: {features.size()}')
        for i, encoder_block in enumerate(self.encoder_blocks):
            features = encoder_block(features) #(7)
            print(f"after encoder block #{i}\t: {features.shape}")
        return features
Inside the __init__() method, what we need to do is to initialize all components we created earlier, i.e., Patcher (#(1)), LearnableEmbedding (#(2)), and EncoderBlock (#(3)). In this case, the EncoderBlock is initialized inside nn.ModuleList since we want to repeat it NUM_ENCODER_BLOCKS (12) times. As for the forward() method, it initially works by accepting a raw image as the input (#(4)). We then process it with the patcher layer (#(5)) to divide the image into small patches and transform them with the linear projection operation. The learnable positional embedding tensor is then injected into the resulting output by element-wise addition (#(6)). Lastly, we pass it through the 12 encoder blocks sequentially with a simple for loop (#(7)).
Now, in Codeblock 10, I’m going to pass a dummy image through the entire encoder. Note that since I want to focus on the flow of this Encoder class, I re-run the classes we created earlier with the print() functions commented out so that the outputs look neat.
# Codeblock 10
encoder = Encoder()
images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = encoder(images)
And below is what the flow of the tensor looks like. Here, we can see that our dummy input image successfully passes through all layers in the network, including the encoder blocks that we repeat 12 times. The resulting output tensor is now context-aware, meaning that it already contains information about the relationships between patches within the image. Therefore, this tensor is now ready to be processed further with the decoder, which will be discussed in a later section.
# Codeblock 10 Output
images                  : torch.Size([1, 3, 384, 384])
after patcher           : torch.Size([1, 576, 768])
after learn embed       : torch.Size([1, 576, 768])
after encoder block #0  : torch.Size([1, 576, 768])
after encoder block #1  : torch.Size([1, 576, 768])
after encoder block #2  : torch.Size([1, 576, 768])
after encoder block #3  : torch.Size([1, 576, 768])
after encoder block #4  : torch.Size([1, 576, 768])
after encoder block #5  : torch.Size([1, 576, 768])
after encoder block #6  : torch.Size([1, 576, 768])
after encoder block #7  : torch.Size([1, 576, 768])
after encoder block #8  : torch.Size([1, 576, 768])
after encoder block #9  : torch.Size([1, 576, 768])
after encoder block #10 : torch.Size([1, 576, 768])
after encoder block #11 : torch.Size([1, 576, 768])
ViT encoder (alternative)
I want to show you something before we talk about the decoder. If you think that our approach above is too complicated, it is actually possible for you to use nn.TransformerEncoderLayer from PyTorch so that you don’t need to implement the EncoderBlock class from scratch. To do so, I’m going to reimplement the Encoder class, but this time I’ll name it EncoderTorch.
# Codeblock 11
class EncoderTorch(nn.Module):
    def __init__(self):
        super().__init__()
        self.patcher = Patcher()
        self.learnable_embedding = LearnableEmbedding()
        #(1)
        encoder_block = nn.TransformerEncoderLayer(d_model=EMBED_DIM,
                                                   nhead=NUM_HEADS,
                                                   dim_feedforward=HIDDEN_DIM,
                                                   dropout=DROP_PROB,
                                                   batch_first=True)
        #(2)
        self.encoder_blocks = nn.TransformerEncoder(encoder_layer=encoder_block,
                                                    num_layers=NUM_ENCODER_BLOCKS)

    def forward(self, images):
        print(f'images\t\t\t: {images.size()}')
        features = self.patcher(images)
        print(f'after patcher\t\t: {features.size()}')
        features = features + self.learnable_embedding()
        print(f'after learn embed\t: {features.size()}')
        features = self.encoder_blocks(features) #(3)
        print(f'after encoder blocks\t: {features.size()}')
        return features
What we basically do in the above codeblock is that instead of using the EncoderBlock class, here we use nn.TransformerEncoderLayer (#(1)), which automatically creates a single encoder block based on the parameters we pass to it. To repeat it multiple times, we can just use nn.TransformerEncoder and pass a number to the num_layers parameter (#(2)). With this approach, we don’t need to write the forward pass in a loop like what we did earlier (#(3)).
The testing code in Codeblock 12 below is exactly the same as the one in Codeblock 10, except that here I use the EncoderTorch class. You can also see here that the output shape is the same as the previous one. One difference to be aware of is that nn.TransformerEncoderLayer uses ReLU in its feed-forward part by default, so pass activation='gelu' if you want it to match the FFN in our EncoderBlock.
# Codeblock 12
encoder_torch = EncoderTorch()
images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = encoder_torch(images)
# Codeblock 12 Output
images               : torch.Size([1, 3, 384, 384])
after patcher        : torch.Size([1, 576, 768])
after learn embed    : torch.Size([1, 576, 768])
after encoder blocks : torch.Size([1, 576, 768])
Decoder
Now that we’ve successfully created the encoder part of the CPTR architecture, we can talk about the decoder. In this section I’m going to implement every single component inside the blue box in Figure 4. Based on the figure, we can see that the decoder accepts two inputs, i.e., the image caption ground truth (the lower part of the blue box) and the sequence of embedded patches produced by the encoder (the arrow coming from the green box). It is important to know that the architecture drawn in Figure 4 is intended to illustrate the training phase, where the entire caption ground truth is fed into the decoder. Later in the inference phase, we only provide a
Sinusoidal positional embedding

If you take a look at the CPTR model, you’ll see that the first step in the decoder is to convert each word into the corresponding feature vector representation using the word embedding block. However, since this step is very straightforward, we’re going to implement it later. For now let’s assume that this word vectorization process is already done, so we can move on to the positional embedding part.
As I’ve mentioned earlier, since a transformer is permutation-invariant by nature, we need to apply positional embedding to the input sequence. Different from the previous one, here we use the so-called sinusoidal positional embedding. We can think of it as a way to label each word vector by assigning numbers obtained from a sinusoidal wave. By doing so, we can expect our model to understand word order thanks to the information given by the wave patterns.
If you go back to Codeblock 6 Output, you’ll see that the positional embedding tensor in the encoder has the size of NUM_PATCHES × EMBED_DIM (576×768). What we basically want to do in the decoder is to create a tensor having the size of SEQ_LENGTH × EMBED_DIM (30×768), whose values are computed based on the equation shown in Figure 11. This tensor is then set to be non-trainable because a sequence of words must maintain a fixed order to preserve its meaning.

Here I want to explain the following code quickly because I have actually discussed this more thoroughly in my previous article about Transformer. Generally speaking, what we basically do here is to create the sine and cosine waves using torch.sin() (#(1)) and torch.cos() (#(2)). The resulting two tensors are then merged using the code at lines #(3) and #(4).
# Codeblock 13
class SinusoidalEmbedding(nn.Module):
    def forward(self):
        pos = torch.arange(SEQ_LENGTH).reshape(SEQ_LENGTH, 1)
        print(f"pos\t\t: {pos.shape}")
        i = torch.arange(0, EMBED_DIM, 2)
        denominator = torch.pow(10000, i/EMBED_DIM)
        print(f"denominator\t: {denominator.shape}")
        even_pos_embed = torch.sin(pos/denominator) #(1)
        odd_pos_embed = torch.cos(pos/denominator) #(2)
        print(f"even_pos_embed\t: {even_pos_embed.shape}")
        stacked = torch.stack([even_pos_embed, odd_pos_embed], dim=2) #(3)
        print(f"stacked\t\t: {stacked.shape}")
        pos_embed = torch.flatten(stacked, start_dim=1, end_dim=2) #(4)
        print(f"pos_embed\t: {pos_embed.shape}")
        return pos_embed
Now we can check if the SinusoidalEmbedding class above works properly by running Codeblock 14 below. As expected, here you can see that the resulting tensor has the size of 30×768. This dimension matches the tensor produced by the word embedding block, allowing them to be summed in an element-wise manner.
# Codeblock 14
sinusoidal_embedding = SinusoidalEmbedding()
pos_embed = sinusoidal_embedding()
# Codeblock 14 Output
pos            : torch.Size([30, 1])
denominator    : torch.Size([384])
even_pos_embed : torch.Size([30, 384])
stacked        : torch.Size([30, 384, 2])
pos_embed      : torch.Size([30, 768])
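The stack-and-flatten trick in Codeblock 13 interleaves the sine and cosine values, so that even feature indices hold sin(pos / 10000^(i/d)) and odd indices hold cos(pos / 10000^(i/d)), where i is the even index and d is the embedding size. The sketch below rebuilds the same table with plain Python loops, which may be easier to follow; the function name sinusoidal_table is my own and is not part of the original code.

```python
import math


def sinusoidal_table(seq_length, embed_dim):
    """Build a seq_length x embed_dim sinusoidal positional embedding table."""
    table = []
    for pos in range(seq_length):
        row = []
        for i in range(0, embed_dim, 2):              # i = 0, 2, 4, ...
            angle = pos / (10000 ** (i / embed_dim))
            row.append(math.sin(angle))               # goes to even index i
            row.append(math.cos(angle))               # goes to odd index i + 1
        table.append(row)
    return table


table = sinusoidal_table(seq_length=30, embed_dim=768)
print(len(table), len(table[0]))   # same 30 x 768 shape as pos_embed
print(table[0][:4])                # position 0 alternates sin(0) and cos(0)
```

Since the values depend only on the position and the feature index, the table can be computed once and reused for every caption, which fits the choice of keeping it non-trainable.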
Look-ahead mask

The next thing I’m going to talk about in the decoder is the masked self-attention layer highlighted in the above figure. I’m not going to code the attention mechanism from scratch. Rather, I’ll only implement the so-called look-ahead mask, which is used by the self-attention layer so that it does not attend to the subsequent words in the caption during the training phase.
The way to do it is pretty easy. What we need to do is just to create a triangular matrix whose size is set to match the attention weight matrix, i.e., SEQ_LENGTH × SEQ_LENGTH (30×30). Look at the create_mask() function below for the details.
# Codeblock 15
def create_mask(seq_length):
    mask = torch.tril(torch.ones((seq_length, seq_length))) #(1)
    mask[mask == 0] = -float('inf') #(2)
    mask[mask == 1] = 0 #(3)
    return mask
Even though creating a triangular matrix can simply be done with torch.tril() and torch.ones() (#(1)), here we need to make a small modification by changing the 0 values to -inf (#(2)) and the 1s to 0 (#(3)). This is essentially done because the nn.MultiheadAttention layer applies the mask by element-wise addition. By assigning -inf to the subsequent words, the attention mechanism will completely ignore them. Again, the internal process inside an attention layer has also been discussed in detail in my previous article about Transformer.
Now I’m going to run the function with seq_length=7 so that you can see what the mask actually looks like. Later in the complete flow, we need to set the seq_length parameter to SEQ_LENGTH (30) so that it matches the actual caption length.
# Codeblock 16
mask_example = create_mask(seq_length=7)
mask_example
# Codeblock 16 Output
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf],
[0., 0., -inf, -inf, -inf, -inf, -inf],
[0., 0., 0., -inf, -inf, -inf, -inf],
[0., 0., 0., 0., -inf, -inf, -inf],
[0., 0., 0., 0., 0., -inf, -inf],
[0., 0., 0., 0., 0., 0., -inf],
[0., 0., 0., 0., 0., 0., 0.]])
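To illustrate why adding -inf makes the attention layer ignore the subsequent words, the following pure-Python sketch adds a look-ahead mask to a small matrix of made-up attention scores and then applies softmax row by row. The score values are arbitrary, and create_mask_rows and softmax are helpers written only for this example.

```python
import math


def create_mask_rows(seq_length):
    """Look-ahead mask as nested lists: 0 on and below the diagonal, -inf above."""
    neg_inf = -float('inf')
    return [[0.0 if col <= row else neg_inf for col in range(seq_length)]
            for row in range(seq_length)]


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to exactly 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary raw attention scores for a 3-word caption (illustrative values).
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -1.0, 0.0]]

mask = create_mask_rows(3)
masked = [[s + m for s, m in zip(s_row, m_row)]
          for s_row, m_row in zip(scores, mask)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but all the weight is distributed over the current and earlier words only; the first word, for instance, can attend to nothing but itself.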
The main decoder block

We can see in the above figure that the structure of the decoder block is a bit longer than that of the encoder block. Everything is nearly the same, except that the decoder has a cross-attention mechanism and an additional layer normalization step placed after it. This cross-attention layer can be seen as the bridge between the encoder and the decoder, as it is employed to capture the relationships between each word in the caption and every single patch in the input image. The two arrows coming from the encoder are the key and value inputs for the attention layer, whereas the query is derived from the previous layer in the decoder itself. Look at Codeblocks 17a and 17b below to see the implementation of the entire decoder block.
# Codeblock 17a
class DecoderBlock(nn.Module):
    def __init__(self):
        super().__init__()
        #(1)
        self.self_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,
                                                    num_heads=NUM_HEADS,
                                                    batch_first=True,
                                                    dropout=DROP_PROB)
        #(2)
        self.layer_norm_0 = nn.LayerNorm(EMBED_DIM)
        #(3)
        self.cross_attention = nn.MultiheadAttention(embed_dim=EMBED_DIM,
                                                     num_heads=NUM_HEADS,
                                                     batch_first=True,
                                                     dropout=DROP_PROB)
        #(4)
        self.layer_norm_1 = nn.LayerNorm(EMBED_DIM)
        #(5)
        self.ffn = nn.Sequential(
            nn.Linear(in_features=EMBED_DIM, out_features=HIDDEN_DIM),
            nn.GELU(),
            nn.Dropout(p=DROP_PROB),
            nn.Linear(in_features=HIDDEN_DIM, out_features=EMBED_DIM),
        )
        #(6)
        self.layer_norm_2 = nn.LayerNorm(EMBED_DIM)
In the __init__() method, we first initialize both the self-attention (#(1)) and cross-attention (#(3)) layers with nn.MultiheadAttention. These two layers look exactly the same now, but later you’ll see the difference in the forward() method. The three layer normalization operations are initialized separately, as shown at lines #(2), #(4) and #(6), since each of them will hold its own normalization parameters. Lastly, the ffn layer (#(5)) is exactly the same as the one in the encoder, which basically follows the equation back in Figure 8.
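To make the point about separate normalization parameters concrete, below is a minimal pure-Python sketch of layer normalization applied to a single embedding vector. The gamma and beta values are made up for illustration; in nn.LayerNorm they are learned during training, and each of the three layers owns its own pair.

```python
import math

def layer_norm(vector, gamma, beta, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [g * (x - mean) / math.sqrt(variance + eps) + b
            for x, g, b in zip(vector, gamma, beta)]

token = [1.0, 2.0, 3.0, 4.0]  # a toy 4-dimensional embedding

# Two layers with different (hypothetical) parameters give different outputs
# for the same input, which is why they cannot share one LayerNorm instance.
norm_a = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)
norm_b = layer_norm(token, gamma=[2.0] * 4, beta=[0.5] * 4)
print([round(v, 3) for v in norm_a])
print([round(v, 3) for v in norm_b])
```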
As for the forward() method below, it accepts three inputs: features, captions, and attn_mask, which denote the tensor coming from the encoder, the tensor from the decoder itself, and the look-ahead mask, respectively (#(1)). The remaining steps are somewhat similar to those of the EncoderBlock, except that here we repeat the multihead attention block twice. The first attention mechanism takes captions as the query, key, and value parameters (#(2)). This is done because we want the layer to capture the context within the captions tensor itself, hence the name self-attention. Here we also need to pass the attn_mask parameter to this layer so that it cannot see the subsequent words during the training phase. The second attention mechanism is different (#(3)). Since we want to combine the information from the encoder and the decoder, we need to pass the captions tensor as the query, whereas the features tensor will be passed as the key and value, hence the name cross-attention. A look-ahead mask is not necessary in the cross-attention layer since, in the inference phase, the model is able to see the entire input image at once rather than looking at the patches one by one. After the tensor has been processed by the two attention layers, we then pass it through the feed-forward network (#(4)). Lastly, don’t forget to create the residual connections and apply the layer normalization steps after each sub-component.
# Codeblock 17b
    def forward(self, features, captions, attn_mask):    #(1)
        print(f"attn_mask\t\t: {attn_mask.shape}")
        residual = captions
        print(f"captions & residual\t: {captions.shape}")
        #(2) Masked self-attention: query, key and value all come from the captions
        captions, self_attn_weights = self.self_attention(query=captions,
                                                          key=captions,
                                                          value=captions,
                                                          attn_mask=attn_mask)
        print(f"after self attention\t: {captions.shape}")
        print(f"self attn weights\t: {self_attn_weights.shape}")
        captions = self.layer_norm_0(captions + residual)
        print(f"after norm\t\t: {captions.shape}")
        print(f"\nfeatures\t\t: {features.shape}")
        residual = captions
        print(f"captions & residual\t: {captions.shape}")
        #(3) Cross-attention: query from the captions, key and value from the encoder
        captions, cross_attn_weights = self.cross_attention(query=captions,
                                                            key=features,
                                                            value=features)
        print(f"after cross attention\t: {captions.shape}")
        print(f"cross attn weights\t: {cross_attn_weights.shape}")
        captions = self.layer_norm_1(captions + residual)
        print(f"after norm\t\t: {captions.shape}")
        residual = captions
        print(f"\ncaptions & residual\t: {captions.shape}")
        captions = self.ffn(captions)    #(4)
        print(f"after ffn\t\t: {captions.shape}")
        captions = self.layer_norm_2(captions + residual)
        print(f"after norm\t\t: {captions.shape}")
        return captions
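The three stages of this forward pass can be summarized compactly. In the sketch below, x stands for the incoming captions tensor, f for the encoder features, and M for the look-ahead mask; these symbols are my own shorthand for the code above rather than notation taken from the paper.

```latex
\begin{aligned}
x_1 &= \mathrm{LayerNorm}_0\big(x + \mathrm{SelfAttn}(Q{=}x,\ K{=}x,\ V{=}x;\ M)\big) \\
x_2 &= \mathrm{LayerNorm}_1\big(x_1 + \mathrm{CrossAttn}(Q{=}x_1,\ K{=}f,\ V{=}f)\big) \\
x_3 &= \mathrm{LayerNorm}_2\big(x_2 + \mathrm{FFN}(x_2)\big)
\end{aligned}
```

Each line is one sub-component wrapped in a residual connection and followed by its own layer normalization, and x_3 is the tensor returned by the block.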
Now that the DecoderBlock class is completed, we can test it with the following code.
# Codeblock 18
decoder_block = DecoderBlock()
features = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)    #(1)
captions = torch.randn(BATCH_SIZE, SEQ_LENGTH, EMBED_DIM)     #(2)
look_ahead_mask = create_mask(seq_length=SEQ_LENGTH)          #(3)
captions = decoder_block(features, captions, look_ahead_mask)
Here we assume that features is a tensor containing a sequence of patch embeddings produced by the encoder (#(1)), while captions is a sequence of embedded words (#(2)). The seq_length parameter of the look-ahead mask is set to SEQ_LENGTH (30) to match the number of words in the caption (#(3)). The tensor dimensions after each step are displayed in the following output.
# Codeblock 18 Output
attn_mask             : torch.Size([30, 30])
captions & residual   : torch.Size([1, 30, 768])
after self attention  : torch.Size([1, 30, 768])
self attn weights     : torch.Size([1, 30, 30])     #(1)
after norm            : torch.Size([1, 30, 768])
features              : torch.Size([1, 576, 768])
captions & residual   : torch.Size([1, 30, 768])
after cross attention : torch.Size([1, 30, 768])
cross attn weights    : torch.Size([1, 30, 576])    #(2)
after norm            : torch.Size([1, 30, 768])
captions & residual   : torch.Size([1, 30, 768])
after ffn             : torch.Size([1, 30, 768])
after norm            : torch.Size([1, 30, 768])
Here we can see that our DecoderBlock class works properly, as it successfully processed the input tensors all the way to the last layer in the network. Take a closer look at the attention weights at lines #(1) and #(2). The attention weight matrix produced by the self-attention layer has a size of 30×30 (#(1)), which is what we expect when every word in the caption attends to the words of that same caption. Meanwhile, the attention weight matrix generated by the cross-attention layer has a size of 30×576 (#(2)), i.e., one weight for every word-patch pair. These shapes are consistent with a correct implementation: after the cross-attention operation is performed, the resulting captions tensor has been enriched with information from the image.
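The shape argument above can be reproduced without PyTorch. The following is a minimal single-head sketch of scaled dot-product attention on plain Python lists, using toy sizes (3 words, 5 patches, dimension 4) as stand-ins for 30, 576 and 768. It leaves out the learned projections, multiple heads, masking and dropout that nn.MultiheadAttention adds, so treat it as an illustration of the shapes only.

```python
import math

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain Python lists."""
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    # Each output row is a weighted average of the value vectors.
    outputs = [[sum(w * v[d] for w, v in zip(row, values))
                for d in range(len(values[0]))]
               for row in weights]
    return outputs, weights

# Toy data: 3 "words" and 5 "patches", both with 4-dimensional embeddings.
words = [[0.1 * (i + d) for d in range(4)] for i in range(3)]
patches = [[0.05 * (i - d) for d in range(4)] for i in range(5)]

_, self_weights = attention(words, words, words)               # self-attention
enriched, cross_weights = attention(words, patches, patches)   # cross-attention

print(len(self_weights), len(self_weights[0]))      # words x words
print(len(cross_weights), len(cross_weights[0]))    # words x patches
print(len(enriched), len(enriched[0]))              # words x embedding dim
```

The number of rows always follows the query and the number of columns follows the key, which is why the two weight matrices in the output above differ in width while the captions tensor keeps its shape.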
Transformer decoder

Now that we’ve successfully created all the components for the entire decoder, what I’m going to do next is put them together into a single class. Look at Codeblock 19a and 19b below to see how I do that.
# Codeblock 19a
class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        #(1) Maps integer token indices to EMBED_DIM-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)
        #(2)
        self.sinusoidal_embedding = SinusoidalEmbedding()
        #(3)
        self.decoder_blocks = nn.ModuleList([DecoderBlock() for _ in range(NUM_DECODER_BLOCKS)])
        #(4) Projects every position to one logit per vocabulary entry
        self.linear = nn.Linear(in_features=EMBED_DIM,
                                out_features=VOCAB_SIZE)
If you compare this Decoder class with the Encoder class from Codeblock 9, you’ll notice that they are quite similar in terms of structure. In the encoder, we convert image patches into vectors using Patcher, while in the decoder we convert every single word in the caption into a vector using the nn.Embedding layer (#(1)), which I haven’t explained earlier. Afterward, we initialize the positional embedding layer, where for the decoder we use the sinusoidal rather than the trainable one (#(2)). Next, we stack multiple decoder blocks using nn.ModuleList (#(3)). The linear layer written at line #(4), which doesn’t exist in the encoder, is necessary here since it is responsible for mapping each of the embedded words into a vector of length VOCAB_SIZE (10000). Later on, this vector will contain the logit of every word in the dictionary, and what we need to do afterward is just take the index containing the highest value, i.e., the most likely word to be predicted.
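Since nn.Embedding has not been covered before, here is a minimal pure-Python sketch of what such a layer does conceptually: it is a lookup table with one row of weights per token index. The toy sizes and random initialization below are assumptions for illustration only; this is not PyTorch's actual implementation, and in the real layer the rows are trained along with the rest of the model.

```python
import random

VOCAB = 10   # toy stand-in for VOCAB_SIZE (10000)
DIM = 4      # toy stand-in for EMBED_DIM (768)

random.seed(0)
# The embedding "layer" is just a table with one row of weights per token id.
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]

def embed(token_ids):
    """Replace every integer token id with its row from the table."""
    return [embedding_table[token_id] for token_id in token_ids]

caption_ids = [3, 7, 3, 1]
embedded = embed(caption_ids)
print(len(embedded), len(embedded[0]))   # sequence length x DIM
print(embedded[0] == embedded[2])        # the same id always maps to the same vector
```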
The flow of the tensors within the forward() method itself is also quite similar to the one in the Encoder class. In Codeblock 19b below we pass features, captions, and attn_mask as the input (#(1)). Keep in mind that in this case the captions tensor contains the raw word sequence, i.e., integer token indices, so we need to vectorize these words with the embedding layer beforehand (#(2)). Next, we inject the sinusoidal positional embedding tensor using the code at line #(3) before eventually passing it through the four decoder blocks sequentially (#(4)). Finally, we pass the resulting tensor through the last linear layer to obtain the prediction logits (#(5)).
# Codeblock 19b
    def forward(self, features, captions, attn_mask):    #(1)
        print(f"features\t\t: {features.shape}")
        print(f"captions\t\t: {captions.shape}")
        captions = self.embedding(captions)    #(2)
        print(f"after embedding\t\t: {captions.shape}")
        captions = captions + self.sinusoidal_embedding()    #(3)
        print(f"after sin embed\t\t: {captions.shape}")
        for i, decoder_block in enumerate(self.decoder_blocks):
            captions = decoder_block(features, captions, attn_mask)    #(4)
            print(f"after decoder block #{i}\t: {captions.shape}")
        captions = self.linear(captions)    #(5)
        print(f"after linear\t\t: {captions.shape}")
        return captions
At this point you might be wondering why we don’t implement the softmax activation function as drawn in the illustration. This is because during the training phase, softmax is typically included within the loss function, whereas in the inference phase, the index of the largest value remains the same regardless of whether softmax is applied.
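Both halves of that argument can be checked with a few lines of pure Python. The sketch below uses made-up logits for a 5-word vocabulary; the function names are my own, and the cross-entropy helper only mirrors the idea of a loss that consumes raw logits (as nn.CrossEntropyLoss does), not PyTorch's actual code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

def cross_entropy_from_logits(logits, target_index):
    """Log-softmax and negative log-likelihood fused in one step, on raw logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]

logits = [2.0, -1.0, 0.5, 3.5, 0.0]   # made-up scores for a 5-word vocabulary
probs = softmax(logits)

# Softmax is monotonic, so the winning index does not change.
print(argmax(logits), argmax(probs))
# The fused loss matches -log(softmax(logits)[target]) without an explicit softmax layer.
print(round(cross_entropy_from_logits(logits, 3), 4), round(-math.log(probs[3]), 4))
```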
Now let’s run the following testing code to check whether there are errors in our implementation. Previously I mentioned that the captions input of the Decoder class is a raw word sequence. To simulate this, we can simply create a sequence of random integers from 0 up to (but not including) VOCAB_SIZE (10000) with a length of SEQ_LENGTH (30) words (#(1)).
# Codeblock 20
decoder = Decoder()
features = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))    #(1)
captions = decoder(features, captions, look_ahead_mask)
And below is what the resulting output looks like. Here you can see in the last line that the linear layer produced a tensor of size 30×10000 for each sample, indicating that our decoder now outputs a logit score for every word in the vocabulary at each of the 30 sequence positions.
# Codeblock 20 Output
features               : torch.Size([1, 576, 768])
captions               : torch.Size([1, 30])
after embedding        : torch.Size([1, 30, 768])
after sin embed        : torch.Size([1, 30, 768])
after decoder block #0 : torch.Size([1, 30, 768])
after decoder block #1 : torch.Size([1, 30, 768])
after decoder block #2 : torch.Size([1, 30, 768])
after decoder block #3 : torch.Size([1, 30, 768])
after linear           : torch.Size([1, 30, 10000])
Transformer decoder (alternative)
It’s actually also possible to make the code simpler by replacing the DecoderBlock class with nn.TransformerDecoderLayer, just like what we did in the ViT encoder. Below is what the code looks like if we use this approach instead.
# Codeblock 21
class DecoderTorch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)
        self.sinusoidal_embedding = SinusoidalEmbedding()
        #(1) A single decoder block
        decoder_block = nn.TransformerDecoderLayer(d_model=EMBED_DIM,
                                                   nhead=NUM_HEADS,
                                                   dim_feedforward=HIDDEN_DIM,
                                                   dropout=DROP_PROB,
                                                   batch_first=True)
        #(2) The block repeated NUM_DECODER_BLOCKS times
        self.decoder_blocks = nn.TransformerDecoder(decoder_layer=decoder_block,
                                                    num_layers=NUM_DECODER_BLOCKS)
        self.linear = nn.Linear(in_features=EMBED_DIM,
                                out_features=VOCAB_SIZE)

    def forward(self, features, captions, tgt_mask):
        print(f"features\t\t: {features.shape}")
        print(f"captions\t\t: {captions.shape}")
        captions = self.embedding(captions)
        print(f"after embedding\t\t: {captions.shape}")
        captions = captions + self.sinusoidal_embedding()
        print(f"after sin embed\t\t: {captions.shape}")
        #(3)
        captions = self.decoder_blocks(tgt=captions,
                                       memory=features,
                                       tgt_mask=tgt_mask)
        print(f"after decoder blocks\t: {captions.shape}")
        captions = self.linear(captions)
        print(f"after linear\t\t: {captions.shape}")
        return captions
The main difference you will see in the __init__() method is the use of nn.TransformerDecoderLayer and nn.TransformerDecoder at lines #(1) and #(2), where the former is used to initialize a single decoder block, and the latter is for repeating the block multiple times. Next, the forward() method is mostly similar to the one in the Decoder class, except that the forward propagation through the decoder blocks is automatically repeated four times without needing to be put inside a loop (#(3)). One thing you need to pay attention to in the decoder_blocks layer is that the tensor coming from the encoder (features) must be passed as the argument for the memory parameter. Meanwhile, the tensor from the decoder itself (captions) has to be passed as the input to the tgt parameter. Also note that nn.TransformerDecoderLayer uses ReLU in its feed-forward network by default, so pass activation="gelu" if you want it to match the GELU used in our DecoderBlock.
The testing code for the DecoderTorch model below is basically the same as the one written in Codeblock 20. Here you can see that this model also generates a final output tensor of size 30×10000.
# Codeblock 22
decoder_torch = DecoderTorch()
features = torch.randn(BATCH_SIZE, NUM_PATCHES, EMBED_DIM)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))
captions = decoder_torch(features, captions, look_ahead_mask)
# Codeblock 22 Output
features             : torch.Size([1, 576, 768])
captions             : torch.Size([1, 30])
after embedding      : torch.Size([1, 30, 768])
after sin embed      : torch.Size([1, 30, 768])
after decoder blocks : torch.Size([1, 30, 768])
after linear         : torch.Size([1, 30, 10000])
The entire CPTR model
Finally, it’s time to put the encoder and the decoder parts we just created into a single class to actually construct the CPTR architecture. You can see in Codeblock 23 below that the implementation is very simple. All we need to do here is initialize the encoder (#(1)) and the decoder (#(2)) components, then pass the raw images and the corresponding caption ground truths as well as the look-ahead mask to the forward() method (#(3)). Additionally, it is also possible to replace the Encoder and the Decoder with EncoderTorch and DecoderTorch, respectively.
# Codeblock 23
class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()    #EncoderTorch()    #(1)
        self.decoder = Decoder()    #DecoderTorch()    #(2)

    def forward(self, images, captions, look_ahead_mask):    #(3)
        print(f"images\t\t\t: {images.shape}")
        print(f"captions\t\t: {captions.shape}")
        features = self.encoder(images)
        print(f"after encoder\t\t: {features.shape}")
        captions = self.decoder(features, captions, look_ahead_mask)
        print(f"after decoder\t\t: {captions.shape}")
        return captions
We can do the testing by passing dummy tensors through it. See Codeblock 24 below for the details. In this case, images is basically just a tensor of random numbers with dimensions 1×3×384×384 (#(1)), while captions is a tensor of size 1×30 containing random integers (#(2)).
# Codeblock 24
encoder_decoder = EncoderDecoder()
images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)    #(1)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))        #(2)
captions = encoder_decoder(images, captions, look_ahead_mask)
Below is what the output looks like. We can see here that our input images and captions successfully went through all the layers in the network, which means that the tensor shapes are wired together correctly and the CPTR model we created is ready to be trained on image captioning datasets.
# Codeblock 24 Output
images        : torch.Size([1, 3, 384, 384])
captions      : torch.Size([1, 30])
after encoder : torch.Size([1, 576, 768])
after decoder : torch.Size([1, 30, 10000])
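Although inference is not demonstrated in this article, the "take the index with the highest logit" idea mentioned earlier could be turned into a caption with a greedy loop along the following lines. This is only a sketch under stated assumptions: dummy_model is a hypothetical stand-in that returns one list of logits per position (it is not the EncoderDecoder class above), and the START_ID, END_ID and PAD_ID values are made up.

```python
SEQ_LENGTH = 8                        # toy stand-in for the article's SEQ_LENGTH (30)
START_ID, END_ID, PAD_ID = 1, 2, 0    # hypothetical special-token ids

def dummy_model(image, caption_ids):
    """Hypothetical stand-in for a captioning model: one list of logits per position.

    It favors token (previous id + 1), starting from 3, and emits END_ID after id 5.
    """
    vocab_size = 10
    all_logits = []
    for token_id in caption_ids:
        next_id = END_ID if token_id >= 5 else max(token_id + 1, 3)
        all_logits.append([1.0 if i == next_id else 0.0 for i in range(vocab_size)])
    return all_logits

def greedy_decode(model, image):
    """Generate a caption one token at a time, always picking the highest logit."""
    caption = [START_ID] + [PAD_ID] * (SEQ_LENGTH - 1)
    for position in range(SEQ_LENGTH - 1):
        logits = model(image, caption)[position]    # prediction for the next word
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        caption[position + 1] = next_id
        if next_id == END_ID:
            break
    return caption

print(greedy_decode(dummy_model, image=None))
```

With a real model, the look-ahead mask is what makes the padded future positions harmless here, because the prediction at a given position cannot attend to anything after it.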
Ending
That was pretty much everything about the theory and implementation of the CaPtion TransformeR architecture. Let me know what deep learning architecture I should implement next. Feel free to leave a comment if you spot any mistakes in this article!
The code used in this article is available in my GitHub repo. Here’s the link to my previous articles about image captioning, Vision Transformer (ViT), and the original Transformer.
References
[1] Wei Liu et al. CPTR: Full Transformer Network for Image Captioning. arXiv. https://arxiv.org/pdf/2101.10804 [Accessed November 16, 2024].
[2] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. arXiv. https://arxiv.org/pdf/1411.4555 [Accessed December 3, 2024].
[3] Image originally created by author based on: Alexey Dosovitskiy et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv. https://arxiv.org/pdf/2010.11929 [Accessed December 3, 2024].
[4] Image originally created by author based on [6].
[5] Image originally created by author based on [1].
[6] Ashish Vaswani et al. Attention Is All You Need. arXiv. https://arxiv.org/pdf/1706.03762 [Accessed December 3, 2024].