Papers Explained 205: LeViT
LeViT is a hybrid neural network for fast inference image classification. LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU.
Architecture
LeViT builds upon the ViT architecture and DeiT training method.
Patch embedding
LeViT applies 4 layers of 3×3 convolutions (stride 2) to the input to perform the resolution reduction. The number of channels goes C = 3, 32, 64, 128, 256. The patch extractor for LeViT-256 transforms the image shape (3, 224, 224) into (256, 14, 14) with 184 MFLOPs.
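A minimal sketch of this convolutional patch embedding in PyTorch (the helper name and module layout are illustrative, not the authors' exact implementation; the BatchNorm/Hardswish convention is described below):

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch):
    # 3x3 convolution with stride 2, followed by BatchNorm and Hardswish
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.Hardswish(),
    )

# Four stride-2 convolutions: 3 -> 32 -> 64 -> 128 -> 256 channels,
# reducing 224x224 to 14x14 (224 / 2^4 = 14).
patch_embed = nn.Sequential(
    conv_bn_act(3, 32),
    conv_bn_act(32, 64),
    conv_bn_act(64, 128),
    conv_bn_act(128, 256),
)

x = torch.randn(1, 3, 224, 224)
print(patch_embed(x).shape)  # torch.Size([1, 256, 14, 14])
```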
No classification token
To use the BCHW tensor format, the classification token is replaced with average pooling on the last activation map (which produces the embedding used in the classifier), as in convolutional networks. For distillation during training, separate heads are trained; at test time, the average of the outputs of the two heads is taken.
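A hedged sketch of how the pooled embedding might feed the two heads (class and attribute names are mine, not the paper's code):

```python
import torch
import torch.nn as nn

class PooledHeads(nn.Module):
    # Illustrative only: global average pooling over the final activation map,
    # followed by a classification head and a distillation head.
    def __init__(self, dim=384, num_classes=1000):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)       # supervised head
        self.head_dist = nn.Linear(dim, num_classes)  # distillation head

    def forward(self, x):                 # x: (B, C, H, W)
        emb = x.mean(dim=(2, 3))          # average pooling -> (B, C)
        if self.training:
            return self.head(emb), self.head_dist(emb)  # separate losses during training
        # at test time, average the predictions of the two heads
        return (self.head(emb) + self.head_dist(emb)) / 2
```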
Normalization layers and activations
For LeViT, each convolution is followed by a batch normalization and all of LeViT’s non-linear activations are Hardswish.
Multi-resolution pyramid
LeViT integrates the ResNet stages within the transformer architecture. Inside the stages, the architecture is similar to a vision transformer: a residual structure with alternated attention and MLP blocks.
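In pseudocode, one stage might look like the following sketch of the residual structure (not the authors' code):

```python
import torch.nn as nn

class StageSketch(nn.Module):
    # Illustrative residual structure of one stage: attention and MLP blocks
    # alternate, each wrapped in a residual connection.
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)  # [attn, mlp, attn, mlp, ...]

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)                 # residual connection around every block
        return x
```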
Downsampling
Between the LeViT stages, a shrinking attention block reduces the size of the activation map: a subsampling is applied before the Q transformation, which then propagates to the output of the soft activation.
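A rough single-head sketch of the idea, assuming a BCHW activation map: queries are computed from a stride-2 subsampled grid, while keys and values come from the full grid, so the attention output is already at the reduced resolution (this simplification omits the attention bias and multi-head details):

```python
import torch
import torch.nn as nn

class ShrinkAttentionSketch(nn.Module):
    """Q is computed on a 2x-subsampled grid, K and V on the full grid,
    so the output has a quarter as many positions."""
    def __init__(self, dim, key_dim, out_dim):
        super().__init__()
        self.q = nn.Linear(dim, key_dim)
        self.k = nn.Linear(dim, key_dim)
        self.v = nn.Linear(dim, out_dim)
        self.scale = key_dim ** -0.5

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_sub = x[:, :, ::2, ::2]                      # stride-2 subsampling before Q
        q = self.q(x_sub.flatten(2).transpose(1, 2))   # (B, H/2*W/2, key_dim)
        k = self.k(x.flatten(2).transpose(1, 2))       # (B, H*W, key_dim)
        v = self.v(x.flatten(2).transpose(1, 2))       # (B, H*W, out_dim)
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)
        out = attn @ v                                 # (B, H/2*W/2, out_dim)
        return out.transpose(1, 2).reshape(B, -1, H // 2, W // 2)
```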
Attention bias instead of a positional embedding
An attention bias is added to the attention maps within each attention block, explicitly injecting relative position information into the attention mechanism.
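One way to implement such a per-head relative-position bias (a sketch assuming a fixed 14×14 attention grid; tensor names are mine, not the paper's code):

```python
import itertools
import torch
import torch.nn as nn

H = W = 14        # attention grid resolution
num_heads = 4     # illustrative

# Learned bias: one scalar per head and per (|dx|, |dy|) offset.
attention_biases = nn.Parameter(torch.zeros(num_heads, H * W))

# For each pair of positions, index the bias by the absolute offset.
points = list(itertools.product(range(H), range(W)))
idxs = []
for (x1, y1) in points:
    for (x2, y2) in points:
        idxs.append(abs(x1 - x2) * W + abs(y1 - y2))
attention_bias_idxs = torch.tensor(idxs).view(len(points), len(points))

# Added to the (num_heads, N, N) attention logits before the softmax.
bias = attention_biases[:, attention_bias_idxs]   # (num_heads, H*W, H*W)
```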
Smaller keys
The bias term reduces the pressure on the keys to encode location information, so the size of the key matrices is reduced relative to the V matrix. If the keys have size D ∈ {16, 32}, V will have 2D channels. Restricting the size of the keys reduces the time needed to compute the product QK^T. For downsampling layers, where there is no residual connection, V is set to 4D to prevent loss of information.
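To make the shapes concrete (illustrative numbers, following the D / 2D / 4D ratios described above):

```python
# Per-head projection sizes for a regular attention block with key size D = 32:
D = 32
q_dim, k_dim = D, D        # Q and K kept small -> cheaper QK^T
v_dim = 2 * D              # V gets twice the channels of K

# In a downsampling (shrinking) attention block there is no residual path,
# so V is widened further to limit information loss:
v_dim_downsample = 4 * D
```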
Reducing the MLP blocks
For LeViT, the “MLP” is a 1×1 convolution, followed by the usual batch normalization. To reduce the computational cost of this phase, the expansion factor of the convolution is reduced from 4 to 2.
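A minimal sketch of this reduced MLP block (expansion factor 2), again assuming the Conv+BN+Hardswish convention used elsewhere in the network:

```python
import torch.nn as nn

def mlp_block(dim, expansion=2):
    # 1x1 convolutions acting as the transformer's MLP, with BatchNorm
    # and a reduced expansion factor of 2 instead of the usual 4.
    hidden = dim * expansion
    return nn.Sequential(
        nn.Conv2d(dim, hidden, kernel_size=1, bias=False),
        nn.BatchNorm2d(hidden),
        nn.Hardswish(),
        nn.Conv2d(hidden, dim, kernel_size=1, bias=False),
        nn.BatchNorm2d(dim),
    )
```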
Experiments
LeViT models are identified by the number of input channels to the first transformer stage, e.g. LeViT-256 has 256 channels at the input of the transformer stage.
Each stage consists of a number of pairs of Attention and MLP blocks. N: number of heads, C: number of channels, D: output dimension of the Q and K operators. Separating the stages are shrinking attention blocks whose values of C and C′ are taken from the rows above and below respectively. Drop path with probability p is applied to each residual connection. The value of N in the stride-2 blocks is C/D to make up for the lack of a residual connection. Each attention block is followed by an MLP with expansion factor two.
LeViT models are trained on the ImageNet-2012 dataset and evaluated on its validation set.
Results
Speed-accuracy tradeoffs
- LeViT-384 is on par with DeiT-Small in accuracy but uses half the FLOPs.
- LeViT-128S is on par with DeiT-Tiny while using 4× fewer FLOPs.
- LeViT-192 and LeViT-256 have about the same accuracies as EfficientNet B2 and B3 but are 5× and 7× faster on CPU, respectively.
- LeViT-256 achieves the same score on ImageNet-Real as EfficientNet B3, but is slightly worse (-0.6) on ImageNet-V2 matched frequency.
- LeViT-384 performs slightly worse on ImageNet-Real (-0.2) and on ImageNet-V2 matched frequency (-0.4) compared to DeiT-Small.
- Despite slightly lower accuracy on these alternative test sets, LeViT retains its speed-accuracy advantage over EfficientNet and DeiT.
Comparison with the state of the art
- Token-to-token ViT variants require around 5× more FLOPs than LeViT-384 and more parameters for similar accuracies.
- Bottleneck Transformers, “Visual Transformers,” and the Pyramid Vision Transformer are about 5× slower than LeViT-192 at comparable accuracy.
- LeViT’s advantage comes from DeiT-like distillation, making it more accurate when trained on ImageNet alone.
- Pooling-based vision transformer (PiT) and CvT (ViT variants with pyramid structure) come close to LeViT, with PiT being the most promising but still 1.2× to 2.4× slower than LeViT.
Paper
LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference 2104.01136
Recommended Reading [Vision Transformers]