MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Citation: Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph, “MViTv2: Improved Multiscale Vision Transformers for Classification and Detection,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Brief Summary:
Convolutional neural networks (CNNs) are beginning to be replaced by Vision Transformers (ViTs) in many computer vision tasks. However, CNNs still possess architectural advantages over ViTs. For instance, locality is built into the convolution operation, which suits image inputs because nearby pixels tend to be correlated. In addition, pooling layers inside the network provide computational efficiency and let deeper layers learn hierarchical, higher-level features. The MViT paper introduced pooling into ViTs, and this paper is an extension of that work. For improved accuracy and computational efficiency, this work proposes decomposed relative positional embeddings and residual pooling connections. With these changes, MViTv2 reports state-of-the-art performance on three heavily studied visual recognition tasks: 88.8% accuracy on ImageNet classification, 56.1 APbox on COCO object detection, and 86.1% on Kinetics-400 video classification.
Main Contributions:
Decomposed relative positional embeddings that inject shift-invariant position information into Transformer blocks based on relative location distances between tokens.
A residual pooling connection that compensates for the effect of pooling strides in the attention computation.
Combining hybrid window attention with pooling attention to improve the accuracy/compute trade-off.
Strengths:
This paper achieves state-of-the-art results in ImageNet classification with 88.8% accuracy.
This paper shows better performance than other ViT architectures such as Swin Transformer.
Beyond classification, the authors also evaluate on object detection and video recognition, again with state-of-the-art results.
The paper is well written, provides code for reproducibility, and includes sufficient mathematical explanations and supporting figures.
Weaknesses:
Compared to MViTv1, the paper introduces only a few incremental changes to the existing architecture.
The paper does not discuss where the model would fail.
In-Depth Analysis of Strengths and Weaknesses:
MViTv1 relies on absolute positional embeddings for image patches, which ignore the fundamental principle of shift-invariance. This paper instead adds relative positional embeddings, which depend only on the relative location distance between tokens, into the pooled self-attention computation. The distance computation is decomposed along the spatiotemporal axes to reduce complexity, as sketched below.
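The following is a minimal PyTorch-style sketch of decomposed relative positional embeddings for a 2D (H x W) token grid with a single attention head; the module and parameter names (DecomposedRelPosAttention, rel_h, rel_w) are illustrative assumptions, not the authors' code. The positional bias for a query-key pair is the dot product of the query with per-axis embedding vectors indexed by the relative height and width distances, so only O(H + W) embedding vectors are learned instead of a joint table over all token pairs.

```python
import torch
import torch.nn as nn


class DecomposedRelPosAttention(nn.Module):
    """Single-head attention with a decomposed relative positional bias (sketch)."""

    def __init__(self, dim, grid_size):
        super().__init__()
        self.h, self.w = grid_size
        self.qkv = nn.Linear(dim, dim * 3)
        # One learnable embedding table per axis instead of a joint table
        # over all (query, key) position pairs.
        self.rel_h = nn.Parameter(torch.zeros(2 * self.h - 1, dim))
        self.rel_w = nn.Parameter(torch.zeros(2 * self.w - 1, dim))

    def forward(self, x):
        # x: (B, H*W, dim), tokens flattened row-major over the H x W grid.
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / C ** 0.5  # (B, N, N)

        # Relative distances along each axis, shifted to be non-negative indices.
        idx_h = torch.arange(self.h)[:, None] - torch.arange(self.h)[None, :] + self.h - 1
        idx_w = torch.arange(self.w)[:, None] - torch.arange(self.w)[None, :] + self.w - 1

        q_grid = q.view(B, self.h, self.w, C)
        # Bias(i, j) = q_i . R^h_{h(i), h(j)} + q_i . R^w_{w(i), w(j)}
        bias_h = torch.einsum("bhwc,hkc->bhwk", q_grid, self.rel_h[idx_h])  # (B, H, W, H)
        bias_w = torch.einsum("bhwc,wkc->bhwk", q_grid, self.rel_w[idx_w])  # (B, H, W, W)
        bias = (bias_h[..., :, None] + bias_w[..., None, :]).reshape(B, N, N)

        attn = (attn + bias).softmax(dim=-1)
        return attn @ v
```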
Residual pooling connections are used to increase information flow and facilitate the training of pooling attention blocks in MViT. Because adding back the pooled query sequence is cheap, this change keeps the low-complexity attention computation obtained by pooling keys and values while improving accuracy.
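A minimal sketch of the residual pooling connection is shown below, assuming strided max pooling on the query, key, and value sequences of a 2D token grid; the class and helper names (ResidualPoolingAttention, _pool) are illustrative, and the actual model uses its own pooling operators. The key step is adding the pooled query to the attention output, i.e. Z = Attn(Q_pooled, K_pooled, V_pooled) + Q_pooled.

```python
import torch
import torch.nn as nn


class ResidualPoolingAttention(nn.Module):
    """Pooling attention with the residual pooling connection (sketch)."""

    def __init__(self, dim, q_stride=2, kv_stride=2):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Strided pooling shortens the query and key/value sequences.
        self.pool_q = nn.MaxPool2d(q_stride, q_stride)
        self.pool_kv = nn.MaxPool2d(kv_stride, kv_stride)

    @staticmethod
    def _pool(t, pool, hw):
        # (B, N, C) -> pooled (B, N', C) by reshaping to the 2D grid.
        B, N, C = t.shape
        h, w = hw
        t = t.transpose(1, 2).reshape(B, C, h, w)
        t = pool(t)
        return t.flatten(2).transpose(1, 2)

    def forward(self, x, hw):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self._pool(q, self.pool_q, hw)   # pooled query
        k = self._pool(k, self.pool_kv, hw)
        v = self._pool(v, self.pool_kv, hw)

        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        out = attn.softmax(dim=-1) @ v

        # Residual pooling connection: add the pooled query back to the
        # attention output, Z = Attn(Q_pooled, K_pooled, V_pooled) + Q_pooled.
        out = out + q
        return self.proj(out)
```

For example, with an input of shape (2, 196, 96) and hw=(14, 14), the output contains 49 tokens of the same channel dimension, reflecting the query stride of 2.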
Self-attention in Transformers has a computational cost that grows quadratically with the number of tokens. This work combines hybrid window attention (Hwin) with pooling attention: attention is restricted to local windows in most blocks, while the last block of each stage, whose output feeds the Feature Pyramid Network, uses global attention. With this combination the model outperforms the Swin Transformer.
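To make the Hwin idea concrete, here is a small PyTorch-style sketch, assuming a stage in which every block uses non-overlapping window attention except the last one, which uses global attention so that cross-window information is propagated. The helpers window_partition and window_unpartition and the use of nn.TransformerEncoderLayer as a stand-in for the pooled attention block are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


def window_partition(x, win, hw):
    # (B, H*W, C) -> (B * num_windows, win*win, C)
    B, N, C = x.shape
    h, w = hw
    x = x.view(B, h // win, win, w // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)


def window_unpartition(x, win, hw, B):
    # Inverse of window_partition: back to (B, H*W, C).
    h, w = hw
    C = x.shape[-1]
    x = x.view(B, h // win, w // win, win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, h * w, C)


class HybridWindowStage(nn.Module):
    """A stage with window attention everywhere except a global last block (sketch)."""

    def __init__(self, dim, depth, num_heads, win):
        super().__init__()
        self.win = win
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, x, hw):
        B = x.shape[0]
        for i, blk in enumerate(self.blocks):
            if i == len(self.blocks) - 1:
                x = blk(x)  # global attention: connects all windows before FPN
            else:
                x = window_partition(x, self.win, hw)
                x = blk(x)  # attention restricted to each local window
                x = window_unpartition(x, self.win, hw, B)
        return x
```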
The paper achieves state-of-the-art results in ImageNet classification, object detection, and video recognition; the experimental results are summarized in the next section. The paper is written clearly, with supporting mathematical explanations and equations. However, the changes seem incremental relative to the previous iteration of MViT and do not improve the results dramatically. The paper also does not discuss possible limitations.
Experimental Results:
In this study, the improved multiscale vision transformer is evaluated on three downstream tasks.
Classification: On the ImageNet leaderboard, MViTv2 achieves roughly 1% higher accuracy with about 10% fewer FLOPs and parameters than MViTv1, and considerable gains over the Swin and DeiT transformers. MViTv2 also reports state-of-the-art performance when tested with larger crop sizes.
Object Detection: With the Mask R-CNN framework, the MViTv2 backbone outperforms transformer-based backbones (Swin, ViL, and MViTv1) as well as CNN backbones (ResNet, ResNeXt, etc.). MViTv2 shows about a 2.5 APbox/APmask improvement over the Swin backbone. Similar encouraging trends were observed in experiments with the Cascade Mask R-CNN framework.
Self-attention ablation studies: On COCO and ImageNet (IN), the MViTv2 attention mechanisms (pooling attention and Hwin) are contrasted with alternative attention mechanisms (global/full, windowed, and shifted-window attention). The following findings were made:
Removing cross-window connections by using plain window attention in the ViT-B-based models reduces top-1 accuracy by 2% but lowers FLOPs and memory consumption. Hwin recovers +1.7% accuracy, whereas shifted-window (Swin) attention recovers only +0.4%. Pooling attention shows the best accuracy/compute trade-off, with 38% fewer FLOPs. Combining Hwin with pooling attention gives the best object detection performance.
Extensions:
This work can be extended to other vision applications, for example by serving as a vision backbone for instance segmentation. It could also be applied to visual inspection and quality management in manufacturing, cancer and tumor detection in healthcare, and vehicle re-identification and pedestrian detection in transportation.
Comments:
MViT, which introduced a general hierarchical architecture for visual recognition, shows better performance than other ViT architectures (e.g., the Swin Transformer) and achieves state-of-the-art results in tasks such as classification, object detection, instance segmentation, and video recognition. Transformer-based architectures are developing rapidly and are expected to be widely used in future studies.