Vignesh Ravikumar | Masked Autoencoders Are Scalable Vision Learners

Citation: He, Kaiming et al. “Masked Autoencoders Are Scalable Vision Learners.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 15979-15988.

Brief Summary:
An autoencoder is used to learn efficient coding of unlabeled data. By making an attempt to regenerate the input from the encoding, the encoding is validated and improved. By teaching the network to disregard irrelevant data (or “noise”), the autoencoder learns a representation (encoding) for a set of data, generally for dimensionality reduction. This paper introduces Masked Autoencoders where random patches of the input images are masked and the rebuilding is done on the missing pixels. On a standard ViT-Huge model calibrated on ImageNet-1K, they use this scalable self-supervised architecture and obtain 87.8% accuracy with a 3X shorter pre-training time, making MAE scalable for huge models.

Main Contributions:
An asymmetric architecture that enables the encoder to work solely on the partial, seen signal (without mask tokens) and use a lightweight decoder that reconstructs the entire signal from the latent representation and the mask tokens. Masking the majority of inputs generates a nontrivial and meaningful self-supervisory task.

Strengths:
This paper leverages on Vision Transformer’s ability to bridge the architectural gap that CNNs were possessing - ability to integrate mask tokens which were not possible on CNNs. The concept is very scalable and simple to deploy. The paper is well-written; the code is provided for reproducibility, and has enough mathematical explanations and supporting figures.

Weakness:
There is no mathematical formulation or justification for their method.

In-Depth Analysis of Strengths and Weaknesses:
The suggested MAE in this study is a straightforward autoencoder that makes full use of partial observation (the input picture is incomplete) to produce a complete image. With the exception of its unique asymmetric architecture, this autoencoder is nearly identical to other earlier (classical) autoencoders. The model can avoid training on every pixel in the image thanks to this approach. Masking the majority of inputs generates a nontrivial and meaningful self-supervisory task. The success of both designs, as previously indicated, demonstrates our ability to train our models with big train datasets successfully and efficiently. It indicates that the accuracy has increased and that acceleration training has sped up by at least three times. This scalable approach makes it viable for learning high-capacity models that generalize well (this can be seen in experiment results). However, the paper does not provide enough mathematical formulations or justifications for their method.

Experimental Results:
Ablation studies were performed on several parts of the architecture such as masking ratio, decoder design, mask token, reconstruction target, data augmentation, mask sampling strategy, and training schedule.
Comparisons with previous results:

Comparisons with self-supervised methods: This study reveals that MAE is readily scaled up and exhibits consistent progress with larger models. ViT-H provides 86.9% accuracy for MAE (224 size). A 448 size is used for fine-tuning, which results in 87.8% accuracy. BEiT performs worse than MAE despite being faster and simpler to use. Comparisons with supervised pre-training: When trained in IN1K, ViT-L degrades, but MAE performs better, and accuracy reaches saturation. MAE performs better with models of greater capacity.
Partial Fine-tuning: Results from probing and fine-tuning are mostly uncorrelated. Only one Transformer block has to be fine-tuned in order to considerably increase accuracy from 73.5% to 81.0%. When compared to MAE, MoCo v3 has a greater linear probing accuracy, but all of its partial fine-tuning outcomes are inferior to those of MAE. When four blocks are tuned, the gap is 2.6%.
Transfer Learning Experiments: Object detection and segmentation: MAE outperforms supervised, MoCo v3 and BEiT with an APbox of 50.3 for ViT-B and 53.3 for ViT-L. The token-based BEiT is inferior to or on par with the pixel-based MAE, which is also more quicker and easier to use.
Semantic segmentation: Results from pre-training are substantially better than from supervised pre training. The MAE based on pixels surpasses the BEiT based on tokens.
Classification task: The study technique exhibits strong scaling behavior on iNat. With larger models, accuracy increases noticeably and outperforms the prior best results by a wide margin.
Pixels vs. tokens: A study demonstrates that tokenization is not required for MAE since the use of dVAE tokens, while superior to unnormalized pixels in certain circumstances, is statistically equivalent to using normalized pixels in all cases.

Extensions:
Dimensionality reduction, image compression, image denoising, feature extraction, sequence-to-sequence prediction, and recommendation systems are other applications for MAE.

Comments:
In real-world computer vision tasks, this initiative technique is advantageous and useful, particularly for difficult problems with complexity. The model uses less energy and requires fewer inputs for pre-training by removing random patches.