Ritvik Rastogi

Sep 8, 2024

Vision Transformers

An encoder-decoder architecture that reconstructs input images from randomly masked patches, using a high masking ratio for self-supervised pre-training.
A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs.
Utilizes a masked image modeling task inspired by BERT, using image patches and visual tokens to pretrain vision Transformers.
A hybrid neural network built upon the ViT architecture and the DeiT training method for fast-inference image classification.
Improves the Vision Transformer (ViT) in performance and efficiency by introducing convolutions, yielding the best of both designs.
A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision.
A convolution-free vision transformer that uses a teacher-student strategy with attention-based distillation tokens.
Images are segmented into patches, which are treated as tokens, and a sequence of linear embeddings of these patches is fed to a Transformer (see the patch-embedding sketch after this list).
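As a quick illustration of the patch-embedding step described in the last item, here is a minimal sketch in PyTorch. The hyperparameters (224x224 images, 16x16 patches, 768-dimensional embeddings) follow the ViT-Base configuration but are illustrative assumptions, not any paper's exact code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one.

    Image size, patch size, and embedding dimension are assumed values
    (ViT-Base uses 16x16 patches and 768-dim embeddings).
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing non-overlapping
        # patches and applying a shared linear projection to each of them.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (batch, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x

# Usage: embed a dummy batch; the resulting token sequence is what a
# Transformer encoder would consume (after adding a class token and
# position embeddings).
if __name__ == "__main__":
    embed = PatchEmbedding()
    imgs = torch.randn(2, 3, 224, 224)
    tokens = embed(imgs)
    print(tokens.shape)  # torch.Size([2, 196, 768])
```

The strided convolution is simply a convenient way to apply one shared linear projection to every non-overlapping patch; its output, flattened into a sequence, is the token input that the architectures listed above build on.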