Nov 12, 2024
A shape-optimized vision transformer that achieves competitive results with models twice its size, while being pre-trained with an equivalent amount of compute.
Employs a single memory-bound MHSA layer between efficient FFN layers, improving memory efficiency while enhancing channel communication.
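A minimal sketch of that sandwich layout, assuming a generic single-head attention between two FFNs; the dims, the small q/k width, and the block structure are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class SingleHeadSandwichBlock(nn.Module):
    """One memory-bound self-attention layer sandwiched between two FFNs."""
    def __init__(self, dim=256, qk_dim=16, ffn_ratio=2):
        super().__init__()
        def ffn(d):
            return nn.Sequential(nn.Linear(d, d * ffn_ratio), nn.GELU(),
                                 nn.Linear(d * ffn_ratio, d))
        self.ffn1, self.ffn2 = ffn(dim), ffn(dim)
        self.qk = nn.Linear(dim, 2 * qk_dim)   # small q/k width keeps attention cheap
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = qk_dim ** -0.5

    def forward(self, x):                       # x: (B, N, C)
        x = x + self.ffn1(x)
        q, k = self.qk(x).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        x = x + self.proj(attn @ self.v(x))     # single head, no head split
        return x + self.ffn2(x)

print(SingleHeadSandwichBlock()(torch.randn(1, 196, 256)).shape)  # (1, 196, 256)
```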
A hybrid vision transformer architecture featuring a novel token mixing operator called RepMixer, which significantly improves model efficiency.
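A rough sketch of the reparameterization idea behind RepMixer: at training time the token mixer is a depthwise conv plus an identity skip, and at inference the skip folds into the conv kernel so the block becomes a single conv. BN fusion and the rest of the FastViT block are omitted; dims are illustrative.

```python
import torch
import torch.nn as nn

C = 64
dw = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=True)  # depthwise mixer

x = torch.randn(1, C, 56, 56)
y_train = x + dw(x)                        # train-time: conv branch + identity skip

# Fold the identity into the kernel: add 1 at each channel's kernel center.
fused = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=True)
with torch.no_grad():
    fused.weight.copy_(dw.weight)
    fused.weight[:, 0, 1, 1] += 1.0        # identity expressed as a centered tap
    fused.bias.copy_(dw.bias)

print(torch.allclose(y_train, fused(x), atol=1e-5))  # True: one conv, no skip
```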
Revisits the design principles of ViT and its variants through latency analysis, identifies inefficient designs and operators, and proposes a dimension-consistent design paradigm for vision transformers together with a simple yet effective latency-driven slimming method to optimize inference speed.
A successor to Swin Transformer, addressing challenges like training stability, resolution gaps, and labeled data scarcity.
Introduces multi-axis attention, allowing global-local spatial interactions on arbitrary input resolutions with only linear complexity.
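A minimal sketch of the two partitioning schemes behind multi-axis attention, assuming attention is applied within each resulting group afterwards (size 7 is illustrative): block attention mixes tokens inside each local window, while grid attention mixes tokens at the same offset across windows, giving a sparse global interaction.

```python
import torch

def block_partition(x, p):                 # local non-overlapping windows
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x, g):                  # dilated global grid
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

x = torch.randn(1, 56, 56, 64)
print(block_partition(x, 7).shape)  # (64, 49, 64): attend within each window
print(grid_partition(x, 7).shape)   # (64, 49, 64): attend across the grid
```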
Demonstrates that existing self-supervised pre-training methods can produce general-purpose visual features by training on curated data from diverse sources, and proposes a new approach that combines techniques to scale pre-training with larger models and datasets.
An asymmetric encoder-decoder architecture that masks a high proportion of random image patches and reconstructs the missing pixels for self-supervision.
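A minimal sketch of the random masking step, following the shuffle-and-keep pattern with the paper's 75% mask ratio (the token dims are illustrative):

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                      # per-sample random scores
    ids_shuffle = noise.argsort(dim=1)            # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]            # indices of visible patches
    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_shuffle                   # the decoder unshuffles later

visible, _ = random_masking(torch.randn(2, 196, 768))
print(visible.shape)  # (2, 49, 768): the encoder only sees 25% of patches
```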
A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs.
Utilizes a masked image modeling task inspired by BERT, involving image patches and visual tokens, to pre-train vision Transformers.
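A rough sketch of that objective, assuming patch embeddings and tokenizer ids are already available: the paper uses blockwise masking and a dVAE tokenizer, so the random masking and random ids here are stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D, vocab = 2, 196, 768, 8192
patches = torch.randn(B, N, D)               # patch embeddings
token_ids = torch.randint(0, vocab, (B, N))  # ids from the visual tokenizer
mask = torch.rand(B, N) < 0.4                # ~40% of patches corrupted

# Replace masked patches with a learned mask embedding.
mask_embed = nn.Parameter(torch.zeros(1, 1, D))
x = torch.where(mask.unsqueeze(-1), mask_embed.expand(B, N, D), patches)
# ... x would pass through the transformer encoder here ...
logits = nn.Linear(D, vocab)(x)              # per-patch vocabulary logits
loss = F.cross_entropy(logits[mask], token_ids[mask])  # predict ids at masked spots
print(loss.item())
```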
A hybrid neural network built on the ViT architecture and the DeiT training method, designed for fast-inference image classification.
Investigates whether self-supervised learning gives Vision Transformers properties that stand out compared to convolutional networks, finding that self-supervised ViT features contain explicit information about the semantic segmentation of an image and also serve as excellent k-NN classifiers.
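A minimal sketch of such a k-NN evaluation on frozen features, using a plain majority vote (the paper weights votes by similarity; the features and labels here are random stand-ins):

```python
import torch
import torch.nn.functional as F

def knn_predict(query, bank_feats, bank_labels, k=20, num_classes=10):
    q = F.normalize(query, dim=-1)
    bank = F.normalize(bank_feats, dim=-1)
    sims = q @ bank.T                         # cosine similarity to the bank
    topk = sims.topk(k, dim=-1).indices       # k nearest neighbours
    votes = bank_labels[topk]                 # (B, k) neighbour labels
    counts = F.one_hot(votes, num_classes).sum(dim=1)
    return counts.argmax(dim=-1)              # majority-vote prediction

bank = torch.randn(1000, 384)                 # frozen self-supervised features
labels = torch.randint(0, 10, (1000,))
print(knn_predict(torch.randn(8, 384), bank, labels))
```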
Improves Vision Transformer (ViT) performance and efficiency by introducing convolutions, yielding the best of both designs.
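A minimal sketch of one way convolutions enter the block: a depthwise-separable conv computes a projection over the 2D token map before attention, so queries pick up local spatial context (a simplified rendering, not the paper's full design).

```python
import torch
import torch.nn as nn

B, H, W, C = 1, 14, 14, 192
tokens = torch.randn(B, H * W, C)

# Reshape tokens back to a 2D map so the projection can be convolutional.
x = tokens.transpose(1, 2).reshape(B, C, H, W)
dw = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C)  # depthwise
pw = nn.Conv2d(C, C, kernel_size=1)                       # pointwise
q = pw(dw(x)).flatten(2).transpose(1, 2)                  # back to (B, N, C)
print(q.shape)  # queries with local spatial context baked in
```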
A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision.
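A minimal sketch of the window partitioning with a cyclic shift, which is how alternating blocks let windows straddle the previous layer's window boundaries (window size and shift are illustrative; the attention and masking details are omitted):

```python
import torch

def window_partition(x, ws):                # x: (B, H, W, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 56, 56, 96)
# Roll the feature map so the next layer's windows cross the old boundaries,
# enabling connections between previously separate windows.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
windows = window_partition(shifted, ws=7)
print(windows.shape)  # (64, 49, 96): self-attention runs within each window
```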
A convolution-free vision transformer that uses a teacher-student strategy with an attention-based distillation token.
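A minimal sketch of the distillation token: a learned token appended alongside the class token, whose output head is trained against the teacher's predictions. The hard-label variant is shown; the encoder is omitted and all dims are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D, num_classes = 4, 196, 768, 1000
cls_tok = nn.Parameter(torch.zeros(1, 1, D))
dist_tok = nn.Parameter(torch.zeros(1, 1, D))
head, head_dist = nn.Linear(D, num_classes), nn.Linear(D, num_classes)

patches = torch.randn(B, N, D)                        # patch embeddings
x = torch.cat([cls_tok.expand(B, -1, -1),
               dist_tok.expand(B, -1, -1), patches], dim=1)
# ... x would pass through the transformer encoder here ...
logits_cls, logits_dist = head(x[:, 0]), head_dist(x[:, 1])

labels = torch.randint(0, num_classes, (B,))          # ground truth
teacher_labels = torch.randint(0, num_classes, (B,))  # teacher's hard predictions
loss = 0.5 * F.cross_entropy(logits_cls, labels) \
     + 0.5 * F.cross_entropy(logits_dist, teacher_labels)
print(loss.item())
```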
Images are split into patches, which are treated as tokens; a sequence of linear embeddings of these patches is fed into a Transformer.
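A minimal sketch of that patch-to-token step, using the common strided-conv formulation (patch size 16 and the 768-dim embedding follow the standard base configuration):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to slicing non-overlapping patches and
        # applying one shared linear projection to each.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # (1, 196, 768): 196 patch tokens of width 768
```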