Oct 25, 2024
A foundation model for promptable visual segmentation in images and videos, built on a simple transformer architecture with streaming memory for real-time video processing.
Introduces a novel image segmentation task, model, and dataset, aiming to enable promptable, zero-shot transfer in computer vision.
Employs Vision Transformers, CLIP-style contrastive pre-training, and a bipartite matching loss for open-vocabulary detection, using image-level pre-training, multihead attention pooling, and mosaic image augmentation.
A novel transformer-based object detection model that treats detection as a set prediction problem, eliminating the need for hand-designed components such as anchors and non-maximum suppression.
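The set-prediction view hinges on a one-to-one matching between predictions and ground-truth objects. As a toy sketch, the minimum-cost assignment below is found by brute force over permutations; the actual model uses the Hungarian algorithm, and its matching cost combines class probability, L1 box distance, and generalized IoU rather than the simple center distance assumed here.

```python
from itertools import permutations

def match_cost(pred, gt):
    # Toy pairwise cost: L1 distance between box centers.
    # (The real cost mixes classification score, L1 box loss, and GIoU.)
    return sum(abs(p - g) for p, g in zip(pred, gt))

def bipartite_match(preds, gts):
    """Brute-force minimum-cost one-to-one assignment of predictions to
    ground-truth boxes; perm[j] is the prediction index matched to gt j."""
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost

preds = [(0.9, 0.9), (0.1, 0.1), (0.5, 0.5)]
gts = [(0.0, 0.0), (1.0, 1.0)]
assignment, cost = bipartite_match(preds, gts)
# Prediction 1 is matched to the first ground truth, prediction 0 to the second.
```

Unmatched predictions (here, prediction 2) are supervised toward a "no object" class, which is what removes the need for NMS.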
Proposes a multi-stage approach where detectors are trained with progressively higher IoU thresholds, improving selectivity against false positives.
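A minimal sketch of the idea: as the positive-sample IoU threshold rises across stages, fewer and better-localized proposals qualify as positives. The `iou` helper and the toy boxes below are illustrative, not from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12)]

# Each cascade stage treats only proposals above its (increasing) IoU
# threshold as positives, so later stages train on better-localized boxes.
kept = {t: [p for p in proposals if iou(p, gt) >= t] for t in (0.5, 0.6, 0.7)}
```

The third proposal (IoU about 0.47) is a positive for no stage, while the second (about 0.68) drops out at the 0.7 stage.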
Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples.
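The down-weighting is done by the focal loss, which scales standard cross-entropy by a factor of (1 - p_t)^gamma. A minimal binary version (the values of alpha and gamma below are the paper's defaults):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights well-classified examples by
    (1 - p_t)**gamma. p is the predicted probability of the positive
    class, y is the 0/1 label."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# A well-classified example (p_t = 0.95) contributes orders of magnitude
# less loss than a hard one (p_t = 0.3), which is how one-stage dense
# detectors cope with the flood of easy background examples.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.3, 1)
```

With gamma = 0 the expression reduces to alpha-weighted cross-entropy; raising gamma shifts ever more of the total loss onto hard examples.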
Extends Faster R-CNN to solve instance segmentation tasks, by adding a branch for predicting an object mask in parallel with the existing branch.
Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids.
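The core merge step can be sketched with plain nested lists: each top-down map is upsampled 2x and added to the same-resolution bottom-up map. This is a simplification; the real pathway applies 1x1 convolutions to the lateral maps and 3x3 convolutions after each sum, omitted here.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def merge(top, lateral):
    """Top-down merge: upsample the coarser map and add the lateral
    (same-resolution) bottom-up map elementwise."""
    up = upsample2x(top)
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, lateral)]

# Toy bottom-up maps at 4x4, 2x2, and 1x1 resolution (coarsest = most semantic).
c2 = [[1] * 4 for _ in range(4)]
c3 = [[2] * 2 for _ in range(2)]
c4 = [[3]]

p4 = c4
p3 = merge(p4, c3)  # 2x2 map; semantics from c4 flow into finer levels
p2 = merge(p3, c2)  # 4x4 map; every level now mixes coarse and fine features
```

Each output level thus carries high-level semantics at its own spatial resolution, which is what makes the pyramid useful for detecting objects across scales.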
Discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature-map location.
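Generating that discretized box set is mechanical; a hedged sketch for a single square feature map (the scale and aspect-ratio values are illustrative, and the multi-map scale schedule is omitted):

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """SSD-style default boxes as (cx, cy, w, h) in normalized [0, 1]
    coordinates: one box per aspect ratio, centered on each cell of an
    fmap_size x fmap_size feature map."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Equal-area boxes: width/height ratio is ar, area is scale^2.
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

# A 4x4 map with 3 aspect ratios yields 4 * 4 * 3 = 48 default boxes.
boxes = default_boxes(4, 0.2, (1.0, 2.0, 0.5))
```

At prediction time the network regresses offsets relative to each default box and scores it per class, so this fixed grid is what turns detection into dense per-box classification.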
A region proposal network (RPN) and a Fast R-CNN detector collaboratively predict object regions by sharing convolutional features.
Processes the entire image through a CNN, employs RoI pooling to extract fixed-length feature vectors from region proposals, then performs classification and bounding-box regression.
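RoI pooling max-pools each (arbitrarily sized) region of the shared feature map into a fixed grid so the downstream fully connected layers always see the same input size. A simplified sketch, assuming the RoI divides evenly into bins; the real operation handles fractional bin boundaries:

```python
def roi_pool(fmap, roi, out_size=2):
    """Max-pool a rectangular region of interest (x1, y1, x2, y2, in
    feature-map cells) into a fixed out_size x out_size grid.
    Simplification: the RoI is assumed to divide evenly into bins."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) // out_size
    bin_h = (y2 - y1) // out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            cells = [fmap[y1 + by * bin_h + dy][x1 + bx * bin_w + dx]
                     for dy in range(bin_h) for dx in range(bin_w)]
            row.append(max(cells))  # max over every cell in this bin
        out.append(row)
    return out

# An 8x8 feature map where cell (r, c) holds the value 10*r + c.
fmap = [[r * 10 + c for c in range(8)] for r in range(8)]
pooled = roi_pool(fmap, (0, 0, 4, 4), out_size=2)  # -> 2x2 output
```

Because pooling happens on the shared feature map, the expensive convolutional pass runs once per image rather than once per proposal, which is the main speedup over the original R-CNN.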
Uses selective search for region proposals, a CNN for feature extraction, and SVMs for classification, followed by bounding-box offset regression.