Famous Models
===============

------------------------------------------------------------------------------

.. admonition:: contents

   .. container:: blue-box

      * `Grounding DINO `__

        1. `Introduction `__
        2. `Grounding DINO Performance `__
        3. `Advantages of Grounding DINO `__
        4. `Grounding DINO Architecture `__

      * `Segment Anything Model `__

        1. `Introduction to SAM `__
        2. `What is the Segment Anything Model? `__
        3. `SAM’s network architecture `__
        4. `How does SAM support real-life cases `__
        5. `Reference `__

----------------------------------------------------------------------------------------------------------

.. figure:: /Documentation/images/foundation-models/grounding-DINO/1.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

---------------------------------------------------------------------------------

Grounding DINO
---------------

As of March 2023, there is a new state-of-the-art (SOTA) zero-shot object detection model: Grounding DINO. In this post, we discuss its advantages, analyze the model architecture, and provide real prompt examples.

1. Introduction
_________________________

Most object detection models are trained to identify a narrow predetermined collection of classes. The main problem with this is the lack of flexibility. Every time you want to expand or change the set of recognizable objects, you have to collect data, label it, and train the model again. This — of course — is time-consuming and expensive.

Zero-shot detectors want to break this status quo by making it possible to detect new objects without re-training a model. All you have to do is change the prompt and the model will detect the objects you describe.

Below are images visualizing predictions made with Grounding DINO.

For these images, we prompted the model with four class names: 'piano', 'guitar', 'phone', and 'hat'. The model detected the objects of each class without any issues.

text prompt: ['piano', 'guitar', 'phone', 'hat']

.. figure:: /Documentation/images/foundation-models/grounding-DINO/2.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

.. figure:: /Documentation/images/foundation-models/grounding-DINO/3.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

.. figure:: /Documentation/images/foundation-models/grounding-DINO/4.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

2. Grounding DINO Performance
_______________________________

Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, that is, without any training data from COCO. After fine-tuning with COCO data, it reaches 63.0 AP. It also sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP.

*GLIP T vs. Grounding DINO T speed and mAP comparison*

.. figure:: /Documentation/images/foundation-models/grounding-DINO/5.webp
   :width: 700
   :align: center
   :alt: Alternative text for the image

3. Advantages of Grounding DINO
________________________________

**Zero-Shot Object Detection:** Grounding DINO excels at detecting objects even when they are not part of the predefined set of classes in the training data. This capability lets the model adapt to novel objects and scenarios, making it versatile and applicable to a wide range of real-world tasks.

**Referring Expression Comprehension (REC):** REC means identifying and localizing a specific object or region within an image based on a given textual description. In other words, instead of detecting people and chairs in an image and then writing custom logic to determine whether a chair is occupied, prompt engineering can be used to ask the model to detect only those chairs where a person is sitting. This requires the model to possess a deep understanding of both the language and the visual content, as well as the ability to associate words or phrases with corresponding visual elements.

**Elimination of Hand-Designed Components like NMS:** Grounding DINO simplifies the object detection pipeline by removing the need for hand-designed components such as Non-Maximum Suppression (NMS). This streamlines the model architecture and training process while improving efficiency and performance.
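For context, here is a minimal NumPy sketch of the greedy NMS step that conventional detectors rely on and that Grounding DINO does without. The function name ``nms``, the ``[x1, y1, x2, y2]`` box layout, and the 0.5 IoU threshold are illustrative assumptions, not part of Grounding DINO.

```python
import numpy as np


def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list[int]:
    """Greedy NMS. boxes is (N, 4) as [x1, y1, x2, y2]; returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with every remaining box
        xx1 = np.maximum(x1[i], x1[rest])
        yy1 = np.maximum(y1[i], y1[rest])
        xx2 = np.minimum(x2[i], x2[rest])
        yy2 = np.minimum(y2[i], y2[rest])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop boxes that overlap the current best box too strongly
        order = rest[iou <= iou_threshold]
    return keep


# Toy example: the first two boxes overlap heavily, the third is far away
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

In this toy example the second box is suppressed because it overlaps the higher-scoring first box. The threshold is exactly the kind of hand-tuned parameter that an end-to-end detector avoids.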

.. admonition:: For more information

   .. container:: blue-box

      * `Find the link to "Non-Maximum Suppression (NMS)." `__
      * `Find the link to "How to Code Non-Maximum Suppression (NMS) in Plain NumPy." `__

4. Grounding DINO Architecture
________________________________

Model architecture

Grounding DINO aims to merge concepts found in the DINO and GLIP papers. DINO, a transformer-based detection method, offers state-of-the-art object detection performance and end-to-end optimization, eliminating the need for handcrafted modules like NMS (Non-Maximum Suppression).

On the other hand, GLIP focuses on phrase grounding. This task involves associating phrases or words from a given text with corresponding visual elements in an image or video, effectively linking textual descriptions to their respective visual representations.

Text backbone and Image backbone — Multiscale image features are extracted using an image backbone like Swin Transformer, and text features are extracted with a text backbone like BERT.

.. figure:: /Documentation/images/foundation-models/grounding-DINO/10.webp
   :width: 700
   :align: center
   :alt: Alternative text for the image

The outputs of these two streams are fed into a feature enhancer, which transforms the two sets of features into a single unified representation space. The feature enhancer consists of multiple feature enhancer layers. Deformable self-attention is used to enhance the image features, and regular self-attention is used for the text features.

.. figure:: /Documentation/images/foundation-models/grounding-DINO/7.webp
   :width: 700
   :align: center
   :alt: Alternative text for the image

Grounding DINO aims to detect objects in an image that are specified by an input text. To leverage the input text effectively for object detection, a language-guided query selection step is used to select the most relevant features from both the image and text inputs. These queries guide the decoder in identifying the locations of objects in the image and assigning them appropriate labels based on the text descriptions.
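The selection idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the model's actual code: we assume the image and text features are already projected to the same dimension, score each image token by its best dot-product match with any text token, and keep the top-scoring tokens as query positions. The function ``select_queries`` and the toy feature values are hypothetical.

```python
import numpy as np


def select_queries(image_feats: np.ndarray, text_feats: np.ndarray, num_queries: int) -> np.ndarray:
    """Return indices of the image tokens that best match any text token.

    image_feats: (num_image_tokens, d), text_feats: (num_text_tokens, d).
    """
    # Similarity between every image token and every text token
    similarity = image_feats @ text_feats.T  # (num_image_tokens, num_text_tokens)
    # For each image token, keep only its strongest text match
    best_match = similarity.max(axis=1)
    # Indices of the highest-scoring image tokens, best first
    return np.argsort(best_match)[::-1][:num_queries]


# Toy features: 4 image tokens and 2 text tokens in a 2-dimensional space
image_feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.25, 0.25], [-1.0, 0.0]])
text_feats = np.array([[1.0, 0.0], [0.0, 2.0]])
query_idx = select_queries(image_feats, text_feats, num_queries=2)
```

Here the second and first image tokens are selected, in that order, because they align most strongly with one of the text tokens.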

.. figure:: /Documentation/images/foundation-models/grounding-DINO/8.webp
   :width: 700
   :align: center
   :alt: Alternative text for the image

A cross-modality decoder is then used to integrate text and image modality features. The cross-modality decoder operates by processing the fused features and decoder queries through a series of attention layers and feed-forward networks. These layers allow the decoder to capture the relationships between the visual and textual information, enabling it to refine the object detections and assign appropriate labels. After this step, the model proceeds with the final steps of object detection, including bounding box prediction, class-specific confidence filtering, and label assignment.
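To make the attention step more concrete, here is a stripped-down NumPy sketch of decoder queries attending first to image features and then to text features. It is only an illustration under strong assumptions: the ordering of the two attention steps is assumed, and learned projections, multiple heads, normalization, and the feed-forward network are all omitted. The names ``cross_attention``, ``queries``, ``image_feats``, and ``text_feats`` are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray):
    """Scaled dot-product attention; returns the attended output and the weights."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))  # (num_queries, num_keys)
    return weights @ values, weights


rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))       # 3 decoder queries
image_feats = rng.normal(size=(10, 8))  # 10 image tokens
text_feats = rng.normal(size=(4, 8))    # 4 text tokens

# Queries gather visual evidence, then textual evidence (residual connections)
img_out, img_weights = cross_attention(queries, image_feats, image_feats)
q = queries + img_out
txt_out, txt_weights = cross_attention(q, text_feats, text_feats)
q = q + txt_out
```

Each row of ``img_weights`` and ``txt_weights`` sums to one, so every query ends up as a weighted mix of image tokens and text tokens. That mixing is what lets a query tie a region to a phrase.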

How does it work?

Here is how Grounding DINO would work on this image:

.. figure:: /Documentation/images/foundation-models/grounding-DINO/8.webp
   :width: 700
   :align: center
   :alt: Alternative text for the image

The model first uses its understanding of language to identify the objects mentioned in the text prompt. For example, in the description “two dogs with a stick,” the model would identify the words “dogs” and “stick” as objects.

The model then generates a set of object proposals for each object identified in the natural language description. The object proposals are generated using a variety of features such as the color, shape, and texture of the objects.

Next, the model returns a score for each object proposal. The score measures how likely it is that the proposal contains an actual object.

The model then selects the top-scoring object proposals as the final detections. The final detections are the objects that the model is most confident are present in the image.

In this case, the model would likely detect the two dogs and the stick in the image. The model would also likely score the two dogs higher than the stick, because the dogs are larger and more prominent in the image.
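The scoring and selection steps above can be sketched as a simple post-processing function. This is a hedged illustration rather than Grounding DINO's real interface: we assume each proposal comes with one similarity score per prompt phrase, and the function ``filter_detections``, the 0.35 threshold, and the toy values are hypothetical.

```python
import numpy as np


def filter_detections(boxes: np.ndarray, logits: np.ndarray, phrases: list[str],
                      box_threshold: float = 0.35) -> list[tuple]:
    """Keep confident proposals and label each with its best-matching phrase.

    boxes: (N, 4); logits: (N, num_phrases) with scores in [0, 1].
    """
    scores = logits.max(axis=1)     # confidence of each proposal
    labels = logits.argmax(axis=1)  # index of the best-matching phrase
    keep = scores > box_threshold
    return [
        (phrases[label], float(score), box.tolist())
        for box, score, label in zip(boxes[keep], scores[keep], labels[keep])
    ]


phrases = ["dog", "stick"]
boxes = np.array([[20, 30, 200, 220], [150, 120, 260, 150], [0, 0, 15, 15]], dtype=float)
logits = np.array([[0.8, 0.1], [0.2, 0.6], [0.1, 0.2]])
detections = filter_detections(boxes, logits, phrases)
```

With these toy numbers the first proposal is kept as a "dog", the second as a "stick", and the third is discarded because neither phrase scores above the threshold.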

5. Reference
_____________________

.. admonition:: source

   .. container:: blue-box

      * Find the link to `"Grounded Language-Image Pre-training." `__
      * Find the link to `"DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection" `__
      * Find the link to `"Non-Maximum Suppression (NMS)." `__
      * Find the link to `"How to Code Non-Maximum Suppression (NMS) in Plain NumPy." `__

--------------------------------------------------------------------------------------

.. figure:: /Documentation/images/foundation-models/SAM/samm.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

Segment Anything Model
-------------------------

------------------------------------------------------------------------------------

.. figure:: /Documentation/images/foundation-models/SAM/SAM.png
   :width: 700
   :align: center
   :alt: Alternative text for the image

Welcome to the cutting edge of image segmentation with the Segment Anything model, or SAM. This groundbreaking model has changed the game by introducing real-time image segmentation, setting new standards in the field.

1. Introduction to SAM
_________________________

.. figure:: /Documentation/images/foundation-models/SAM/1.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

The Segment Anything model, or SAM, is a cutting-edge image segmentation model that allows for fast segmentation, offering unparalleled versatility in image analysis tasks. SAM is at the core of the Segment Anything initiative, a groundbreaking project that introduces a new model, a new task, and a new dataset for image segmentation.

SAM's design enables it to adapt to new image distributions and tasks without prior knowledge, a feature known as zero-shot transfer. Trained on the extensive SA-1B dataset, which contains over a billion masks spread across 11 million carefully selected images, SAM has displayed impressive zero-shot performance, in many cases surpassing previous fully supervised results.

.. admonition:: source

   .. container:: blue-box

      * `Find the link to "SA-1B Dataset." `__

In this article, we’ll provide SAM’s technical breakdown, take a look at its current use cases, and talk about its impact on the future of computer vision.

2. What is the Segment Anything Model?
_______________________________________

SAM is designed to revolutionize the way we approach image analysis by providing a versatile and adaptable foundation model for segmenting objects and regions within images.

Unlike traditional image segmentation models that require extensive task-specific modeling expertise, SAM eliminates the need for such specialization. Its primary objective is to simplify the segmentation process by serving as a foundational model that can be prompted with various inputs, including clicks, boxes, or text, making it accessible to a broader range of users and applications.

.. admonition:: source

   .. container:: blue-box

      * `Find the link to "image segmentation" `__
      * `Find the link to "foundation models guide" `__

.. figure:: /Documentation/images/foundation-models/SAM/2.webp
   :width: 700
   :align: center
   :alt: Alternative text for the image

What sets SAM apart is its ability to generalize to new tasks and image domains without the need for custom data annotation or extensive retraining. SAM accomplishes this by being trained on a diverse dataset of over 1 billion segmentation masks, collected as part of the Segment Anything project. This massive dataset enables SAM to adapt to specific segmentation tasks, similar to how prompting is used in natural language processing models.

SAM's versatility, real-time interaction capabilities, and zero-shot transfer make it an invaluable tool for various industries, including content creation, scientific research, augmented reality, and more, where accurate image segmentation is a critical component of data analysis and decision-making processes.

.. admonition:: source

   .. container:: blue-box

      * `Find the link to "segmentation masks" `__

3. SAM's network architecture
_____________________________

SAM’s capabilities rest primarily on its architecture, which consists of three main components: the image encoder, the prompt encoder, and the mask decoder.

.. figure:: /Documentation/images/foundation-models/SAM/3.png
   :width: 700
   :align: center
   :alt: Alternative text for the image

   *The Segment Anything (SA) project introduces a new task, model, and dataset for image segmentation*

.. figure:: /Documentation/images/foundation-models/SAM/4.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

   *The architecture of the Segment Anything Model (SAM). SAM consists of the following components: an Image Encoder, a Prompt Encoder, and a Mask Decoder*

✓ Image Encoder

.. figure:: /Documentation/images/foundation-models/SAM/10.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

The image encoder is at the core of SAM’s architecture, a sophisticated component responsible for processing and transforming input images into a comprehensive set of features.

Using a transformer-based approach, like what’s seen in advanced NLP models, this encoder compresses images into a dense feature matrix. This matrix forms the foundational understanding from which the model identifies various image elements.

.. admonition:: source

   .. container:: blue-box

      * `Find the link to "NLP models" `__

✓ Prompt Encoder

.. figure:: /Documentation/images/foundation-models/SAM/11.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

The prompt encoder is a unique aspect of SAM that sets it apart from traditional image segmentation models.

It interprets various forms of input prompts, be they text-based, points, rough masks, or a combination thereof.

This encoder translates these prompts into an embedding that guides the segmentation process. This enables the model to focus on specific areas or objects within an image as the input dictates.

✓ Mask Decoder

.. figure:: /Documentation/images/foundation-models/SAM/8.jpg
   :width: 700
   :align: center
   :alt: Alternative text for the image

.. figure:: /Documentation/images/foundation-models/SAM/9.png
   :width: 700
   :align: center
   :alt: Alternative text for the image

The mask decoder is where the magic of segmentation takes place. It synthesizes the information from both the image and prompt encoders to produce accurate segmentation masks.

This component is responsible for the final output, determining the precise contours and areas of each segment within the image.
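As a rough illustration of this final step, suppose the decoder has produced several candidate masks as per-pixel logits, along with a quality score for each candidate. A minimal NumPy sketch of turning that into one binary mask could look like the following. The names ``mask_logits``, ``scores``, and ``best_mask``, as well as the zero threshold, are assumptions made for this example.

```python
import numpy as np


def best_mask(mask_logits: np.ndarray, scores: np.ndarray, threshold: float = 0.0):
    """Pick the highest-scoring candidate and binarize its logits.

    mask_logits: (K, H, W) per-pixel logits for K candidates; scores: (K,).
    """
    best = int(np.argmax(scores))
    # Pixels with logits above the threshold belong to the segment
    return mask_logits[best] > threshold, best


# Two toy 2x2 candidates; the second one has the higher quality score
mask_logits = np.array([
    [[-2.0, 1.5], [0.3, -0.1]],
    [[2.0, 2.0], [2.0, -3.0]],
])
scores = np.array([0.60, 0.92])
mask, idx = best_mask(mask_logits, scores)
```

The resulting boolean array marks which pixels belong to the selected segment. Keeping several candidates with scores is one way to handle ambiguous prompts, where a single click could refer to a part or to the whole object.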

How these components interact is as vital for effective image segmentation as their individual capabilities:

The image encoder first creates a detailed understanding of the entire image, breaking it down into features that the model can analyze.

The prompt encoder then adds context, focusing the model’s attention based on the provided input, whether a simple point or a complex text description.

Finally, the mask decoder uses this combined information to segment the image accurately, ensuring that the output aligns with the input prompt’s intent.

.. admonition:: source

   .. container:: blue-box

      * `Read more at "segment anything model sam explained" `__

4. How does SAM support real-life cases?
___________________________________________

* **Versatile segmentation:**

SAM's promptable interface allows users to specify segmentation tasks using various prompts, making it adaptable to diverse real-world scenarios.

For example, SAM's versatile segmentation capabilities find application in environmental monitoring, where it can analyze ecosystems, detect deforestation, track wildlife, and assess land use. For wetland monitoring, SAM can segment aquatic vegetation and habitats. In deforestation detection, it can identify areas of forest loss. In wildlife tracking, it can help analyze animal behavior, and in land use analysis, it can categorize land use in aerial imagery. SAM's adaptability enables valuable insights for conservation, urban planning, and environmental research.

SAM can be asked to segment everything in an image, or it can be provided with a bounding box to segment a particular object in the image, as shown below on an example from the COCO dataset.

.. figure:: /Documentation/images/foundation-models/SAM/12.webp
   :width: 700
   :align: center
   :alt: Alternative text for the image

* **Zero-Shot Transfer:**

SAM's ability to generalize to new objects and image domains without additional training (zero-shot transfer) is invaluable in real-life applications. Users can apply SAM "out of the box" to new image domains, reducing the need for task-specific models.

Zero-shot transfer in SAM can streamline fashion retail by enabling e-commerce platforms to effortlessly introduce new clothing lines. SAM can instantly segment and present new fashion items without requiring specific model training, ensuring a consistent and professional look for product listings. This accelerates the adaptation to fashion trends, making online shopping experiences more engaging and efficient.

* **Real-Time Interaction:**

SAM's efficient architecture enables real-time interaction with the model. This is crucial for applications like augmented reality, where users need immediate feedback, or content creation tasks that require rapid segmentation.

* **Multimodal Understanding:**

SAM's promptable segmentation can be integrated into larger AI systems for more comprehensive multimodal understanding, such as interpreting both text and visual content on webpages.

* **Efficient Data Annotation:**

SAM's data engine accelerates the creation of large-scale datasets, reducing the time and resources required for manual data annotation. This benefit extends to researchers and developers working on their own segmentation tasks.

* **Equitable Data Collection:**

SAM's dataset creation process aims for better representation across diverse geographic regions and demographic groups, making it more equitable and suitable for real-world applications that involve varied populations.

* **Content Creation and AR/VR:**

SAM's segmentation capabilities can enhance content creation tools by automating object extraction for collages or video editing. In AR/VR, it enables object selection and transformation, enriching the user experience.

* **Scientific Research:**

SAM's ability to locate and track objects in videos has applications in scientific research, from monitoring natural occurrences to studying phenomena in videos, offering insights and advancing various fields.

.. admonition:: Overall

   .. container:: blue-box

      * *SAM's versatility, adaptability, and real-time capabilities make it a valuable tool for addressing real-life image segmentation challenges across diverse industries and applications.*

5. Reference
___________________

.. admonition:: source

   .. container:: blue-box

      * Find the link to `"segment anything model sam explained" `__
      * Find the link to `"segment anything model sam paper" `__
      * Find the link to `"SA-1B Dataset." `__
      * Find the link to `"image segmentation" `__
      * Find the link to `"foundation models guide" `__
      * Find the link to `"segmentation masks" `__
      * Find the link to `"NLP models" `__
      * Find the link to `"segment anything model" `__