paint-brush
New AI Model Could Redefine How Machines Describe Imagesby@autoencoder
New Story

New AI Model Could Redefine How Machines Describe Images

tldt arrow

Too Long; Didn't Read

Researchers at the Mohamed bin Zayed University developed an AI model that can create text-based conversations tied to specific objects or regions in an image.
featured image - New AI Model Could Redefine How Machines Describe Images
Auto Encoder: How to Ignore the Signal Noise HackerNoon profile picture
0-item

Authors:

(1) Hanoona Rasheed, Mohamed bin Zayed University of AI and equally contributing first authors;

(2) Muhammad Maaz, Mohamed bin Zayed University of AI and equally contributing first authors;

(3) Sahal Shaji, Mohamed bin Zayed University of AI;

(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;

(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;

(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;

(9) Ming-Hsuan Yang, University of California - Merced and Google Research;

(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 5 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.


Supplementary Material (Part 1)


Supplementary Material (Part 2)

5. Experiments

We perform quantitative evaluations of GLaMM on six benchmarks: i) Grounded Conversation Generation (GCG), ii) referring-expression segmentation, iii) region-level captioning, iv) image-level captioning, v) conversational-style question answering and vi) phrase grounding. We present the first four benchmarks next, and the remaining are discussed in Appendix B.



Referring Expression Segmentation. In this task, the model processes an image and a text-based referring expression to output a segmentation mask. The prompt used is, “Please segment the <referring expression> in the image.” The model responds with "“Sure, it is <SEG>.“, where the <SGE> token is decoded to obtain the mask. We achieve better results over recent works like LISA on the refCOCO, refCOCO+, and refCOCOg validation and test sets in Tab. 4. This demonstrates the efficacy of our GranD dataset, offering the model extensive concept vocabulary during pre-training (refer to Fig. 5 (middle) and supplementary Fig. 8 for qualitative results).


Figure 5. Qualitative results of GLaMM’s performance across downstream tasks. The figure showcases examples from three tasks: region-level understanding (left), referring-expression segmentation (center), and image-level captioning (right), demonstrating its capabilities in offering in-depth region understanding, pixel-level groundings, and conversational abilities through an end-to-end training approach.


Table 5. Performance of GLaMM in Region-Level Captioning: Metrics include METEOR and CIDEr scores, assessed on Visual Genome and refCOCOg Datasets, exhibiting competitive results.


Table 6. Performance of GLaMM in Zero-Shot Image Captioning: Assessed on Flickr30k and NoCap datasets, showing favorable results compared to recent models in the field.


Region Level Captioning. In this task, models generate region-specific captions given an image, a user-specified region via a bounding box and related text. We utilize a prompt like, “Can you provide a detailed description of the region <bbox>?”, to instruct the model for this task, where the special token <bbox> is replaced with the actual region representations. We evaluate GLaMM on Visual Genome and refCOCOg, using METEOR and CIDEr metrics with results presented in Tab. 5. GLaMM shows improved results over GRiT and GPT4RoI after fine-tuning and demonstrates robust zero-shot performance, highlighting the significance of GranD’s region-text pairs (refer to Fig.5 (left) and supplementary Fig.9 for qualitative results).


Image Level Captioning. For this task, GLaMM responds to queries like, “Could you please give me a detailed description of the image?" with a textual description. We evaluate GLaMM’s zero-shot performance on Flickr30k [37] and NoCap [1] datasets, with Tab. 6 showing its favorable performance against recent image captioning models and other LMMs (refer to Fig. 5 (right) and supplementary Fig. 10 for qualitative results).


Refer to Appendix C for qualitative results on six downstream tasks, as well as conditional image generation.


This paper is available on arxiv under CC BY 4.0 DEED license.