Utilizing Vision Large Language Models for Automatic Image Annotations: A Comparative Study

Title: Utilizing Vision Large Language Models for Automatic Image Annotations: A Comparative Study
Publication Type: Conference Proceedings
Year of Publication: 2025
Date Published: 01-2025
Conference Name: 16th Conference ICT Innovations conference 2024
Authors: Almisreb, AAbd, Namas, T, Buyukdagli, O, Cantelli-Forti, A, Jajaga, E, Ismail, N
Keywords: Grounding-DINO-Tiny, image annotation, OWLv2
Abstract

Image annotation can be a time-consuming task. This study examines how
well the OWLv2 and Grounding-DINO-Tiny models annotate objects in four
categories: airplanes, birds, drones, and helicopters. We report
preliminary findings based on a comparison of confidence scores and
detection rates. Grounding-DINO-Tiny performed well: it produced no
empty frames and, most of the time, relatively high confidence scores
for distinctive categories such as helicopters and drones. It fared
poorly on birds, however, with lower confidence scores and more
annotations scoring below 50%, which indicates a weakness in
identifying birds. OWLv2 produced moderate results, and data quality
varied from one category to another, which undermined the model's
reliability. To improve future performance, we recommend improving
bird identification, eliminating inconsistencies in the datasets, and
improving the quality of the data gathered.
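
The abstract compares the models by detection rate, empty frames, confidence scores, and the share of annotations below 50%. The sketch below shows one way such summary figures could be computed from per-image confidence scores. The function name `summarize`, the input layout (one list of scores per image), and the exact metric definitions are assumptions made for illustration; the record does not state how the authors computed them.

```python
from statistics import mean


def summarize(frames, threshold=0.5):
    """Summarize detections for one model and one category.

    frames: list of lists; each inner list holds the confidence scores
            (0.0 to 1.0) of the annotations produced for one image.
            An empty inner list is an "empty frame" (no annotation).
    threshold: cutoff for low-confidence annotations (assumed 0.5 here,
               matching the 50% figure mentioned in the abstract).
    """
    scores = [s for frame in frames for s in frame]
    non_empty = sum(1 for frame in frames if frame)
    low = sum(1 for s in scores if s < threshold)
    return {
        # Fraction of images with at least one annotation (assumed definition).
        "detection_rate": non_empty / len(frames) if frames else 0.0,
        "empty_frames": len(frames) - non_empty,
        "mean_confidence": mean(scores) if scores else 0.0,
        # Fraction of all annotations scoring below the threshold.
        "share_below_threshold": low / len(scores) if scores else 0.0,
    }
```

Calling `summarize` once per model and category would give a small table of figures that can be compared side by side, for example birds against helicopters for the same model.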

Refereed Designation: Refereed