Utilizing Vision Large Language Models for Automatic Image Annotations: A Comparative Study

Ali Abd Almisreb; Tarik Namas; Ozge Buyukdagli; Alessandro Cantelli-Forti; Edmond Jajaga; Nurlaila Ismail

Title	Utilizing Vision Large Language Models for Automatic Image Annotations: A Comparative Study
Publication Type	Conference Proceedings
Year of Publication	2025
Date Published	01-2025
Conference Name	16th ICT Innovations Conference 2024
Authors	Almisreb, AAbd, Namas, T, Buyukdagli, O, Cantelli-Forti, A, Jajaga, E, Ismail, N
Keywords	Grounding-DINO-Tiny, image annotation, OWLv2
Abstract	Image annotation can be a time-consuming task. This study examines how well the OWLv2 and Grounding-DINO-Tiny models annotate objects in four categories: airplanes, birds, drones, and helicopters. We report preliminary findings based on a comparison of confidence scores and detection rates. The Grounding-DINO-Tiny model performed well, producing no empty frames and relatively high confidence scores most of the time for distinct categories such as helicopters and drones. It fared poorly on birds, however, with lower confidence scores and more annotations scoring below 50%, which indicates a weakness in identifying birds. The proposed model, OWLv2, produced moderate results, and the quality of the data differed from one category to another, which undermined the model's reliability. To improve future performance, we make several recommendations: improving the ability to identify birds, eliminating inconsistencies in the datasets, and improving the quality of the data gathered.
Refereed Designation	Refereed