Publication Type : Conference Proceedings
Publisher : IEEE
Source : 2025 3rd International Conference on Integrated Circuits and Communication Systems (ICICACS)
Url : https://doi.org/10.1109/icicacs65178.2025.10967925
Campus : Bengaluru
School : School of Computing
Year : 2025
Abstract : Image captioning, which integrates computer vision and natural language processing, has become a critical area of work for improving accessibility and advancing technology. In this study, the performance of pre-trained convolutional neural networks (DenseNet, VGG-16, ResNet, and Xception) coupled with LSTM units has been evaluated for generating captions on the Flickr8k dataset. The models were compared on the basis of BLEU and ROUGE scores. ResNet-LSTM produced the best scores, indicating a stronger ability to generate accurate, contextually appropriate captions. VGG-16 with LSTM ranked second, performing nearly on par, while Xception with LSTM gave moderate results. DenseNet with LSTM scored lowest, suggesting weaker feature extraction for this task. These results support the practical use of ResNet-LSTM in accessibility applications, from assistive technology for visually impaired users to automatic content generation. The comparative study further highlights each model's relative strengths and weaknesses, enabling a more informed choice of model for a given application. © 2025 IEEE.
Cite this Research Publication : L D Mukil, Tripty Singh, Mansi Sharma, Comparing the Pre-Trained Models with LSTM Combination to Improve Image Captioning Technology, 2025 3rd International Conference on Integrated Circuits and Communication Systems (ICICACS), IEEE, 2025, https://doi.org/10.1109/icicacs65178.2025.10967925
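For context on the evaluation metric named in the abstract, the following is a minimal, illustrative sketch of BLEU-1 (clipped unigram precision with a brevity penalty) in plain Python. It is not the authors' evaluation code; in practice one would use an established implementation such as NLTK's `bleu_score` module.

```python
from collections import Counter
import math

def bleu1(candidate: str, references: list[str]) -> float:
    """Illustrative BLEU-1: clipped unigram precision times a brevity penalty."""
    cand = candidate.split()
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    # Clip each candidate unigram count by its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(count, max_ref[word]) for word, count in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty against the reference length closest to the candidate's.
    ref_len = min((len(r.split()) for r in references),
                  key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision
```

A perfect match scores 1.0, while a correct but truncated caption is penalized by the brevity term; full BLEU additionally averages higher-order n-gram precisions (BLEU-4 being the common default).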