ABSTRACT: Describingan image is an effortless task for human beings. But achieving the same bycomputers, without the aid of humans is a challenging task. To correctlydescribe an image, accurate recognition of objects, their attributes,relationships and scene information is required. This idea of describingcontents of an image can have a lot of use in social networking apps or softwarethat want to process visual data. In this paper, we aim to generate sentencesbased on images using deep neural networks. We have used MS-COCO dataset totrain the Convolutional Neural Network, and have used PASCAL VOC to fine-tuneit. The labels are everyday occurring objects. The sentences are generatedusing a fixed template. It detects maximum of four objects and describes thelocalization between them [above, below, left and right]. This system achievesan accuracy of 72.7\% when evaluated against a subset of MS-COCO dataset forobject classification.
KEYWORDS: ObjectClassification, Deep Learning, Convolutional Neural Network, BatchNormalization, ReLU