Deep Learning and Computer Vision have become highly relevant in the Cultural Heritage domain in the last few years, with many applications such as smart audio guides, interactive museums and augmented reality. All these technologies require large amounts of data to work effectively and be useful to the user. In the context of artworks, such data is annotated by experts in an expensive and time-consuming process. In particular, for each artwork, an image of the artwork and a description sheet have to be collected in order to perform common tasks like Visual Question Answering. In this paper we propose a method for Visual Question Answering that generates a description sheet at runtime, which can be used to answer both visual and contextual questions about the artwork, completely avoiding the image and the annotation process. For this purpose, we investigate the use of GPT-3 for generating descriptions of artworks, analyzing the quality of the generated descriptions through captioning metrics. Finally, we evaluate the performance on Visual Question Answering and captioning tasks.

Bongini, P., Becattini, F., Del Bimbo, A. (2023). Is GPT-3 All You Need for Visual Question Answering in Cultural Heritage?. In Computer Vision – ECCV 2022 Workshops. ECCV 2022 (pp.268-281). Cham : Springer [10.1007/978-3-031-25056-9_18].

Is GPT-3 All You Need for Visual Question Answering in Cultural Heritage?

Becattini F.;
2023-01-01

Year: 2023
ISBN: 978-3-031-25055-2
ISBN: 978-3-031-25056-9


Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1230155