ECO: Ensembling Context Optimization for Vision-Language Models

Image recognition has recently witnessed a paradigm shift: vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable zero-shot transfer capabilities by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts to maximize CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves results considerably and consistently compared with relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
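
The abstract does not say how the learned prompts are combined, so the following is only a minimal sketch of one common way to ensemble prompts without extra inference cost: the per-class text embeddings from all prompts are averaged once, ahead of time, so that classification still compares the image against a single embedding per class. This is an assumption for illustration and is not taken from the paper. All names (`normalize`, `ensemble_class_embeddings`, `classify`, `noisy_one_hot`) are hypothetical, and the embeddings are toy vectors standing in for the outputs of CLIP's text and image encoders.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_class_embeddings(prompt_embeddings):
    """Average text embeddings over prompts, giving one embedding per class.

    prompt_embeddings[m][c] is the text embedding of class c under prompt m.
    Assumption: the ensemble is a plain mean of unit-normalized embeddings.
    """
    num_classes = len(prompt_embeddings[0])
    ensembled = []
    for c in range(num_classes):
        members = [normalize(per_prompt[c]) for per_prompt in prompt_embeddings]
        mean = [sum(column) / len(members) for column in zip(*members)]
        ensembled.append(normalize(mean))
    return ensembled


def classify(image_embedding, class_embeddings):
    """Return the index of the class with the highest cosine similarity."""
    image = normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(image, c)) for c in class_embeddings]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return predicted, scores


# Toy data (hypothetical): each class is a noisy one-hot direction.
rng = random.Random(0)
NUM_PROMPTS, NUM_CLASSES, DIM = 4, 3, 8


def noisy_one_hot(index):
    """One-hot vector at `index` with small uniform noise on every entry."""
    return [(1.0 if i == index else 0.0) + rng.uniform(-0.1, 0.1)
            for i in range(DIM)]


# One embedding per (prompt, class) pair, as a text encoder would produce.
prompt_embeddings = [[noisy_one_hot(c) for c in range(NUM_CLASSES)]
                     for _ in range(NUM_PROMPTS)]

# Done once, offline: the number of prompts no longer matters afterwards.
class_embeddings = ensemble_class_embeddings(prompt_embeddings)

# At inference time only NUM_CLASSES dot products are needed per image.
image_embedding = noisy_one_hot(1)
predicted, scores = classify(image_embedding, class_embeddings)
print(predicted)
```

In this sketch the number of prompts affects only the offline averaging step, which is one way a claim of "no additional cost at inference time" can hold: a single-prompt classifier and the ensemble both score an image against the same number of class embeddings.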

Agnolucci, L., Baldrati, A., Todino, F., Becattini, F., Bertini, M., Del Bimbo, A. (2023). ECO: Ensembling Context Optimization for Vision-Language Models. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (pp. 2803-2807). New York: IEEE [10.1109/ICCVW60793.2023.00299].

Agnolucci L.; Baldrati A.; Todino F.; Becattini F.; Bertini M.; Del Bimbo A.

2023-01-01

Year: 2023
ISBN: 979-8-3503-0744-3
Appears in categories: 4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format	Access
Agnolucci_ECO_Ensembling_Context_Optimization_for_Vision-Language_Models_ICCVW_2023_paper.pdf	1.38 MB	Adobe PDF	Open access; type: post-print; license: public, with copyright
ECO_Ensembling_Context_Optimization_for_Vision-Language_Models.pdf	1.12 MB	Adobe PDF	Not available; type: publisher's PDF; license: non-public, private/restricted access

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1277507