Marchetti, F., Mordan, T., Becattini, F., Seidenari, L., Bimbo, A.D., Alahi, A. (2024). CrossFeat: Semantic Cross-modal Attention for Pedestrian Behavior Forecasting. IEEE Transactions on Intelligent Vehicles, 1-10 [10.1109/tiv.2024.3449046].
CrossFeat: Semantic Cross-modal Attention for Pedestrian Behavior Forecasting
Becattini, Federico;
2024-01-01
Abstract
Forecasting pedestrian behaviors is essential for autonomous vehicles to ensure safety in urban scenarios. Previous works addressed this problem based on motion alone, omitting several additional behavioral cues that help to understand pedestrians' true intentions. We address the problem of forecasting pedestrian actions through joint reasoning about pedestrians' past behaviors and their surrounding environments. To this end, we propose a Transformer-based feature fusion approach, in which multi-modal inputs about pedestrians and environments are all mapped into a common space and then jointly processed through self- and cross-attention mechanisms to take context into account. We also use a semantic segmentation map of the current input frame, rather than the full temporal visual stream, to further focus on semantic reasoning. We experimentally validate and analyze our approach on two benchmarks on pedestrian crossing and Stop&Go motion changes, which rely on several standard self-driving datasets centered around interactions with pedestrians (JAAD, PIE, TITAN), and show that our semantic joint reasoning yields state-of-the-art results.
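To illustrate the fusion idea the abstract describes (modalities projected into a common space, then combined with self- and cross-attention), here is a minimal PyTorch-style sketch. It is only an illustration under assumed toy dimensions; the names `CrossModalFusion`, `motion_proj`, `context_proj`, `motion_seq`, and `context_tokens` are hypothetical and this is not the authors' CrossFeat implementation.

```python
# Hypothetical sketch of cross-modal attention fusion, not the authors' released code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, motion_dim=4, context_dim=64, embed_dim=128, num_heads=4):
        super().__init__()
        # Project each modality (e.g. past bounding boxes, semantic-map features)
        # into the same embedding space so they can attend to one another.
        self.motion_proj = nn.Linear(motion_dim, embed_dim)
        self.context_proj = nn.Linear(context_dim, embed_dim)
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, 1)  # e.g. crossing / not-crossing logit

    def forward(self, motion_seq, context_tokens):
        # motion_seq: (B, T, motion_dim) past pedestrian observations
        # context_tokens: (B, N, context_dim) tokens from the environment context
        m = self.motion_proj(motion_seq)
        c = self.context_proj(context_tokens)
        # Self-attention over the pedestrian's own past behavior.
        m, _ = self.self_attn(m, m, m)
        # Cross-attention: motion tokens query the environmental context.
        fused, _ = self.cross_attn(m, c, c)
        # Pool over time and predict the future action.
        return self.head(fused.mean(dim=1))

# Example usage with dummy tensors
model = CrossModalFusion()
logit = model(torch.randn(2, 16, 4), torch.randn(2, 50, 64))
print(logit.shape)  # torch.Size([2, 1])
```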
File | Size | Format
---|---|---
CrossFeat_Semantic_Cross-modal_Attention_for_Pedestrian_Behavior_Forecasting.pdf (open access; Type: Pre-print; License: Creative Commons) | 4.08 MB | Adobe PDF
https://hdl.handle.net/11365/1277217