Marchetti, F., Mordan, T., Becattini, F., Seidenari, L., Bimbo, A.D., Alahi, A. (2024). CrossFeat: Semantic Cross-modal Attention for Pedestrian Behavior Forecasting. IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, 1-10 [10.1109/tiv.2024.3449046].

CrossFeat: Semantic Cross-modal Attention for Pedestrian Behavior Forecasting

Becattini, Federico
2024-01-01

Abstract

Forecasting pedestrian behaviors is essential for autonomous vehicles to ensure safety in urban scenarios. Previous works addressed this problem based on motion alone, omitting several additional behavioral cues that help in understanding pedestrians' true intentions. We address the problem of forecasting pedestrian actions through joint reasoning about pedestrians' past behaviors and their surrounding environments. To this end, we propose a Transformer-based feature fusion approach in which multi-modal inputs about pedestrians and environments are all mapped into a common space, then jointly processed through self- and cross-attention mechanisms to take context into account. We also use a semantic segmentation map of the current input frame, rather than the full temporal visual stream, to further focus on semantic reasoning. We experimentally validate and analyze our approach on two benchmarks, pedestrian crossing and Stop&Go motion changes, which rely on several standard self-driving datasets centered on interactions with pedestrians (JAAD, PIE, TITAN), and show that our semantic joint reasoning yields state-of-the-art results.
Files in this item:
File: CrossFeat_Semantic_Cross-modal_Attention_for_Pedestrian_Behavior_Forecasting.pdf
Access: Open access
Type: Pre-print
License: Creative Commons
Size: 4.08 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1277217