Marchetti, F., Mordan, T., Becattini, F., Seidenari, L., Bimbo, A.D., Alahi, A. (2024). CrossFeat: Semantic Cross-modal Attention for Pedestrian Behavior Forecasting. IEEE Transactions on Intelligent Vehicles, 1-10 [10.1109/tiv.2024.3449046].
CrossFeat: Semantic Cross-modal Attention for Pedestrian Behavior Forecasting
Becattini, Federico;
2024-01-01
Abstract
Forecasting pedestrian behaviors is essential for autonomous vehicles to ensure safety in urban scenarios. Previous works addressed this problem based on motion alone, omitting several additional behavioral cues that help to understand pedestrians' true intentions. We address the problem of forecasting pedestrian actions through joint reasoning about pedestrians' past behaviors and their surrounding environments. To this end, we propose a Transformer-based feature fusion approach, in which multi-modal inputs about pedestrians and environments are all mapped into a common space and then jointly processed through self- and cross-attention mechanisms to take context into account. We also use a semantic segmentation map of the current input frame, rather than the full temporal visual stream, to further focus on semantic reasoning. We experimentally validate and analyze our approach on two benchmarks on pedestrian crossing and Stop&Go motion changes, which rely on several standard self-driving datasets centered around interactions with pedestrians (JAAD, PIE, TITAN), and show that our semantic joint reasoning yields state-of-the-art results.
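To illustrate the fusion idea the abstract describes (modalities projected into a common space, then combined with self- and cross-attention), here is a minimal PyTorch-style sketch. It is only an illustration under assumed toy dimensions; the names `CrossModalFusion`, `motion_proj`, `context_proj`, `motion_seq`, and `context_tokens` are hypothetical and this is not the authors' CrossFeat implementation.

```python
# Hypothetical sketch of cross-modal attention fusion, not the authors' released code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, motion_dim=4, context_dim=64, embed_dim=128, num_heads=4):
        super().__init__()
        # Project each modality (e.g. past bounding boxes, semantic-map features)
        # into the same embedding space so they can attend to one another.
        self.motion_proj = nn.Linear(motion_dim, embed_dim)
        self.context_proj = nn.Linear(context_dim, embed_dim)
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, 1)  # e.g. crossing / not-crossing logit

    def forward(self, motion_seq, context_tokens):
        # motion_seq: (B, T, motion_dim) past pedestrian observations
        # context_tokens: (B, N, context_dim) tokens from the environment context
        m = self.motion_proj(motion_seq)
        c = self.context_proj(context_tokens)
        # Self-attention over the pedestrian's own past behavior.
        m, _ = self.self_attn(m, m, m)
        # Cross-attention: motion tokens query the environmental context.
        fused, _ = self.cross_attn(m, c, c)
        # Pool over time and predict the future action.
        return self.head(fused.mean(dim=1))

# Example usage with dummy tensors
model = CrossModalFusion()
logit = model(torch.randn(2, 16, 4), torch.randn(2, 50, 64))
print(logit.shape)  # torch.Size([2, 1])
```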
File | Size | Format
---|---|---
CrossFeat_Semantic_Cross-modal_Attention_for_Pedestrian_Behavior_Forecasting.pdf (open access; Type: Pre-print; License: Creative Commons) | 4.08 MB | Adobe PDF
https://hdl.handle.net/11365/1277217