ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models

Zeinalipour K.; Maggini M.
2025-01-01

Abstract

Recent efforts in natural language processing (NLP) commonsense reasoning research have led to the development of numerous new datasets and benchmarks. However, these resources have predominantly been limited to English, leaving a gap in evaluating commonsense reasoning in other languages. In this paper, we introduce the ArabicSense Benchmark, which is designed to thoroughly evaluate the world-knowledge commonsense reasoning abilities of large language models (LLMs) in Arabic. This benchmark includes three main tasks: first, it tests whether a system can distinguish between natural language statements that make sense and those that do not; second, it requires a system to identify the most crucial reason why a nonsensical statement fails to make sense; and third, it involves generating explanations for why statements do not make sense. We evaluate several Arabic BERT-based models and causal LLMs on these tasks. Experimental results demonstrate improvements after fine-tuning on our dataset. For instance, AraBERT v2 achieved an 87% F1 score on the second task, while Gemma and Mistral-7b achieved F1 scores of 95.5% and 94.8%, respectively. For the generation task, LLaMA-3 achieved the best performance with a BERTScore F1 of 77.3%, closely followed by Mistral-7b at 77.1%. All code and the benchmark are publicly available.
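The abstract mentions fine-tuning Arabic BERT-based models for the classification tasks and scoring generated explanations with BERTScore F1. The following is a minimal illustrative sketch of that kind of setup, not the authors' released pipeline: it assumes the Hugging Face checkpoint aubmindlab/bert-base-arabertv2 for the sense/nonsense task, and the dataset field names and helper functions are hypothetical placeholders.

# Illustrative sketch only: a binary sense/nonsense classifier (Task A style)
# and BERTScore F1 scoring of generated explanations (Task C style).
# Model ID and helper names are assumptions, not the benchmark's actual code.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from bert_score import score as bertscore

model_name = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT v2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def classify_statement(statement: str) -> int:
    """Return 1 if the statement is predicted to make sense, 0 otherwise."""
    inputs = tokenizer(statement, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())

def explanation_bertscore_f1(generated: list[str], references: list[str]) -> float:
    """Mean BERTScore F1 between generated and reference explanations (Arabic)."""
    _, _, f1 = bertscore(generated, references, lang="ar")
    return f1.mean().item()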
Year: 2025
ISBN: 979-8-89176-220-6
Lamsiyah, S., Zeinalipour, K., El Amrany, S., Brust, M., Maggini, M., Bouvry, P., et al. (2025). ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models. In Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4) (pp. 1–11). Association for Computational Linguistics (ACL).
Files in this record:

File: 2025.wacl-1.pdf
Access: open access
Type: publisher's PDF
License: Creative Commons
Size: 458.39 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1292014