PersianMCQ-Instruct: A Comprehensive Resource for Generating Multiple-Choice Questions in Persian

Kamyar Zeinalipour; Neda Jamshidi; F. Akbari; Marco Maggini; Monica Bianchini; Marco Gori
2025-01-01

Abstract

We present PersianMCQ-Instruct, a comprehensive resource that includes a dataset and advanced models for generating multiple-choice questions (MCQs) in standard Iranian Persian, a low-resource language spoken by over 80 million people. This resource features three state-of-the-art models for Persian MCQ generation: PMCQ-Gemma2-9b, PMCQ-Llama3.1-8b, and PMCQ-Mistral-7B. Inspired by the Agent Instruct framework and GPT-4o, we created the dataset by curating over 4,000 unique Persian Wikipedia pages, resulting in three MCQs per page and a total of over 12,000 questions. To ensure the quality of this dataset, we conducted human evaluations and model fine-tuning, both of which demonstrated significant performance improvements in Persian MCQ generation. The dataset and models are publicly available, offering valuable tools for researchers and educators, with particular benefits for advancing Persian-language educational technology.
Year: 2025
ISBN: 979-8-89176-215-2
Zeinalipour, K., Jamshidi, N., Akbari, F., Maggini, M., Bianchini, M., Gori, M. (2025). PersianMCQ-Instruct: A Comprehensive Resource for Generating Multiple-Choice Questions in Persian. In Proceedings of the First Workshop on Language Models for Low-Resource Languages (pp. 344-372). Association for Computational Linguistics.
Files in this item:
File: 2025.loreslm-1.27.pdf
Access: open access
Type: Publisher's PDF
License: PUBLIC - Public with Copyright
Size: 872.81 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/1288034