(2025.9.1) MCQFormatBench dataset released

Multiple-Choice Questions (MCQs) are frequently used to evaluate Large Language Models (LLMs). We have released MCQFormatBench, an evaluation dataset designed to assess a model's robustness to the MCQ format. We categorized the process of answering an MCQ into four types and designed eight tasks accordingly. By converting 600 questions sampled from the MMLU dataset for each task, we built a dataset of approximately 20,000 questions. For details, please see here.
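As a rough illustration of this kind of format conversion, the sketch below turns a single MMLU-style item into two format variants: one with numeric option labels and one with shuffled choice order. The item structure, function names, and the two transforms are our assumptions for illustration only; the actual eight tasks are defined in the dataset release.

```python
# Hypothetical sketch (not the released MCQFormatBench code): convert one
# MMLU-style item into format variants to probe MCQ-format robustness.
import random

def to_prompt(question, choices, labels):
    """Render a question plus labelled choices as a single prompt string."""
    lines = [question] + [f"{l}. {c}" for l, c in zip(labels, choices)]
    return "\n".join(lines)

def relabel_variant(item, labels=("1", "2", "3", "4")):
    """Swap the usual A-D labels for numeric ones; answer index unchanged."""
    return {
        "prompt": to_prompt(item["question"], item["choices"], labels),
        "answer": labels[item["answer"]],
    }

def shuffle_variant(item, seed=0, labels=("A", "B", "C", "D")):
    """Permute the choice order and track where the gold answer moves."""
    rng = random.Random(seed)
    order = list(range(len(item["choices"])))
    rng.shuffle(order)
    choices = [item["choices"][i] for i in order]
    return {
        "prompt": to_prompt(item["question"], choices, labels),
        "answer": labels[order.index(item["answer"])],
    }

item = {
    "question": "Which planet is known as the Red Planet?",
    "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
    "answer": 1,  # index of the gold choice
}
print(relabel_variant(item)["prompt"])
print(shuffle_variant(item)["answer"])
```

Applying several such transforms to each of the 600 sampled questions is how a per-task conversion of this kind can multiply a small seed set into a much larger benchmark.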