(2025.9.1) MCQFormatBench dataset released

Multiple-Choice Questions (MCQs) are frequently used to evaluate Large Language Models (LLMs). We have released MCQFormatBench, an evaluation dataset designed to assess a model’s robustness to the MCQ format. We categorized the answer process for MCQs into four types and designed eight tasks. By converting 600 questions sampled from the MMLU dataset for each task,…
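As a rough illustration of what "robustness to the MCQ format" means, the sketch below reformats a single MMLU-style question into two presentation variants (lettered vs. numbered options). This is a hypothetical example, not the actual MCQFormatBench conversion pipeline or its task definitions; the question, the helper names, and the two variants are all assumptions for illustration.

```python
# Illustrative sketch only: NOT the actual MCQFormatBench pipeline.
# Shows how one MCQ can be presented in different formats, which is the
# kind of variation a format-robustness benchmark probes.

def format_lettered(question, choices):
    """Standard MCQ presentation: options labeled A-D."""
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def format_numbered(question, choices):
    """Format variant: the same options labeled 1-4 instead of letters."""
    lines = [question]
    for i, choice in enumerate(choices, start=1):
        lines.append(f"{i}. {choice}")
    lines.append("Answer with the number of the correct option.")
    return "\n".join(lines)

# Stand-in question for demonstration (not sampled from MMLU).
q = "What is the capital of France?"
opts = ["Berlin", "Madrid", "Paris", "Rome"]
print(format_lettered(q, opts))
print(format_numbered(q, opts))
```

A format-robust model should pick the same underlying answer ("Paris") regardless of which presentation it sees.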
