Miðeind monitors and measures the performance of the latest artificial intelligence models in the Icelandic language, as explained in more detail in another article on our website. We have developed and collected various benchmarks that examine, among other things, the models' proficiency in Icelandic grammar, reasoning in Icelandic, and their knowledge of facts about Icelandic history and culture. Until now, models from OpenAI and Anthropic have monopolized the top positions.
But now (as of late April 2025) a new leader has emerged on our list. Google Gemini 2.5 Pro takes the top spot from OpenAI's o1-preview, with a 2.3-point lead in average score, which is an unusually large difference between the top positions.
The top positions on Miðeind's latest leaderboard, which can be seen in full on the Hugging Face website.
The Gemini model performs well on, among other things, the noun/adjective agreement benchmark and the Belebele reading comprehension test. It also scores notably higher than its competitors (52.7%) on the WikiQA-IS test, which asks a variety of difficult factual questions about Icelandic society, history, and culture, based on the Icelandic Wikipedia.
Google is thus making a strong entry into the field of generative AI models, and it is pleasing to see that Icelandic is not being left out.
In general, these benchmarks are becoming too easy for the best AI models, with the exception of the WikiQA-IS test. In other words, the measurements are becoming saturated: most models score well above 90%, so the tests no longer detect significant differences between them. The flip side is that on some of the benchmarks, one might doubt whether a typical Icelandic speaker would outscore the language models. For example, it would be interesting to know whether human speakers would achieve over 90% on the noun/adjective agreement benchmark, as Google Gemini does, where rare nouns and adjectives are inflected together in all four cases, singular and plural.
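The saturation argument can be made concrete with a minimal sketch. The function names, the 90-point floor, the "more than half of the models" rule, and the example scores are all assumptions chosen for illustration; they are not Miðeind's method and not leaderboard results.

```python
def is_saturated(scores, floor=90.0, share=0.5):
    """Return True when more than `share` of the models score above `floor`.

    `scores` maps a model name to its score (0-100) on one benchmark.
    Both thresholds are illustrative assumptions, not an official definition.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    above = sum(1 for s in scores.values() if s > floor)
    return above / len(scores) > share


def top_spread(scores, k=3):
    """Gap in points between the best and the k-th best score.

    A small gap means the benchmark barely separates the leading models.
    """
    ranked = sorted(scores.values(), reverse=True)[:k]
    return ranked[0] - ranked[-1]


# Made-up placeholder numbers, NOT results from the leaderboard.
example = {"model_a": 96.0, "model_b": 95.5, "model_c": 94.0, "model_d": 71.0}

if __name__ == "__main__":
    print(is_saturated(example))  # most placeholder models clear the floor
    print(top_spread(example))    # gap between the three best placeholder scores
```

A benchmark that is flagged as saturated and also shows a small spread among the top models is a candidate for replacement by a harder test.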
Example from Miðeind's noun/adjective agreement benchmark: Inflect "framhvass lagarefur" ("front sharp-edged cunning lawyer", a borderline nonsensical phrase) in all cases, singular and plural.
The agreement test deliberately uses rare words and combinations, so that the models cannot simply reproduce training examples and texts they have already seen, but instead need to rely on their "language intuition" for Icelandic and their knowledge of the language's inflection rules and patterns.
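One way to picture the task is as eight slots per phrase (four cases, singular and plural), each holding one inflected form. The sketch below scores a model's answer by exact match per slot. The article does not describe how the benchmark is actually scored, so the per-slot exact-match rule, the slot labels, and the function name are assumptions for illustration only.

```python
from itertools import product

CASES = ("nominative", "accusative", "dative", "genitive")
NUMBERS = ("singular", "plural")
# Eight (number, case) slots per noun/adjective phrase.
SLOTS = tuple(product(NUMBERS, CASES))


def score_paradigm(gold, predicted):
    """Fraction of the eight slots where the predicted form matches the gold form.

    `gold` and `predicted` map (number, case) tuples to the inflected phrase.
    Matching ignores letter case and surrounding whitespace; a slot missing
    from `predicted` counts as wrong. This rule is an assumption.
    """
    missing = [slot for slot in SLOTS if slot not in gold]
    if missing:
        raise ValueError(f"gold paradigm lacks slots: {missing}")
    hits = sum(
        1
        for slot in SLOTS
        if predicted.get(slot, "").strip().lower() == gold[slot].strip().lower()
    )
    return hits / len(SLOTS)


if __name__ == "__main__":
    # Placeholder strings stand in for real inflected forms.
    gold = {slot: f"form-{i}" for i, slot in enumerate(SLOTS)}
    answer = dict(gold)
    answer[("plural", "dative")] = "wrong-form"  # one slot inflected incorrectly
    print(score_paradigm(gold, answer))
```

Scoring each slot separately gives partial credit when only some forms are wrong; an all-or-nothing rule per phrase would be a stricter alternative.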
The WikiQA-IS benchmark remains challenging for AI models, as many of the questions are quite tricky. However, every answer can be found in the Icelandic Wikipedia: both the questions and the answers were collected and created from it, with the help of artificial intelligence. Here's an example:
Hvaða íslenski matur er hefðbundinn á sprengidaginn?
Saltkjöt og baunir.
A typical question and answer from the WikiQA-IS benchmark test. Q: "What Icelandic food is traditional on Mardi Gras?"; A: "Salted meat and split pea soup."
Google Gemini 2.5 Pro is the only model that achieves over 50% correct answers on the WikiQA-IS benchmark (specifically 52.7%), with Claude Sonnet 3.7 and OpenAI o1-preview following with about 45%.
Reliable benchmarks for AI performance in Icelandic are important. They not only help Icelandic users choose the best models for their needs but also guide major technology developers in improving their models' Icelandic capabilities. After all, it's difficult to improve what can't be measured! Therefore, as AI technology advances, continuously developing updated and more challenging benchmark tests is crucial — and Miðeind remains committed to driving progress in this area.