Miðeind and Language Technology for 10 Years: From Rule-Based Systems to Artificial Intelligence

Miðeind is celebrating its tenth anniversary. To mark the occasion, we are publishing a series of articles that look back on the journey and review the progress made in language technology for Icelandic over the past decade.

Language technology is about enabling people to communicate with computers in human languages, and creating various tools that help us work with languages. Here at Miðeind, we have always been guided by the principle of creating and offering solutions to the public that reflect the best available language technology at any given time. To fulfill this goal, we have made it a habit to regularly review our software development methods, which often leads us back to the drawing board, discarding older methods and models for new ones, or radically changing the foundation on which our solutions are built.

Language Technology in Daily Life

Most people are by now familiar with traditional language technology tools such as Google Translate, Púki, Skrambi, or any of the countless chatbots that have taken root on Icelandic websites. Language technology can also touch our lives without us realizing it. When we send a message or email to our bank, it may first pass through an analysis that assesses how urgent it is, and the result of that analysis affects how quickly we receive a response.

Today's language technology is almost synonymous with artificial intelligence, but it wasn't very long ago that language technology relied almost entirely on rule-based systems, i.e., code that describes step by step the actions that the computer should perform. It's interesting to examine the evolution of language technology from rule-based systems to artificial intelligence in the context of the various proofreading tools that have been available to Icelandic language users over time.

Greynir Analyzes Sentences

The first language technology software developed at Miðeind, in 2015, is called Greynir and is an example of a rule-based system. Greynir uses an enormous bank of rules to analyze Icelandic text into so-called sentence trees, i.e., a systematic representation of how sentences break down into phrases and what role each word plays in the sentence. All in all, the innermost core of Greynir consists of about 7,000 lines of code (more specifically, grammatical rules in a certain standardized form) which, as you can imagine, took quite a long time to write. Once Greynir was ready, we were able to build on top of it, using it to find specific sentence structures in texts and correct errors in them, for example dative substitution, where a subject is put in the dative although the verb calls for another case.

Yfirlestur: A New Generation of Grammar Checking for Icelandic

From this emerged the website Yfirlestur (Proofreading), which was revolutionary in the sense that it was the first software capable of correcting Icelandic grammar to any significant degree. Skrambi, mentioned above, belongs to the same category as Púki, the proofreading function in Microsoft Office, and similar tools: they primarily fix spelling errors and take grammatical context into account only to a limited extent. These tools basically work by looking up each word in a word list that contains both the dictionary forms and the inflected forms of a large set of Icelandic words. If a word is not found in the list, it is marked as an error and another word form is suggested instead; otherwise, it is left untouched.

Suggestions for corrections in such systems are based on, among other things, rules about how frequently certain letters are confused. Thus, the system marks the word firir and most likely suggests fyrir (for) instead, rather than fipir or firrir, which are also possible word forms in Icelandic. Here, swapping i and y is given higher priority than swapping one consonant for another, because y-errors are more common among Icelandic language users than many other kinds of spelling error.

Such tools can correct errors like:

Pétur skaut ifir markið (Peter shot over the goal)

...but do not have the basis to fix a sentence like:

Jón er komin heim (John has come home)

or

Vinkonurnar brynja og lóa búa hlið við hlið (The friends Brynja and Lóa live side by side)

As long as the word form exists and is reasonably common, it is allowed to stand.
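The lookup-and-suggest behavior described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation of Púki, Skrambi, or any other product: the tiny word list, the cost values, and all function names are assumptions made for the example.

```python
# Toy word list (an assumption); a real tool would use a full lexicon
# containing the inflected forms of a large set of Icelandic words.
LEXICON = {
    "Pétur", "skaut", "yfir", "markið", "Jón", "er",
    "komin", "kominn", "heim", "fyrir", "fipir", "firrir",
}

# Hypothetical costs: confusing i and y is cheap, any other edit costs 1.0.
CHEAP_SWAPS = {("i", "y"), ("y", "i"), ("í", "ý"), ("ý", "í")}


def substitution_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return 0.2 if (a, b) in CHEAP_SWAPS else 1.0


def weighted_distance(source: str, target: str) -> float:
    """Edit distance with weighted substitutions; insert/delete cost 1.0."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [float(i)]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete s
                curr[j - 1] + 1.0,                      # insert t
                prev[j - 1] + substitution_cost(s, t),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def suggest(word: str):
    """Return None for known words, else the cheapest known word form."""
    if word in LEXICON:
        return None
    return min(sorted(LEXICON), key=lambda cand: weighted_distance(word, cand))


def check(sentence: str):
    """List (unknown word, suggestion) pairs; known words are never flagged."""
    return [(w, suggest(w)) for w in sentence.split() if w not in LEXICON]
```

With this toy word list, `check("Pétur skaut ifir markið")` flags ifir and proposes yfir, while `check("Jón er komin heim")` reports nothing, because komin is a valid word form on its own. The checker never looks at the neighboring words, which is exactly the limitation described above.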

Yfirlestur, on the other hand, was based on sentence analysis, as mentioned before. It could therefore use databases such as the Database of Modern Icelandic Inflection (BÍN) to look up word forms, get information about gender and number, and use them to analyze whether an adjective is correctly inflected and spelled in relation to the noun it accompanies, or whether a subject is in the correct case in relation to the predicate. Yfirlestur therefore easily handled the aforementioned sentence:

Jón er komin heim > Jón er kominn heim (John has come home)
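The idea of checking agreement against gender information can be illustrated with a small sketch. It is not how Yfirlestur or Greynir is implemented: the two dictionaries below are hypothetical stand-ins for an inflection database such as BÍN, and the subject is found by position in the sentence, whereas a real system would read it off the sentence tree.

```python
# Hypothetical mini-lexicon standing in for an inflection database.
NOUN_GENDER = {"Jón": "masc", "Brynja": "fem", "barnið": "neut"}

# Nominative singular forms of one past participle, keyed by gender.
PARTICIPLE_FORMS = {
    "kominn": {"masc": "kominn", "fem": "komin", "neut": "komið"},
}

# Reverse index: any listed form -> the lemma it belongs to.
FORM_TO_LEMMA = {
    form: lemma
    for lemma, forms in PARTICIPLE_FORMS.items()
    for form in forms.values()
}


def correct_agreement(sentence: str) -> str:
    """Fix participle agreement in the toy pattern 'SUBJECT er PARTICIPLE ...'."""
    tokens = sentence.split()
    if len(tokens) >= 3 and tokens[1] == "er":
        gender = NOUN_GENDER.get(tokens[0])
        lemma = FORM_TO_LEMMA.get(tokens[2])
        if gender and lemma:
            # Replace the participle with the form matching the subject's gender.
            tokens[2] = PARTICIPLE_FORMS[lemma][gender]
    return " ".join(tokens)
```

Here `correct_agreement("Jón er komin heim")` returns "Jón er kominn heim", while "Brynja er komin heim" is left unchanged because the feminine form already agrees with the subject.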

However, Yfirlestur could only correct sentences that Greynir managed to parse. Parsing could fail, for example, on sentences with very complex structure, on constructions that span punctuation marks, or on text that contained too many errors:

Meirihluti stuðningsfólks Samfylkingarinnar, Pírata, Viðreisnar, Vinstri grænna og Sósíalistaflokk Íslands töldu hælisúthlutanir hins vegar vera ónóg. (The majority of supporters of the Social Democratic Alliance, Pirates, Reform, Left-Greens, and the Socialist Party of Iceland, however, considered asylum allocations to be insufficient.)

Stelpurnar fara í leikhús á morgun. Þeim hefur lengi hlakkað til. (The girls are going to the theater tomorrow. They have long been looking forward to it.)

Málfríður: Artificial Intelligence Takes Control

Language technology has advanced tremendously in recent years, and only three years passed between Miðeind's release of Yfirlestur (2020) and the next generation of correction tools for Icelandic (2023). This new technology is entirely based on artificial intelligence, or more specifically deep neural networks, instead of the old rule-based systems. This product is now called Málfríður and is accessible on Málstaður, Miðeind's language processing platform.

The artificial intelligence model that Málfríður is based on has not seen any grammar rules. Instead, it has been fed a substantial amount of parallel text: error-ridden text on the one hand and the corresponding correct text on the other. This way, the model has learned to "translate" text with errors into language that conforms to the most common language standard. This new approach to proofreading has several advantages. Málfríður can grasp and "understand" a much larger context than Yfirlestur and correct errors that are difficult to write rules for, e.g., errors based on word meanings. It can also handle very difficult texts; for example, it is trained on data from people with dyslexia and people who have a mother tongue other than Icelandic.
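To make the notion of parallel text concrete, the sketch below shows what such training pairs might look like and how the difference between the two sides can be inspected. It does not train a model and does not reflect Miðeind's actual data format; the pair list and the helper `token_edits` are assumptions for illustration, using example sentences from this article.

```python
from difflib import SequenceMatcher

# (text with errors, corrected text) pairs, taken from the examples above.
TRAINING_PAIRS = [
    ("Pétur skaut ifir markið", "Pétur skaut yfir markið"),
    ("Jón er komin heim", "Jón er kominn heim"),
]


def token_edits(source: str, target: str):
    """Return (before, after) token spans that differ between the two texts."""
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits
```

Calling `token_edits` on each pair shows which tokens the correct side changes. A neural model is never given these edits explicitly; it has to infer them from many such pairs.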

As seen in this screenshot from Málfríður, it easily handles all the errors that were enumerated above:

Challenges of Artificial Intelligence

The disadvantage of a neural network model like Málfríður, as with most artificial intelligence, is that we don't have full control over the output. When we notice undesirable behavior from Málfríður, i.e., if it fails to correct something it should correct, or changes something that was already correct, we can't simply write an additional rule to fix it. We need to collect training data that demonstrates the correct behavior, train the model on it, and then hope for the best. At Miðeind, we are constantly retraining and improving Málfríður.

Interested readers are directed to the article "How does Málfríður work?", which delves deeply into the training process, from data collection to fine-tuning, and describes how neural networks function.

Vision for the Future

In this article, the spotlight has been particularly focused on proofreading tools, but the development has been similar regarding other language technology. All of Miðeind's main products, such as the speech recognition system Hreimur and the translation engine Erlendur, are now based on artificial intelligence instead of older rule-based systems. It is, after all, the company's goal and purpose to strengthen the position of our language with forward-thinking solutions that reflect the best that language technology has to offer at any given time.

Many are now looking to large language models as the next revolution in the development of language analysis tools. Models such as GPT from OpenAI and Claude from Anthropic have already proven themselves as powerful assistants for all kinds of writing in English. However, these models' proficiency in Icelandic grammar is not yet sufficient to compete with a specialized tool like Málfríður, which uses Miðeind's tailor-made language and correction model for Icelandic. Furthermore, Málfríður has the advantage over large language models in that it is much faster and cheaper to operate, as the Málfríður model is tiny in comparison to the largest language models today. It is likely only a matter of time before this changes, and when it does, Miðeind will, as always, have its finger on the pulse.
