Text processing

What information can be found in the text?

Business operations involve processing large numbers of documents, emails, and other text material, including external data such as news and information feeds. The text may be in languages other than Icelandic. Miðeind's artificial intelligence makes it possible to categorize and process text content automatically in various ways and for diverse purposes.

Sentiment Analysis

Using AI, it is possible to analyze various forms of content, such as emails, customer service chat logs, chatbot inputs, or other text, to determine whether they are positive or negative. The result can be expressed as a score on a scale, for example from -1.0 to +1.0. Based on these scores, emails and chat messages can be prioritized so that the most urgent matters are handled first and routed to the appropriate recipients. The same measurements can also be useful for assessing customer satisfaction.
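As a rough illustration of score-based prioritization, the following Python sketch sorts messages so that the most negative ones come first. The `Message` class, the `route` function, the queue names, and the -0.5 threshold are hypothetical and not part of any Miðeind API; the scores are assumed to have been produced by a sentiment model beforehand.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical incoming message with a precomputed sentiment score."""
    sender: str
    text: str
    sentiment: float  # assumed range: -1.0 (very negative) to +1.0 (very positive)


def prioritize(messages: list[Message]) -> list[Message]:
    """Return the messages ordered from most negative to most positive."""
    for m in messages:
        if not -1.0 <= m.sentiment <= 1.0:
            raise ValueError(f"sentiment out of range: {m.sentiment}")
    return sorted(messages, key=lambda m: m.sentiment)


def route(message: Message, urgent_below: float = -0.5) -> str:
    """Pick a (hypothetical) queue name; the threshold is an arbitrary assumption."""
    return "escalation" if message.sentiment < urgent_below else "standard"


inbox = [
    Message("anna@example.com", "Thanks, that solved it!", 0.8),
    Message("jon@example.com", "My card has been blocked for three days.", -0.9),
    Message("sara@example.com", "Could you confirm my appointment?", 0.1),
]
queue = prioritize(inbox)
for m in queue:
    print(f"{m.sentiment:+.1f}  {route(m):<10}  {m.sender}")
```

In practice the threshold and the number of queues would be tuned to the volume and type of messages being handled.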

Text Categorization

By analyzing content, it's possible to automatically determine which category a specific document or query belongs to, such as consumer loans, foreign transactions, corporate consulting, or others, thus simplifying document processing. Additionally, the language of the text can be identified.

Named Entity Recognition

Names of individuals, companies, and locations can be automatically identified within text. This method allows for the addition of context-specific metadata to documents, such as which legal entities, identification numbers, personal names, or addresses appear in a document, regardless of grammatical case. Similarly, personally identifiable information like names, identification numbers, and addresses can be automatically removed from documents through a process known as anonymization.
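The simplest part of anonymization, removing identification numbers, can be sketched with a regular expression. The example below is only an approximation: it assumes the number is written as six digits, an optional hyphen, and four digits, it does no checksum validation, and the placeholder text is made up. Finding personal names and addresses in any grammatical case requires a trained named entity recognizer and is not covered here.

```python
import re

# Assumed surface form of an identification number: DDMMYY-NNNN or DDMMYYNNNN.
# This is a pattern match only; no date or checksum validation is performed.
ID_NUMBER = re.compile(r"\b\d{6}-?\d{4}\b")


def redact_id_numbers(text: str, placeholder: str = "[ID NUMBER]") -> str:
    """Replace anything that looks like an identification number."""
    return ID_NUMBER.sub(placeholder, text)


# The numbers below are made up for illustration.
sample = "Customer 123456-7890 called about account 1234567890."
redacted = redact_id_numbers(sample)
print(redacted)
```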

Text Search and Question Answering

Miðeind's artificial intelligence finds the needle in the haystack, quickly delivering the documents most likely to contain the answer and highlighting where the answer appears in each one. Users receive a response to a naturally worded query, along with information about where to find it in context for further clarification.

This functionality can benefit customers when searching a website or interacting with chatbots, assist customer service representatives in answering customer inquiries, or aid in internal data processing for companies and institutions.

Text Simplification

It's crucial to communicate information to customers and other stakeholders in the most understandable way possible. Additionally, when addressing children or other groups with limited language proficiency, it's important to tailor the text to their level of comprehension.

AI makes it easy for anyone to "translate" text from a more complex form into a simpler language style (known as text simplification). Miðeind has developed such a "translation engine" for Icelandic. This technology is especially beneficial to those who, for various reasons, struggle with more complex texts, such as individuals with dyslexia.

Text Generation

A significant portion of content generated within companies follows standardized formats. With the help of language models, it's possible to automatically fill out or complete responses to inquiries, automated system emails, forms, contracts, and similar documents, either partially or entirely. This ensures that the generated text is grammatically correct and contextually appropriate, both in terms of content and language.

Miðeind's artificial intelligence can be used in two ways: firstly, to automatically fill in text within templates, and secondly, to generate text from scratch using neural networks based on key information provided. For example, fully or semi-standardized contracts can be automatically generated or filled out, often requiring only a final human review.
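The first of these two approaches, filling in text within templates, can be sketched with Python's standard `string.Template`. The template wording and the field names are invented for illustration. Note that plain substitution does not inflect the inserted words, which matters in Icelandic, where names and noun phrases must appear in the correct case.

```python
from string import Template

# Hypothetical confirmation letter; field names are illustrative assumptions.
CONFIRMATION = Template(
    "Dear $customer_name,\n"
    "Your $product application was received on $date.\n"
    "Reference: $reference"
)


def fill(template: Template, fields: dict[str, str]) -> str:
    """Fill a template; substitute() raises KeyError if a field is missing."""
    return template.substitute(fields)


letter = fill(
    CONFIRMATION,
    {
        "customer_name": "Anna",
        "product": "consumer loan",
        "date": "2 September 2019",
        "reference": "A-1024",
    },
)
print(letter)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing field stops the process instead of producing a document with an unfilled placeholder.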

What Type of Analyzer (Icelandic: greinir) is that?

Greynir is Miðeind's natural language processing engine. It is designed to work with Icelandic text, handling its complex grammar, including inflections, compound words, and flexible word order. Developed in Python 3, Greynir is compatible with all major operating systems.

Greynir is open-source software, available under the MIT license.

Greynir

Language, both spoken and written, is our primary means of communication as humans, and it in turn shapes how we think. Enabling computers to communicate with us in natural language has long been a sought-after goal. This means that computers should be able to "understand" written and spoken human language and respond with speech, or at least with properly formatted text.

To work with text in computers, various software tools are needed. Continuous text needs to be divided into sentences; words, numbers, dates, punctuation marks, and other tokens need to be separated; each word needs to be looked up to determine its part of speech and inflectional forms; and the context and position of words in the sentence need to be analyzed to get a picture of what is being said. This allows the computer to discover what is being asked, requested, or stated.
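The first two steps, splitting text into sentences and separating tokens, can be sketched with a few naive rules in Python. This is not Greynir's tokenizer: the two regular expressions are simplifying assumptions that fail on abbreviations, dates, and many other cases, which is why a dedicated tokenizer is needed for real text.

```python
import re

# Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A token is a number (optionally with . or , between digits), a word,
# or a single punctuation character.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def split_sentences(text: str) -> list[str]:
    """Split running text into sentences using the naive rule above."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Separate words, numbers, and punctuation marks."""
    return TOKEN.findall(sentence)


text = "Hún keypti 3 bækur. Kostaði það mikið?"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The later steps described above, looking up each word's part of speech and inflectional forms and analyzing the sentence structure, require a lexicon and a grammar and are beyond what a short sketch can show.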

The Greynir parser includes all the main software components needed to work with written Icelandic. It divides text into sentences and tokens, and looks up word forms in the Database of Modern Icelandic Inflections (DMII), which is included in the package. It then uses full constituency parsing to draw up sentence trees that describe the internal structure and composition of the sentences. Once the sentence trees are available, questions, statements, commands, or other information embedded in the text can be extracted.

The voice app Embla is a good example of what can be done based on Greynir. Embla uses Greynir to recognize and correctly understand questions posed in Icelandic. Embla also uses Greynir to ensure that answers are grammatically correct, for example, that noun phrases (such as names of bus stops) are in the correct cases.

Greynir can be used in projects related to various types of information retrieval from text, search engines, text statistics and origin analysis, text review for grammar, language use and style, chatbots, query systems, voice interfaces, sentiment analysis, and more. It can also be used in preparing language corpora used to train deep neural networks.

Greynir's source code and documentation can be accessed on GitHub. Greynir uses the Tokenizer for Icelandic, which is also open-source software from Miðeind. Examples of Greynir's use in a system that queries and reads news can be seen on the website greynir.is.

The technology behind Greynir is discussed in more detail in the paper "A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System" by Vilhjálmur Þorsteinsson, Hulda Óladóttir and Hrafn Loftsson (Proceedings of Recent Advances in Natural Language Processing, pp. 1397–1404, Varna, Bulgaria, Sep 2–4, 2019).