
Language Models: A Comprehensive Overview

This article is based on lectures and presentations given by Miðeind staff in recent months. It touches on a range of topics, but the main focus is on explaining what large language models are and how they can be put to use. Large language models like GPT-4 and image models like Midjourney have transformed people's expectations of artificial intelligence since they emerged. They offer tremendous potential for simplifying workflows, but they also bring new challenges.

What Are Language Models?

Neural network models can be trained on various types of input, such as text, images, sound, or a combination of these. Here we mainly discuss language models trained on text. In short, language models are mathematical models that take in text and output a probability distribution over a language's vocabulary. In recent years, a few types of language models, known as foundation models, have become established. They are trained to continue text or fill in blanks, as shown in examples (1) and (2):

(1) The Minister of Finance presented the budget bill to <?> (the model predicts the next word)

(2) The Minister of Finance presented <?> to Parliament yesterday. (the model fills in the blank)

To solve such tasks at all, the model needs a basic grasp of the language, both its meaning and its syntax. To solve them well, the model also needs a certain amount of world knowledge. These abilities are not programmed into the model; it has to learn them from a vast number of examples.
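To make this concrete, here is a minimal sketch of next-word prediction as in example (1). It uses the Hugging Face transformers library, with GPT-2 standing in for any generative foundation model; the model choice and the top-5 printout are purely illustrative:

```python
# A minimal sketch of next-word prediction, as in example (1).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The Minister of Finance presented the budget bill to"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence, vocabulary)

# The probability distribution over the vocabulary for the next word.
probabilities = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probabilities, k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {score.item():.3f}")
```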

Sequence-to-sequence models are another type of foundation model. They are trained to learn a mapping from input to output, as shown in example (3):

(3) <is>Sólin mun skína á morgun. (input) <en>The sun will shine tomorrow. (output)

The pairs can be sentences in different languages for translation, original and corrected text for grammar correction, or an image and appropriate text description. The most important thing is that the same information is present in both input and output; otherwise, the task is poorly defined and the model's behavior is unpredictable.
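As a sketch of how such a mapping is used in practice, the snippet below runs an off-the-shelf Icelandic-to-English translation model through the transformers pipeline API. The Opus-MT model identifier is an assumption here; any compatible sequence-to-sequence translation model would do:

```python
# A sketch of a sequence-to-sequence mapping as in example (3):
# Icelandic in, English out. The model identifier is illustrative.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-is-en")
result = translator("Sólin mun skína á morgun.")
print(result[0]["translation_text"])  # expected: "The sun will shine tomorrow."
```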

When models become large enough and are trained on sufficient data, they begin to show the ability to solve tasks they were never explicitly trained on. Take a generative model, as in (1) above. If we give the model the text "Question: Who is the President of Iceland? Answer: ", the model might predict that the next word is "Halla". The ability to predict "correctly" increases with model size, data quantity, and data quality. We can also give the model more "examples", i.e., additional questions with answers (few-shot), instead of requiring it to solve the question-answering task without examples (zero-shot). Note that this is not training: the model's weights were fixed during the initial training, and the examples merely steer its predictions at prompting time.
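The difference between zero-shot and few-shot prompting is simply the text we hand the model; a minimal sketch, with illustrative questions and answers:

```python
# Zero-shot: the model must solve the task from the instruction alone.
zero_shot = "Question: Who is the President of Iceland? Answer:"

# Few-shot: a handful of worked examples steer the model's predictions.
# No weights change; this is prompting, not training.
few_shot = (
    "Question: What is the capital of Iceland? Answer: Reykjavík\n"
    "Question: What currency is used in Iceland? Answer: The Icelandic króna\n"
    "Question: Who is the President of Iceland? Answer:"
)
# Either string is then fed to a generative model, as in the sketch above.
```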

However, pre-training large models requires enormous computational power, even entire data centers. After this base training, the models can be fine-tuned, run, and hosted at much lower cost. Various techniques have been developed to shrink models so they can run on cheaper hardware. The danger is that models can lose capabilities and perform worse if the shrinking is too aggressive; it is a balancing act between cost and performance.
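One common shrinking technique is quantization, storing the weights at lower numerical precision. A sketch using the transformers and bitsandbytes libraries (8-bit here; more aggressive settings save more memory but risk degrading quality):

```python
# A sketch of shrinking a model via 8-bit quantization.
# Requires the bitsandbytes and accelerate packages; GPT-2 is a stand-in.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=config,  # weights stored in 8 bits instead of 32
    device_map="auto",           # place layers on available hardware
)
```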

There are several Icelandic language models, but none on the scale of GPT-4. IceBERT, a BERT model for Icelandic, can classify text and fill in blanks as in (2), but it cannot generate free text as in (1). mBART-enis is a sequence-to-sequence model like (3) and is the foundation of velthyding.is. The ByT5 sequence-to-sequence model is used for grammar correction, mapping text to corrected Icelandic, and is the foundation of Málfríður.
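A sketch of fill-in-the-blank with IceBERT, assuming the model is published on the Hugging Face hub under the identifier mideind/IceBERT and uses RoBERTa-style `<mask>` tokens:

```python
# Fill-in-the-blank as in example (2), with an Icelandic BERT model.
from transformers import pipeline

fill = pipeline("fill-mask", model="mideind/IceBERT")
# "The Minister of Finance presented <mask> to Parliament yesterday."
for prediction in fill("Fjármálaráðherra lagði <mask> fyrir Alþingi í gær."):
    print(prediction["token_str"], round(prediction["score"], 3))
```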

On Large Language Models

Most large language models are trained on text accessible on the internet and on curated datasets, which are mostly in English unless special emphasis is placed on other languages. The models are then usually fine-tuned to produce well-formed output. In the case of GPT-4, reinforcement learning from human feedback (RLHF) is used, which teaches the model to understand questions and tasks and to answer them correctly and well. During the development of GPT-4, OpenAI, in collaboration with Miðeind, experimented for the first time with RLHF training in a language other than English, namely Icelandic.
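At the heart of RLHF is a reward model trained on human preference judgments. The sketch below shows the standard preference loss as a conceptual illustration only; it is not OpenAI's actual training code:

```python
# A conceptual sketch of the preference-learning step in RLHF.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward of the response humans
    preferred above the reward of the response they rejected."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# The trained reward model then scores the language model's outputs,
# and policy optimization (e.g., PPO) nudges the model toward answers
# that humans would rate highly.
```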

The collaboration between Miðeind and OpenAI began following a visit by the President of Iceland and a delegation to OpenAI headquarters in May 2022. Among the participants was the founder of Miðeind, and at his initiative, discussions began between the two companies on how Icelandic could serve as a template for supporting smaller languages in large language models. The first phase of the collaboration involved teaching GPT-3 Icelandic through fine-tuning and assessing how much text data in a given language is needed to teach a large language model that language. OpenAI provided computing power and access to experts, while Miðeind provided text data and labor.

When preparations for GPT-4 began in the fall of 2022, OpenAI approached Miðeind to participate in the training with RLHF. Miðeind gathered a group of nearly 40 volunteers. They were tasked with creating questions and tasks in Icelandic for GPT-4, and then evaluating the model's responses, grading them, and teaching it to answer even better. The data was used in training GPT-4 and resulted in the model improving its understanding of questions and answering in Icelandic. Now the model answers almost exclusively in Icelandic, whereas before, answers in other languages would occasionally slip in. The model now understands Icelandic well but has more difficulty with generation, so the project is far from finished.

GPT-4 is just one model; numerous others have emerged in recent months, including BLOOM, LLaMA, OPT, GLM, Dolly-v2, and GPT-SW3, a large Scandinavian language model from AI Sweden that has been specifically trained on Icelandic. New models appear every week, so the list is by no means exhaustive. Iceland should not put all its eggs in one basket by relying on third parties who don't necessarily have the interests of Icelandic at heart. It is therefore important for Iceland to set a clear policy on artificial intelligence.

Why are large language models so interesting? Before their era, creating a model to solve a task was very time-consuming: data collection, data labeling, training (which requires specialized hardware), operation, maintenance, and deployment to users. The process could be long, expensive, and require many iterations. Large language models can solve numerous tasks without task-specific training or data collection, making most of these steps unnecessary. Instead, the task is described to the model, one or two examples of good solutions are given, and the model then tries its hand at real inputs. Large language models thus make many language technology solutions more accessible to people and businesses. They are not, however, the solution to every problem, so it is still necessary to assess whether a large language model is the right choice, or part of an overall solution.

Utilization of Large Language Models

Large language models are versatile, and we see innovative use cases every day. However, it's important to keep in mind what they can (and cannot) do. It helps to view the prompt as a form of programming: we are writing instructions for the model, so they must include all necessary information and constraints. Including examples of the intended output helps ensure we get what we think we're asking for. The models can also maintain context between queries, so it's possible to ask for another version of the answer if something is missing. The models can thus be viewed as diligent but (sometimes) hasty interns.
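A sketch of "prompting as programming" with context carried across turns, using the OpenAI Python client as one example interface; the model name, instructions, and follow-up request are illustrative:

```python
# Instructions, constraints, and follow-ups accumulate in the message
# list, so the model keeps the context of the whole conversation.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment
messages = [
    {"role": "system",
     "content": "You are a careful assistant. Answer in plain language, "
                "in at most three sentences."},
    {"role": "user",
     "content": "Summarize this regulation for a general audience: ..."},
]
reply = client.chat.completions.create(model="gpt-4", messages=messages)
messages.append({"role": "assistant",
                 "content": reply.choices[0].message.content})

# Ask for another version without restating the whole task:
messages.append({"role": "user",
                 "content": "Shorter, and add a one-line takeaway."})
reply = client.chat.completions.create(model="gpt-4", messages=messages)
print(reply.choices[0].message.content)
```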

The models have limitations that users need to be aware of. They are trained on data, but after training they learn nothing new and know nothing about later events. They are retrained periodically, but as mentioned before, this is extremely costly. Because they are trained to predict what comes next, they tend to invent facts (hallucinations) and are not reliable in complex reasoning. If they are to answer from a knowledge base, they must be strictly constrained to answer only from it. The models are also eager to please and trust users too much: if a user "corrects" a model with false information, it will often accept the correction as true.
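Constraining a model to a knowledge base is largely a matter of prompt construction: supply the passages and an explicit refusal rule. A minimal sketch, with a made-up passage and question:

```python
# A sketch of grounding: the model may only answer from the passages,
# and must say so when the answer is not there.
system_rule = (
    "Answer ONLY from the passages below. If the answer is not in the "
    "passages, reply: 'I cannot find this in the knowledge base.'"
)
passages = "Opening hours: weekdays 09:00-16:00. Closed on public holidays."
question = "Is the office open on Saturdays?"

prompt = f"{system_rule}\n\nPassages:\n{passages}\n\nQuestion: {question}"
print(prompt)  # passed to the model, e.g. as in the chat sketch above
```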

Humans are flawed, and our prejudices show in what we write, even unconsciously. Models are trained on human-produced content, so they inherit the biases embedded in the texts. Model bias is a very active research topic, as there is much to gain in this area, but no definitive solutions have emerged yet.

Generally, language models are used for text analysis and processing, question answering, and as general-purpose assistants. They can write text in different styles, in different languages, and on various subjects. By connecting a model to a database, questions can be answered by searching for related content in the database; this saves employee time and spares users from navigating a jungle of detailed web pages. Language models are also useful in internal document processing, for example creating summaries of long regulations to make complex material easier to survey. Organizations can prepare for adopting language models now by examining what data they have available.
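Connecting a model to a database typically means retrieving the most relevant documents first and handing them to the model. A sketch using the sentence-transformers library; the encoder name and documents are illustrative:

```python
# A sketch of retrieval for database-backed question answering:
# embed the documents and the question, pick the nearest document,
# and place it in the prompt as in the grounding sketch above.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Regulation 123/2020 sets workplace safety requirements.",
    "Regulation 456/2021 sets rules for food labelling.",
]
doc_embeddings = encoder.encode(documents, convert_to_tensor=True)

question = "Which regulation deals with labelling food?"
question_embedding = encoder.encode(question, convert_to_tensor=True)

scores = util.cos_sim(question_embedding, doc_embeddings)[0]
best_document = documents[int(scores.argmax())]
print(best_document)  # goes into the prompt alongside the question
```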

Ethical Considerations

To ensure responsible use of language models, it's necessary to think things through before diving in.

Bias must be kept in mind so that the model doesn't make decisions that amplify biases in its training data. For instance, there are well-known cases of artificial intelligence being used to screen job applications, where the model was trained on past applications and hiring decisions containing strong biases, including gender bias.

The European Union has gone the furthest in formulating policies on the use of artificial intelligence. They are considering requiring parties to disclose when a machine makes decisions about people's affairs. People would then have the right to appeal and request that a human confirm the model's conclusions. Potential uses of artificial intelligence are also classified into risk categories, with a proposal to completely ban the most risky category.

Various issues also relate to the training process and the data used. The main focus is on transparency about the origin of training data: it should be clear what data a model was trained on. Data protection also matters, both when selecting training data and in where user data ends up when models are used. Image models trained on large online image collections have caused fierce debate: artists have protested the use of their work to train models that are then meant to do their jobs, without any compensation. Similar disputes have arisen over the right to train on texts collected from the internet. This article won't resolve these issues, but it's necessary to be aware of the different opinions and the legislation that follows from them.

Conclusion

Artificial intelligence will affect the jobs of many, and this development has already begun. Major societal changes are on the horizon, which can cause anxiety, but we must not forget the enormous benefits that this development brings. Artificial intelligence will be useful in countless areas that are difficult to foresee. Prototypes for controlling prosthetic limbs by voice ("Pick up the coffee cup on the table") have already emerged. The Be My Eyes app is a good example of accessibility support using artificial intelligence. Through the app, blind and visually impaired users can connect with sighted volunteers, show them their surroundings, and find out, for example, where they left their glasses. Many find it uncomfortable to show strangers their private lives and the dirty laundry on the floor, so a vision-enabled version of GPT-4 was experimentally connected to the app. The model receives the user's question together with image input and answers where the glasses ended up. The possibilities are great, but we must tread carefully and responsibly.
