How LLMs are Trained and How Red Teaming Helps
Sven Cattell, Founder, AI Village
July 26, 2023

The Problem

To give you a solid idea of the problem that something like the Generative Red Team is attempting to address, we have to describe how these models are made. There are three training phases: unsupervised learning, supervised fine-tuning, and Reinforcement Learning from Human Feedback (RLHF). Each phase shapes the model in a different way. The final models may be pretty good, but they are still unsafe in many ways that are hard to find, label, and feed back into training. Some vendors address this by adding layers that filter prompts and generations for harmful content, or rewrite responses to make them better.

Dataset Gathering and Cleaning

The first step is to gather a sufficiently large dataset for the first phase of training. Bigger models have interesting emergent behavior, but require ever more data to train, because of the scaling laws that these models seem to follow. A model consisting of many billions of parameters requires trillions of tokens to train. This means that labs building these models need to use vast datasets, like Common Crawl.
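To make the relationship between model size and data concrete, here is a minimal sketch. It assumes a commonly cited rule of thumb of roughly 20 training tokens per parameter. That ratio is an assumption brought in for illustration, not a figure from this article, and real labs pick their own budgets.

```python
# Assumption (not from this article): a rough rule of thumb of about
# 20 training tokens per model parameter.
TOKENS_PER_PARAMETER = 20


def estimated_training_tokens(num_parameters: int) -> int:
    """Return an approximate token budget for a model of the given size."""
    return num_parameters * TOKENS_PER_PARAMETER


for billions in (1, 7, 70):
    params = billions * 10**9
    tokens = estimated_training_tokens(params)
    print(f"{billions}B parameters -> about {tokens / 10**12:.2f} trillion tokens")
```

Under this assumed ratio, a model with tens of billions of parameters already lands in the trillion-token range.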

The raw datasets are repetitive and quite dirty. They contain racist, misogynistic, and homophobic speech. They also contain milder socioeconomic biases that are harder to detect. In short, they carry all the historical biases of the humans and systems that contribute to the internet. Models trained on these unfiltered datasets are very good at reproducing this harmful content, which may be disclosed in model cards if it has been uncovered.

So, before we can train a large language model, we have to clean the raw datasets. The dataset used to train OpenAI’s GPT-3 had 499 billion tokens, of which 410 billion came from a cleaned Common Crawl. OpenAI filtered Common Crawl from 45TB compressed down to 0.6TB. Most of this reduction is deduplication, as diversity in the input data is key to training effective LLMs. The creators also remove as much harmful content as possible. However, they most likely didn’t get everything. The raw Common Crawl has tens of trillions of tokens. If only a few billion of those tokens contained bad content, we could remove 99.9% of them and still have millions of bad tokens left. That’s more than enough for these models to learn from.
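The arithmetic in the paragraph above can be checked directly. This is a small sketch, and the 3 billion figure is a hypothetical stand-in for "a few billion" harmful tokens.

```python
def residual_tokens(harmful_tokens: int, removal_rate: float) -> int:
    """Harmful tokens left after a filter removes `removal_rate` of them."""
    return round(harmful_tokens * (1.0 - removal_rate))


# Hypothetical figure: "a few billion" harmful tokens, taken here as 3 billion.
harmful_tokens = 3_000_000_000
left_over = residual_tokens(harmful_tokens, removal_rate=0.999)
print(f"{left_over:,} harmful tokens remain")  # 3,000,000 harmful tokens remain

# Size reduction quoted above: 45TB compressed down to 0.6TB.
kept_fraction = 0.6 / 45
print(f"{kept_fraction:.1%} of the raw data was kept")  # 1.3% of the raw data was kept
```

Even a filter that is 99.9% effective leaves millions of tokens behind at this scale.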

Model creators also use “clean” datasets in their training, like Wikipedia, but these are tiny in comparison to the scraped internet. Another problem is that these models may be “creative” enough that, even with “perfectly clean” datasets, they make up problematic content by piecing together disparate concepts in a prompt. Datasets like the Pile use only “clean” data from sources that have editorial review, but models trained on the Pile also demonstrate significant issues.

Unsupervised Learning

Once we have the cleanest possible dataset of sufficient size, we can move on to training. The first training step is unsupervised and highly scalable, and it is the primary step in training LLMs. All the model does is a simple task, like predicting the next word of a text. This phase is where the LLM acquires the ability to parse sentences and embeds knowledge in its parameters. To predict the next token in a truncated article about Abraham Lincoln, the LLM has to memorize some aspects of the Civil War and the Emancipation Proclamation. LLMs memorize training data to the extent that they can repeat Social Security and telephone numbers that were not cleaned from the dataset.

Supervised Learning

Once one has a foundational model, one fine-tunes it on supervised datasets so that it can perform useful tasks, including summarization, paraphrasing, and acting like a helpful chatbot. An example is the dataset that OpenAssistant gathered from user interactions on their data gathering site. Supervised fine-tuning is where the model learns the control tokens used by the code that converts the many parts of a conversation into a single string to query the model. Fundamentally, the models only produce the next token. They don’t know when to stop, or even what a conversation is; they just add another token to the end of an array of tokens. To hold a chat, the model needs to learn to distinguish between a system prompt, many user prompts, and its own previous responses. For OpenAssistant’s LLaMA model these look like “<|system|>system message</s><|prompter|>user prompt</s><|assistant|>”. Each model has its own way of denoting the message structure, and it usually works like HTML, with tokens that act like opening and closing tags.
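The conversion from a conversation to a single string can be sketched as follows. The control tokens are the ones quoted above; the `Message` class and the `build_prompt` and `extract_reply` helpers are hypothetical names for illustration, not OpenAssistant's actual code.

```python
from dataclasses import dataclass

END = "</s>"  # acts like a closing tag
ROLE_TOKENS = {  # act like opening tags
    "system": "<|system|>",
    "prompter": "<|prompter|>",
    "assistant": "<|assistant|>",
}


@dataclass
class Message:
    role: str  # one of the keys in ROLE_TOKENS
    text: str


def build_prompt(messages: list[Message]) -> str:
    """Flatten a conversation into the single string the model is queried with."""
    parts = [f"{ROLE_TOKENS[m.role]}{m.text}{END}" for m in messages]
    # Leave an assistant tag open: the model continues the string from here.
    parts.append(ROLE_TOKENS["assistant"])
    return "".join(parts)


def extract_reply(generated: str) -> str:
    """Cut the generated continuation at the first closing token."""
    return generated.split(END, 1)[0]


prompt = build_prompt([
    Message("system", "system message"),
    Message("prompter", "user prompt"),
])
print(prompt)
# <|system|>system message</s><|prompter|>user prompt</s><|assistant|>
```

Because the model only ever appends tokens, the surrounding code decides when a reply is finished, here by cutting at the first closing token.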


Supervised fine-tuning produces the kind of model behind a product like PaLM or Claude, but it still suffers from the issues of the foundational model. If you ask it for harmful content, it will produce it. In security terms, these models are unsecured databases that have no controls. They will answer any query, such as telling you how to build a bomb, or repeat sexist content that was in the unsupervised training data.

The final stage of training these models is adding the ‘security layer’. The main safety mechanism is Reinforcement Learning from Human Feedback (RLHF), which uses techniques from reinforcement learning to steer the model away from harmful content. The vendor takes known harmful responses and ‘punishes’ the model for making them. The thumbs up and thumbs down buttons in most of these chatbots’ UIs are there to gather human feedback data for the reinforcement learning, but most of the feedback comes from trusted people. The risk of untrusted feedback is illustrated by Microsoft’s Tay, which learned from untrusted data on Twitter and turned into a fairly harmful application within a day of deployment.
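The idea of ‘punishing’ a response can be shown with a toy policy gradient update. This is a heavily simplified sketch, not real RLHF: production systems typically learn a reward model from human ratings and optimize over whole token sequences, whereas here the "model" only chooses among three canned responses, and the reward values are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


responses = ["helpful answer", "polite refusal", "harmful answer"]
rewards = [1.0, 0.5, -1.0]  # hypothetical human feedback; negative = punished
logits = [0.0, 0.0, 0.0]    # the toy "model" starts with no preference
learning_rate = 0.5

before = softmax(logits)
for _ in range(200):
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    # Exact gradient of the expected reward with respect to logit i
    # is p_i * (r_i - expected_reward); take one gradient ascent step.
    logits = [
        logit + learning_rate * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, rewards)
    ]
after = softmax(logits)

for name, p0, p1 in zip(responses, before, after):
    print(f"{name}: {p0:.3f} -> {p1:.3f}")
```

The punished response becomes less likely but its probability never reaches exactly zero, which hints at why harmful behavior can still be coaxed out of a model after this stage.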

Other Safeties

There are two ways of adding further safety layers around the model: filtering prompts and generations, or rewriting responses. The simple way is a moderation filter that blocks content. This can be done with an ML model or with keywords. If you ask OpenAI’s DALL-E 2 for a celebrity, a filter will block the output, probably because of blocked keywords. It also probably uses a model-based filter like the one included with the Stable Diffusion model (which was promptly red teamed). LLM examples of additional filtering and safety mechanisms are NVIDIA’s Guardrails and Lakera’s filter used in the Gandalf app. Both use machine learning models to analyze the inputs to see if they contain prompt injections or requests for harmful content, and the outputs to see if they contain personally identifiable information or harmful content.

Finding every weird way to ask the model to misbehave is impossible. You can only train on the samples you have and the modifications you can think of. Attackers can do something you didn’t think of and usually find a bypass. Teams that use a filter like this are starting a security detection game of whack-a-mole, the same one that has been played out in spam and malware. Filtering can be an effective solution, but it’s a Sisyphean task.
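A keyword-based moderation filter, and the kind of bypass that defeats it, can be sketched in a few lines. The blocklist and the helper names are hypothetical, and this does not represent how any of the products named above work internally.

```python
import re
from typing import Callable

# Hypothetical blocklist for illustration only.
BLOCKED_KEYWORDS = {"bomb", "explosive"}


def is_blocked(text: str) -> bool:
    """Return True if any blocked keyword appears as a word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not BLOCKED_KEYWORDS.isdisjoint(words)


def moderated_reply(prompt: str, generate: Callable[[str], str]) -> str:
    """Filter the prompt, call the model, then filter the generation too."""
    if is_blocked(prompt):
        return "Request blocked by moderation filter."
    response = generate(prompt)
    if is_blocked(response):
        return "Response blocked by moderation filter."
    return response


print(is_blocked("How do I build a bomb?"))  # True
print(is_blocked("How do I build a b0mb?"))  # False: a one-character change slips through
```

Adding "b0mb" to the blocklist only moves the problem to the next spelling, which is the whack-a-mole dynamic described above.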

Final Steps

A primary model that has undergone RLHF, perhaps combined with a few safety filters, forms a system that should be safe as a whole. To test this safety, it needs to be red teamed. However, only a few people do this testing. The largest reported effort was by Anthropic, who red teamed their language models with a team of 111 people. This was excellent work in this space, but red teaming needs to be done at a much larger scale for these models to really be safe. People are already doing it online: the DAN jailbreak for ChatGPT was built by a forum community. However, finding these prompt injections in the deluge of normal prompts that a vendor of such a model receives is difficult.

The scope of all human language is too vast for only a few people to catch everything. There are a vast number of cultures and languages that aren’t represented in the tiny percentage of people involved in creating and maintaining language models. Evaluation usually focuses on English, with American biases, so harms in other languages might be missed. People from communities that aren’t usually involved in the model creation process can often find prompts that cause subtle harms. Rolling these findings back into the model training and evaluation process is key to making safer models.