Red teaming is a security practice that involves simulating an attack on a system or organization to identify vulnerabilities. The practice has its roots in the military, where it was used to test the defenses of fortifications and other military targets. Red teaming was first used in the corporate world in the 1990s, and it has since become a common practice for organizations of all sizes.
Red teaming can be conducted in a number of ways, but it traditionally involves a team of security experts who attempt to breach the security of a system or organization. The team may use a variety of methods, including social engineering, hacking, and physical penetration testing. The goal of red teaming is to identify vulnerabilities that could be exploited by attackers, and to recommend ways to mitigate them.
With the introduction of generative artificial intelligence (AI), the field is being redefined. At the Generative Red Team (GRT) Challenge held at DEF CON 31, we are experimenting with different ways to use the practice of red teaming to include a diversity of voices and to introduce the concept of ‘embedded harms’ to our challenge categories. Our challenge aligns with the Office of Science and Technology Policy’s (OSTP) AI Bill of Rights and the National Institute of Standards and Technology’s (NIST) AI Risk Management Framework.
Humane Intelligence is leading the design component of the challenge. Every week, we host a collaborative working group of design partners: red team experts from each of our model vendors (Anthropic, Cohere, Google DeepMind, Hugging Face, Meta, NVIDIA, OpenAI, Stability AI), our White House sponsors OSTP and NIST, our expert advisor MITRE, and our nonprofit advisors Taraaz, AVID, and Black Tech Street. Together, we have created a set of engaging, educational, and useful challenges. The goal of our work is to demonstrate that structured public feedback at scale is necessary to create beneficial AI systems.
Most people are familiar with the concept of ‘prompt hacks’: inputs crafted to trick a model into doing something it’s not supposed to do. In the real world, bad actors exist, and some people do try to trick models for inappropriate purposes. However, harmful outcomes can also happen inadvertently: a model can hallucinate an incorrect fact, say something discriminatory or inappropriate about a minority group, or make incorrect assumptions about someone based on their gender.
At the GRT Challenge, we are crafting two sets of challenges. One set is designed around prompt hacks, and the second around embedded harms. ‘Embedded harms’ arise through normal use rather than adversarial use, and include issues such as hallucinations, internal inconsistencies, multilingual vulnerabilities, erasure, bias, and more. We’ll be sharing our exact categories leading up to DEF CON.
In other words, to win our prompt hack challenges, you have to trick the model, but to win an embedded harms challenge, you have to recreate a normal user scenario that might lead to a bad outcome.
We’re most excited about learning from our participants. Our prior experience building the Twitter Bias Bounty and Bias Buccaneers showed us how creative people can be. We’re not sure what the GRT Challenge will bring, but we do know it’ll be unexpected. We look forward to seeing you at the DEF CON AI Village!