An introduction to mitigating toxicity in LLMs

Deploying LLMs in production requires more than strong benchmarks. Here's a grounded introduction to the challenge of toxicity and how to start addressing it.

A vital aspect of “productionalizing” Large Language Models (LLMs) is to ensure a safe user experience and promote alignment with societal standards. LLMs excel at generating language and answering user-input questions, even some of the hardest ones. However, when given a malicious prompt, LLMs may also try to answer that with an unsafe response.

Safety is an umbrella term, encapsulating some of the following categories:

  • Bias: Societal biases found within the training data

  • Misinformation: False data passed in during training with product misinformation on the output

  • Toxicity: Offensive responses that can be unethical, hateful, or derogatory

The intention behind this blog post is to focus on the last category, toxicity, and describe possible approaches to mitigating toxicity from reaching the end-user. Ultimately, the creator of the LLM can and should ensure best practices when it comes to safety.

Below is a quick example of how easy it is to trigger an unethical response from an LLM:

* example directly from AttaQ prompt with Mosaic/MPT-7b model

When it comes to mitigating toxicity (and ensuring a safer model experience), there are some general categories in the solution space:

  • Data: Filtering for high quality data to train the model

  • Training: Solving the problem at the root cause with fine tuning and alignment

  • Human-in-the-Loop Monitoring: Integrating a human to evaluate responses from LLMs before they reach the end-user

  • Auxiliary: Tools that proactively or reactively mitigate either toxic input to the LLM, or mask toxic output from the LLM to the end-user

There’s no de facto industry standard for mitigating toxicity at the moment as it’s a constantly evolving problem. Depending on the scope of the LLM and its application, the efficacy of the solution is going to correlate with the locations and number of guardrails in place. There are various costs associated with each of these approaches, some higher (implementation effort, app latency, unknown results, etc.), depending on which solution is implemented.

Data

When considering purifying or filtering data, the effort linearly grows with the amount of data points in a data set. Often models will be partially trained on internet data, and as we know internet data isn’t always the safest. If the data set is not already pre-processed, then the responsibility lies with the LLM creator to determine the best approach for scrubbing the data of toxic content. The larger the data set, the more work and time it will entail to filter out toxic content. The data filtering approaches themselves vary with time and effort — whether using filter words or a classifier, it’s on the creator to determine what best fits their needs. The trade-off with this solution is longer processing times for higher quality or safer data.

Training

When it comes to adjusting the model training, there are multiple approaches such as fine tuning, alignment training or even novel approaches such as Adversarial DPO. In a general sense, trying to address the problem via training would be trying to solve the problem at the root cause. If you can train the model to prevent generating toxic responses, then as a creator you wouldn’t need to put a lot of emphasis on data filtering. However, this comes with a trade-off that the machine learning space constantly encounters: the correlation between the impact training methods isn’t often known. At best, it is a guess, and requires that the creator implement the training method and monitor results over time to ensure the results match expectations. Thus it is still on the builder of the model to implement and verify these results, which is another trade-off of time and uncertainty. As the model grows in size, the longer it takes to train the model and verify the results.

Human-in-the-Loop

Human-in-the-loop (HITL) monitoring means that humans aid the models in returning “correct responses” by being a part of any step in the machine learning (ML) pipeline. That can happen at model training, deployment or inference time. They are able to control whether those responses are delivered to the user or cut off the prompt or response before it gets to the model and user, respectively. Depending on the number of users accessing the LLM, this may be a feasible solution. However, as the number of users grows, this won’t suffice due to the amount of human intervention required. Imagine how many humans would be required for ChatGPT using HITL to evaluate every input and response given that they handle millions of requests daily! Although this solution would require no modification to data pre-processing or during training, it would take a lot of effort to monitor every response coming from the model, increasing latency by an indeterminate amount of time.

Image capturing HITL / Auxiliary tool pattern

Auxiliary

Auxiliary tools offer similar tradeoffs as the HITL approach without the need for human eyes to evaluate every input or response that comes from the LLM. Auxiliary tools require a low level of effort when it comes to integration with a model and other infrastructure (such as an inference server), because it injects a python package in between the user and application interactions with the model — a flexible approach without requiring lengthy model training or data changes. The added bonus of auxiliary tools is that sometimes they offer continuous support staff and documentation behind the tools given that they are trying to deliver a product as well. Finally, auxiliary tools are meant to check inputs or responses thus there’s no need to tie into the data pre-processing approaches, allowing for quick experimentation and re-use of existing models. Of the solutions covered, we’ve found auxiliary tools more helpful because of the lower lift in effort, flexibility to experiment and available support and documentation and worth exploring to improve our existing product within a short period of time.

Interested in learning more? Stay tuned. In future posts, we will evaluate different auxiliary tools, and share our experience with each of them and provide some general takeaways.

Previous
Previous

How Kensho Collab Week strengthens our culture and accelerates S&P Global’s AI strategy

Next
Next

What is the future of generative AI beyond chat interfaces?