Paper Overview
Evaluating aligned large language models’ (LLMs) ability to recognize and reject unsafe user requests is
crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations
that we address with 🥺SORRY-Bench, our proposed benchmark to systematically evaluate LLM safety
refusal behaviors.
📊First, we point out prior datasets are often built upon course-grained and varied safety categories, and that they are overrepresenting certain fine-grained categories.
For example,
SimpleSafetyTest includes broad categories like “Illegal Items” in their taxonomy, while
MaliciousInstruct use more fine-grained subcategories like “Theft” and “Illegal Drug Use”. Meanwhile, both of them fail to capture certain risky topics, e.g., “Legal Advice” or “Political Campaigning”, which are adopted in some other work (e.g.,
HEx-PHI).
Moreover, we find these prior datasets are often
🚨imbalanced and result in over-representation of some fine-grained categories. As illustrated in the Figure below, as a whole, these prior datasets tend to skew towards certain safety categories (e.g., “Fraud”, “Sexual Explicit Content”, and “Social Stereotypes”) with “Self-Harm” being nearly 3x less represented than these categories. However, these other underrepresented categories (e.g., “Personal Identifiable Information Violations”, “Self-Harm”, and “Animal-related Crimes”) cannot be overlooked – failure to evaluate and ensure model safety in these categories can lead to outcomes as severe as those in the more prevalent categories.
Imbalanced data point distribution of 10 prior datasets (§2.2) on our 44-class taxonomy.
To bridge this gap, we present a
fine-grained 44-class safety taxonomy (shown below; see §2.2 of our paper) across 4 high-level domains. We curate this taxonomy to unify the disparate taxonomies from prior work, employing a human-in-the-loop procedure for refinement – where we map data points from previous datasets to our taxonomy and iteratively identify any uncovered safety categories. Our resultant taxonomy captures diverse topics that could lead to potentially unsafe LLM responses, and allows stakeholders to evaluate LLM safety refusal on any of these risky topics at a more granular level.
On top of this 44-class taxonomy, we craft a
class-balanced LLM safety refusal evaluation dataset (§2.3). Our base dataset consists of
440 unsafe instructions in total, with additional manually created novel data points to ensure equal coverage across the 44 safety categories (10 per category).
Taxonomy of SORRY-Bench.
💬Second, we ensure balance not just over topics, but over linguistic characteristics. Existing safety evaluations often fail to capture different formatting and linguistic features in user inputs. For example, all unsafe prompts from
AdvBench are phrased as
imperative instructions, whereas
Bianchi et al. (2024) note that unsafe instructions phrased in interrogative questions can lead to discrepant safety performance of LLMs. Not explicitly considering these linguistic characteristics and formatting can result in
🚨over-representation (of a given writing style, language, dialect, etc.), too.
Demonstration of 20 linguistic mutations we apply to our base dataset.
We address this by considering 20 diverse linguistic mutations that real-world users might apply to phrase their unsafe prompts (Figure above; see §2.4 of our paper). These include rephrasing our dataset according to different writing styles (e.g., interrogative questions, misspellings, slang) and persuasion techniques (e.g., logical appeal), or transforming the unsafe instructions with encoding and encryption strategies (e.g., ASCII and Caesar) and into multi-languages (e.g., Tamil, French). After paraphrasing each unsafe instruction (written in imperative instruction style) of our base SORRY-Bench dataset via these mutations, we obtain 8.8K additional unsafe instructions.
🤖Third, we investigate what design choices make a fast and accurate safety benchmark evaluator, a trade-off that prior work has not so systematically examined. To benchmark safety behaviors, we need an efficient and accurate evaluator to decide whether a LLM response is in fulfillment or refusal of each unsafe instruction from our SORRY-Bench dataset. By far, a common practice is to leverage LLMs themselves for automating such safety evaluations. With many different implementations of LLMs-as-a-judge, there has not been a large-scale systematic study of which design choices are better, in terms of the tradeoff between efficiency and accuracy. We collect a large-scale human safety judgment dataset (§3.2) of over 7K annotations, and conduct a thorough meta-evaluation (§3.3) of different safety evaluators on top of it.
Meta-evaluation results of different LLM judge design choices on SORRY-Bench.
As shown, above, our finding suggests that small (7B) LLMs, when fine-tuned on sufficient human annotations, can achieve 🎯satisfactory accuracy (over 80% human agreement), comparable with and even surpassing larger scale LLMs (e.g., GPT-4o). Adopting these fine-tuned small-scale LLMs as the safety refusal evaluator comes at a ⚡️low computational cost, only ~10s per evaluation pass on a single A100 GPU. This further enables our massive evaluation (§4) on SORRY-Bench, which necessitates hundreds of evaluation passes, in a scalable manner.
Putting these together, we evaluate
over 50 proprietary and open-source LLMs on SORRY-Bench (as shown in our
demo), analyzing their distinctive refusal behaviors. We hope our effort provides a building block for systematic evaluations of
LLMs’ safety refusal capabilities, in a balanced, granular, and efficient way.