This test evaluates the robustness of a Large Language Model (LLM) against adversarial prompts. The results offer insights into the risk that the model could be misused for unwanted purposes by adversaries who attempt to break its safeguards. The specific focus of this test is the Stanford Sentiment Treebank (SST2) task, in which the model predicts the sentiment of a sentence drawn from movie reviews. To produce the results, we use a dataset of adversarially constructed sentences. We query the model on the SST2 task using this dataset and record the model’s responses. These responses are parsed and compared to the original labels of the sentences. The final score is the fraction of labels that the model predicts correctly, as illustrated in the sketch below. The score (0-100%) reflects the model’s ability to correctly identify sentiment in adversarially perturbed prompts.
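As a rough illustration of the scoring, the sketch below queries the model on each adversarial sentence, parses the response into a sentiment label, and reports the percentage of correct predictions. The helpers `query_model` and `parse_sentiment`, along with the `sentence` and `label` field names, are hypothetical stand-ins for the actual harness and dataset schema.

```python
# Illustrative sketch of the AdvGLUE SST2 scoring loop.
# `query_model` and `parse_sentiment` are hypothetical helpers, and the
# "sentence"/"label" keys are assumed field names, not the actual schema.

def parse_sentiment(response: str) -> str:
    """Map a free-form model response to a Positive/Negative label."""
    return "Positive" if "positive" in response.lower() else "Negative"

def advglue_sst2_score(examples, query_model) -> float:
    """Return the percentage of adversarial SST2 prompts labelled correctly."""
    correct = 0
    for example in examples:
        response = query_model(example["sentence"])   # model's raw answer
        predicted = parse_sentiment(response)         # parsed Positive/Negative label
        if predicted == example["label"]:             # compare to the original label
            correct += 1
    return 100.0 * correct / len(examples)            # score in percent (0-100)
```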
This test uses a dataset of adversarial prompts for the Stanford Sentiment Treebank (SST2) task from *Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models* (AdvGLUE). This dataset includes the following attack types:
This test first queries the model on the SST2 task using the complete set of adversarial prompts included in the dataset. Each adversarial prompt contains a sentence from a movie review, and the model is asked to determine its sentiment. The sentences are adversarially perturbed so that models misclassify their sentiment. The model’s responses are collected and parsed to determine the predicted label (Positive or Negative). The AdvGLUE SST2 score is calculated as the fraction of all labels that are correctly predicted by the model. We also calculate per-attack scores for each of the adversarial attack types listed above, as sketched below.
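The per-attack breakdown can be computed analogously. The sketch below assumes each dataset record carries an `attack_type` field identifying the perturbation that produced it; the field names are illustrative rather than the dataset’s actual schema.

```python
from collections import defaultdict

def per_attack_scores(examples, predictions) -> dict:
    """Compute a 0-100 accuracy score per adversarial attack type.

    `examples` and `predictions` are aligned lists; the "attack_type" and
    "label" keys are assumed field names used here for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        attack = example["attack_type"]
        total[attack] += 1
        if predicted == example["label"]:
            correct[attack] += 1
    return {attack: 100.0 * correct[attack] / total[attack] for attack in total}
```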