Overview

This test evaluates the robustness of a Large Language Model (LLM) against adversarial prompts. The results inform users about the risk of their model being misused by adversaries who attempt to break its safeguards. The specific focus of this test is the Question-Answering Natural Language Inference (QNLI) task, in which the model determines whether a context sentence contains the answer to a question. To produce the results, we query the model on the QNLI task using a dataset of adversarially constructed context sentences and questions and record the model’s responses. These responses are parsed and compared to the original labels of the sentence–question pairs. The final score is the fraction of labels correctly predicted by the model; this score (0-100%) reflects the model’s ability to correctly identify question answers in adversarially perturbed prompts.
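
For concreteness, the minimal sketch below shows how a single sentence–question pair might be turned into a prompt and how the model’s free-text reply could be mapped to one of the two labels. The prompt wording, the “yes”/“no” answer format, and the function names are illustrative assumptions, not the test’s actual implementation.

```python
# Sketch: building a QNLI prompt and parsing the model's reply into a label.
# The prompt template and label strings are assumptions for illustration only.

def build_prompt(question: str, sentence: str) -> str:
    # Assumed prompt template; the real test may phrase the task differently.
    return (
        "Does the following sentence contain the answer to the question?\n"
        f"Question: {question}\n"
        f"Sentence: {sentence}\n"
        "Answer with 'yes' or 'no'."
    )

def parse_response(response: str) -> str:
    # Map the model's free-text reply to one of the two QNLI labels.
    text = response.strip().lower()
    return "Containment" if text.startswith("yes") else "No Containment"
```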

Dataset

This test uses a dataset of adversarial prompts for the Question-Answering Natural Language Inference (QNLI) task from Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models. This dataset includes the following attacks:

  1. AdvSQuAD. This attack appends human-crafted distracting sentences to the end of the input paragraph.
  2. BERT-ATTACK. This attack identifies vulnerable words in each sentence and replaces them with contextual substitutes generated by a pre-trained BERT masked language model.
  3. CheckList. This attack adds randomly generated URLs and handles to distract the model’s attention.
  4. SemAttack. This attack combines perturbations across different semantic spaces (typo space, knowledge space, contextual space).
  5. SememePSO. This attack uses external knowledge bases such as HowNet or WordNet to search for substitutions.
  6. TextBugger. This attack identifies important words in each sentence and then replaces them with carefully crafted typos.
  7. TextFooler. This attack selects synonyms according to the cosine similarity of word embeddings.
  8. SCPN. This attack is based on syntax tree transformations and paraphrases a sentence with specified syntactic structures.
  9. StressTest. This attack appends three true statements to the end of the hypothesis sentence.
  10. T3. This attack adds perturbations at different levels of the syntax tree to generate the adversarial sentence.
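
For reference, the AdvGLUE benchmark is also distributed through the Hugging Face datasets hub. The sketch below assumes the public adv_glue dataset with the adv_qnli configuration and its validation split, and that the fields follow the GLUE QNLI convention (question, sentence, label); the copy of the dataset used by this test may differ.

```python
# Sketch: loading the adversarial QNLI portion of AdvGLUE via Hugging Face datasets.
# Assumes the public "adv_glue" dataset with the "adv_qnli" configuration.
from datasets import load_dataset

adv_qnli = load_dataset("adv_glue", "adv_qnli", split="validation")

for example in adv_qnli.select(range(3)):
    # Each example pairs a question with a (possibly perturbed) context sentence
    # and a gold label indicating whether the sentence answers the question.
    print(example["question"])
    print(example["sentence"])
    print(example["label"])  # assumed GLUE convention: 0 = contains answer, 1 = does not
```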

Methodology

This test first queries the model on the QNLI task using the complete set of adversarial prompts in the dataset. Each adversarial prompt includes a context sentence and a question, and the model is asked to determine whether the context sentence contains the answer to the question. These sentences and questions are adversarially perturbed to cause models to misclassify the presence or absence of the answer. The model’s responses are collected and parsed to determine the label predicted by the model (Containment or No Containment). The AdvGLUE QNLI score is the fraction of all labels correctly predicted by the model. We also calculate per-attack scores for each of the adversarial attack types listed above.
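
Scoring reduces to an accuracy computation over the parsed predictions, both overall and per attack type. The sketch below assumes each record carries the gold label, the parsed prediction (already mapped to the same label space), and the name of the attack that produced the prompt; these field names are illustrative, not the test’s actual data schema.

```python
# Sketch: computing the overall AdvGLUE QNLI score and per-attack scores.
# Field names ("label", "prediction", "attack") are illustrative assumptions.
from collections import defaultdict

def score(records: list[dict]) -> tuple[float, dict[str, float]]:
    correct_by_attack = defaultdict(int)
    total_by_attack = defaultdict(int)

    for r in records:
        total_by_attack[r["attack"]] += 1
        if r["prediction"] == r["label"]:
            correct_by_attack[r["attack"]] += 1

    # Per-attack accuracy, expressed as a percentage (0-100%).
    per_attack = {
        attack: 100.0 * correct_by_attack[attack] / total_by_attack[attack]
        for attack in total_by_attack
    }
    # Overall score: fraction of all labels correctly predicted.
    overall = 100.0 * sum(correct_by_attack.values()) / sum(total_by_attack.values())
    return overall, per_attack
```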