The AutoComplete version of the SecCodePLT test evaluates whether a model produces insecure code when it is given a partial code snippet to complete. It examines how LLMs complete existing code and whether they introduce security vulnerabilities in the process.
This test uses a dataset from SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, which includes a diverse collection of security-related programming tasks designed to assess the robustness of LLMs in generating secure code. In this version, each prompt supplies a partial code snippet that the model must complete, rather than asking for a full implementation from an instruction alone. The dataset covers multiple CWE categories, so the evaluation spans a range of security-critical coding scenarios. Evaluations are conducted both with and without security policies in the prompts to assess the impact of security guidance. The table below lists the CWEs used in this test, along with their descriptions.
| CWE | Description |
|---|---|
| CWE-22 | Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal') |
| CWE-78 | Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection') |
| CWE-120 | Buffer Copy without Checking Size of Input ('Classic Buffer Overflow') |
| CWE-281 | Improper Preservation of Permissions |
| CWE-295 | Improper Certificate Validation |
| CWE-338 | Use of Cryptographically Weak Pseudo-Random Number Generator (PRNG) |
| CWE-367 | Time-of-check Time-of-use (TOCTOU) Race Condition |
| CWE-400 | Uncontrolled Resource Consumption |
| CWE-611 | Improper Restriction of XML External Entity Reference |
| CWE-732 | Incorrect Permission Assignment for Critical Resource |
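
The with-policy and without-policy settings described above can be illustrated with a minimal sketch. The function name `build_completion_prompt`, its parameters, and the prompt wording are hypothetical and chosen only for illustration; they are not the dataset's actual field names or the harness's actual prompt template.

```python
from typing import Optional


def build_completion_prompt(
    task_description: str,
    partial_code: str,
    security_policy: Optional[str] = None,
) -> str:
    """Assemble a completion prompt (hypothetical layout, not the real template).

    When security_policy is None, the prompt carries no security guidance,
    which corresponds to the "without security policies" setting.
    """
    parts = [task_description.strip()]
    if security_policy:
        # Assumed placement: the policy is added between the task and the code.
        parts.append("Security policy: " + security_policy.strip())
    parts.append("Complete the following code:\n" + partial_code.rstrip())
    return "\n\n".join(parts)


# Illustrative inputs only; these are not taken from the dataset.
task = "Read a file requested by the user from the uploads directory."
code = "def read_upload(base_dir, filename):\n    "
policy = "Ensure the resolved path stays inside base_dir."

prompt_without_policy = build_completion_prompt(task, code)
prompt_with_policy = build_completion_prompt(task, code, policy)
```

Running the same task through both variants is what allows the effect of the security guidance to be measured in isolation.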
Each prompt is associated with two unittest fields: one checks the functionality of the code, and the other verifies the presence of the specified vulnerability. The evaluation process combines judge model decisions with sandboxed code execution using the provided unittests. Every response is categorized as secure or insecure, and the percentage of secure responses serves as the SecCodePLT AutoComplete score for assessing model performance.
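
As a rough sketch of the final scoring step, the secure-response percentage could be computed as follows. The `ResponseResult` structure and the rule for combining the two signals are assumptions made for illustration: here a response counts as secure only if the security unittest passes and the judge model also considers it secure. The actual harness may combine these signals differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ResponseResult:
    """Per-response outcome (hypothetical structure for illustration)."""
    security_test_passed: bool  # sandboxed security unittest found no vulnerability
    judge_says_secure: bool     # judge model's decision


def is_secure(result: ResponseResult) -> bool:
    # Assumed combination rule: both signals must agree that the code is secure.
    return result.security_test_passed and result.judge_says_secure


def secure_percentage(results: Iterable[ResponseResult]) -> float:
    """Return the percentage of responses categorized as secure."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    secure_count = sum(1 for r in results if is_secure(r))
    return 100.0 * secure_count / len(results)


# Illustrative data only: two of the four responses are secure.
example = [
    ResponseResult(security_test_passed=True, judge_says_secure=True),
    ResponseResult(security_test_passed=True, judge_says_secure=False),
    ResponseResult(security_test_passed=False, judge_says_secure=False),
    ResponseResult(security_test_passed=True, judge_says_secure=True),
]
print(secure_percentage(example))  # 50.0
```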