SecCodePLT provides a fine-grained dynamic evaluation of LLM-generated code, focusing specifically on Python-related security vulnerabilities. It evaluates insecure coding practices in both autocomplete (e.g., "complete the given code") and instruction-following (e.g., "write me a function") contexts. To assess the impact of security guidance, evaluations are conducted under two conditions: with and without explicit security guidance included in the prompt.
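As a rough illustration of this setup, the sketch below builds the two prompt variants for a single dataset item. The field names (`partial_code`, `task_description`, `security_policy`) and the exact prompt wording are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical sketch: assemble the two evaluation prompts for one item.
# Field names and prompt wording are illustrative, not SecCodePLT's schema.

def build_prompts(item: dict, mode: str = "instruction") -> dict[str, str]:
    """Return the prompt variants with and without security guidance."""
    if mode == "autocomplete":
        base = f"Complete the following Python code:\n\n{item['partial_code']}"
    else:  # instruction-following
        base = f"Write a Python function for this task:\n\n{item['task_description']}"

    with_guidance = base + f"\n\nFollow this security policy:\n{item['security_policy']}"
    return {"without_guidance": base, "with_guidance": with_guidance}
```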
Each dataset prompt includes a unittest field designed to verify the presence of the specified vulnerability. The evaluation process combines judge-model decisions with sandboxed execution of the generated code against the provided unittests. Every response is categorized as secure or insecure, and the percentage of secure responses serves as the primary metric of model performance. This dynamic testing approach ensures that vulnerabilities are detected during actual execution, providing a deeper and more practical understanding of LLM security risks.
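A minimal sketch of such a scoring loop is shown below, under several assumptions: each response is run against the item's vulnerability-checking unittest in a subprocess (a real sandbox would require stronger isolation), an unspecified judge model is cross-checked via a placeholder `judge_is_secure` call, and the convention that a passing unittest signals the vulnerability is present. Helper names and dataset fields (`unittest`, `cwe_description`) are illustrative, not SecCodePLT's actual API.

```python
# Hypothetical sketch of the secure-rate computation; not SecCodePLT's implementation.
import subprocess
import sys
import tempfile
from pathlib import Path

def unittest_flags_vulnerability(code: str, unittest_src: str, timeout: int = 30) -> bool:
    """Run the vulnerability-checking unittest against the generated code in a
    separate process. Assumed convention: exit code 0 means the check triggered,
    i.e., the vulnerability is present. A subprocess is not a true sandbox."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(code + "\n\n" + unittest_src)
        try:
            result = subprocess.run(
                [sys.executable, str(script)], capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return False  # treat hangs as inconclusive rather than vulnerable
        return result.returncode == 0

def judge_is_secure(code: str, cwe_description: str) -> bool:
    """Placeholder for the judge-model verdict; the real pipeline queries an LLM here."""
    raise NotImplementedError

def secure_rate(items: list[dict], responses: list[str]) -> float:
    """Primary metric: percentage of responses labeled secure."""
    secure = 0
    for item, code in zip(items, responses):
        vulnerable = unittest_flags_vulnerability(code, item["unittest"])
        # Count a response as secure only if both the dynamic check and the judge agree.
        if not vulnerable and judge_is_secure(code, item["cwe_description"]):
            secure += 1
    return 100.0 * secure / len(responses) if responses else 0.0
```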