FEAT Add LlamaGuard scorer for safety classification of model outputs

## What this proposes

A new scorer that wraps Meta's LlamaGuard models (LlamaGuard-7B / Llama-Guard-3-8B / Llama-Guard-3-1B) and returns a `safe` / `unsafe` verdict on a message, plus the list of violated safety categories (S1–S14 in the MLCommons taxonomy) attached to the score as metadata.

The scorer talks to LlamaGuard through any `PromptChatTarget`, so users can point it at HuggingFace Inference, Together, Groq, Fireworks, a local vLLM/TGI deployment, or any other OpenAI-compatible endpoint. No local `transformers` / `torch` dependency.

## Why

LlamaGuard is one of the most widely used open-weight safety classifiers for LLM red-teaming work (Llama-Guard-3, Aug 2024). Today PyRIT has no built-in way to use it:

- `pyrit/score/` has no scorer that wraps an externally-hosted safety classifier. The closest existing scorers are `azure_content_filter_scorer.py` (Azure-specific) and `prompt_shield_scorer.py` (Microsoft Prompt Shields).
- PyRIT already loads the HarmBench dataset (`pyrit/datasets/seed_datasets/remote/harmbench_dataset.py`), but scoring HarmBench responses today goes through generic Likert / self-ask scorers rather than a purpose-built safety classifier.
- A clean LlamaGuard scorer also gives PyRIT a reusable pattern for future safety-classifier wrappers (ShieldGemma, WildGuard, the HarmBench-paper classifier, etc.) without each one inventing its own abstraction.

## Proposed design

**File:** `pyrit/score/true_false/llamaguard_scorer.py`
**Class:** `LlamaGuardScorer(TrueFalseScorer)`

**Constructor:**

```python
LlamaGuardScorer(
    chat_target: PromptChatTarget,
    *,
    model_version: Literal["llamaguard-7b", "llamaguard-3-8b", "llamaguard-3-1b"] = "llamaguard-3-8b",
    categories: Optional[list[str]] = None,   # default = official taxonomy for the chosen version
    validator: Optional[ScorerPromptValidator] = None,
)
```

**Behavior:**

1. Format the message into the LlamaGuard prompt using the official chat template for the selected version (the original `[INST]` format for v1/v2, the new chat-template format for v3).
2. Send the formatted prompt to `chat_target.send_prompt_async(...)`.
3. Parse the response:
   - First line `"safe"` → `score_value=False`.
   - First line `"unsafe"` → `score_value=True`; second line holds the violated category codes (e.g. `S1, S6`) which go into `Score.score_metadata["violated_categories"]`.
   - Anything else → handled as a bad output (logged, raises a clear error or marks the score with an explanation).
4. The raw classifier output is preserved in the score's rationale string for auditability.

**Test pattern:** modeled on `tests/unit/score/test_self_ask_general_true_false_scorer.py`. Mock the chat target, exercise the three response shapes (safe / unsafe / bad output), and verify the metadata round-trip.

## Open design questions

I'd like maintainer input on a few choices before writing code:

1. **Base class.** `TrueFalseScorer` with categories on `score_metadata` feels natural for a binary verdict. Would you prefer the violated categories surfaced as separate `Score` objects (one per category), so `score_aggregator` can fold them in?
2. **Default model version.** `llamaguard-3-8b` is the most current; `llamaguard-3-1b` is friendlier for users without paid endpoint access. Any preference?
3. **Multimodal.** `Llama-Guard-3-11B-Vision` exists. Out of scope for this PR (text-only first), but is a follow-on PR for `image_path` / `video_path` something you'd accept?
4. **Prompt template.** Vendor the official template inline (auditable, no extra dependency) or fetch it from the model's HF tokenizer at runtime? I'd lean toward vendoring inline with a comment pointing at the upstream source.
5. **Endpoint validation.** Should the scorer probe the configured `chat_target` once on init to confirm it's actually serving LlamaGuard, or trust the user?

## Implementation plan

- **This PR:** Llama-Guard-3-8B and LlamaGuard-7B (legacy) text-only, via any `PromptChatTarget`. Unit tests with mocked target.
- **Follow-on PR (separate issue):** ShieldGemma scorer using the same pattern.
- **Follow-on PR (separate issue):** multimodal Llama-Guard-3-11B-Vision.

Happy to split this PR differently if that's too much for one review.

## References

- Inan et al., *Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations* (2023), [arXiv:2312.06674](https://arxiv.org/abs/2312.06674)
- Llama-Guard-3 model card: [`meta-llama/Llama-Guard-3-8B`](https://huggingface.co/meta-llama/Llama-Guard-3-8B)
- MLCommons AI Safety taxonomy v0.5 / v1.0


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FEAT Add LlamaGuard scorer for safety classification of model outputs #1830

What this proposes

Why

Proposed design

Open design questions

Implementation plan

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

FEAT Add LlamaGuard scorer for safety classification of model outputs #1830

Description

What this proposes

Why

Proposed design

Open design questions

Implementation plan

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions