FEAT Add LlamaGuard scorer for safety classification of model outputs #1830

@immu4989

What this proposes

A new scorer that wraps Meta's LlamaGuard models (LlamaGuard-7B / Llama-Guard-3-8B / Llama-Guard-3-1B) and returns a safe / unsafe verdict on a message, plus the list of violated safety categories attached to the score as metadata (S1–S14 for Llama-Guard-3-8B, which follows the MLCommons hazard taxonomy; the legacy LlamaGuard-7B uses its own O1–O6 codes).

The scorer talks to LlamaGuard through any PromptChatTarget, so users can point it at HuggingFace Inference, Together, Groq, Fireworks, a local vLLM/TGI deployment, or any other OpenAI-compatible endpoint. No local transformers / torch dependency.

Why

LlamaGuard is one of the most widely used open-weight safety classifiers for LLM red-teaming work (Llama-Guard-3, Aug 2024). Today PyRIT has no built-in way to use it:

  • pyrit/score/ has no scorer that wraps an externally-hosted safety classifier. The closest existing scorers are azure_content_filter_scorer.py (Azure-specific) and prompt_shield_scorer.py (Microsoft Prompt Shields).
  • PyRIT already loads the HarmBench dataset (pyrit/datasets/seed_datasets/remote/harmbench_dataset.py), but scoring HarmBench responses today goes through generic Likert / self-ask scorers rather than a purpose-built safety classifier.
  • A clean LlamaGuard scorer also gives PyRIT a reusable pattern for future safety-classifier wrappers (ShieldGemma, WildGuard, the HarmBench-paper classifier, etc.) without each one inventing its own abstraction.

Proposed design

File: pyrit/score/true_false/llamaguard_scorer.py
Class: LlamaGuardScorer(TrueFalseScorer)

Constructor:

LlamaGuardScorer(
    chat_target: PromptChatTarget,
    *,
    model_version: Literal["llamaguard-7b", "llamaguard-3-8b", "llamaguard-3-1b"] = "llamaguard-3-8b",
    categories: Optional[list[str]] = None,   # default = official taxonomy for the chosen version
    validator: Optional[ScorerPromptValidator] = None,
)

Behavior:

  1. Format the message into the LlamaGuard prompt using the official chat template for the selected version (the original [INST] format for v1/v2, the new chat-template format for v3).
  2. Send the formatted prompt to chat_target.send_prompt_async(...).
  3. Parse the response:
    • First line "safe" → score_value=False.
    • First line "unsafe" → score_value=True; second line holds the violated category codes (e.g. S1, S6) which go into Score.score_metadata["violated_categories"].
    • Anything else → handled as a bad output (logged, raises a clear error or marks the score with an explanation).
  4. The raw classifier output is preserved in the score's rationale string for auditability.
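A minimal sketch of the parsing in step 3, assuming the classifier returns plain text with the verdict on the first line and comma-separated category codes on the second. LlamaGuardVerdict and parse_llamaguard_output are hypothetical names used for illustration only; the real scorer would build a PyRIT Score rather than this dataclass.

```python
from dataclasses import dataclass, field


@dataclass
class LlamaGuardVerdict:
    """Hypothetical container standing in for a PyRIT Score."""

    unsafe: bool
    violated_categories: list[str] = field(default_factory=list)
    raw_output: str = ""  # kept verbatim for auditability (step 4)


def parse_llamaguard_output(raw: str) -> LlamaGuardVerdict:
    """Parse 'safe' or 'unsafe' plus a category line; raise on anything else."""
    # Drop blank lines so leading/trailing whitespace does not shift the verdict.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        raise ValueError("Empty LlamaGuard output")

    verdict = lines[0].lower()
    if verdict == "safe":
        return LlamaGuardVerdict(unsafe=False, raw_output=raw)
    if verdict == "unsafe":
        codes: list[str] = []
        if len(lines) > 1:
            # Assumed format: codes separated by commas, e.g. "S1,S6".
            codes = [code.strip() for code in lines[1].split(",") if code.strip()]
        return LlamaGuardVerdict(unsafe=True, violated_categories=codes, raw_output=raw)

    # Bad output: this sketch raises; marking the score is the other option.
    raise ValueError(f"Unexpected LlamaGuard output: {raw!r}")
```

Raising on malformed output is only one of the two options listed above; returning a score with an explanatory rationale would change the last line but not the rest of the parser.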

Test pattern: modeled on tests/unit/score/test_self_ask_general_true_false_scorer.py. Mock the chat target, exercise the three response shapes (safe / unsafe / bad output), and verify the metadata round-trip.
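The mocked-target pattern could look roughly like the sketch below. It uses unittest.mock.AsyncMock as a stand-in for the chat target and a hypothetical is_unsafe helper in place of the scorer's scoring method. The send_prompt_async call here is deliberately simplified to plain strings in and out, which is an assumption for illustration; the real PyRIT method works with message objects.

```python
import asyncio
from unittest.mock import AsyncMock


async def is_unsafe(target, text: str) -> bool:
    """Hypothetical helper standing in for the scorer's scoring method."""
    # Simplified stand-in call: plain str in, plain str out (an assumption).
    raw = await target.send_prompt_async(text)
    stripped = raw.strip()
    first_line = stripped.splitlines()[0].strip().lower() if stripped else ""
    if first_line not in ("safe", "unsafe"):
        raise ValueError(f"Unexpected LlamaGuard output: {raw!r}")
    return first_line == "unsafe"


def run_case(canned_response: str) -> bool:
    """Drive is_unsafe against a mocked target that returns canned_response."""
    target = AsyncMock()
    target.send_prompt_async.return_value = canned_response
    result = asyncio.run(is_unsafe(target, "some model output"))
    # The classifier should be called exactly once per scored message.
    target.send_prompt_async.assert_awaited_once()
    return result
```

The three response shapes then map to three cases: run_case("safe"), run_case("unsafe\nS1,S6"), and a malformed response that is expected to raise.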

Open design questions

I'd like maintainer input on a few choices before writing code:

  1. Base class. TrueFalseScorer with categories on score_metadata feels natural for a binary verdict. Would you prefer the violated categories surfaced as separate Score objects (one per category), so score_aggregator can fold them in?
  2. Default model version. llamaguard-3-8b is the most current; llamaguard-3-1b is friendlier for users without paid endpoint access. Any preference?
  3. Multimodal. Llama-Guard-3-11B-Vision exists. Out of scope for this PR (text-only first), but is a follow-on PR for image_path / video_path something you'd accept?
  4. Prompt template. Vendor the official template inline (auditable, no extra dependency) or fetch it from the model's HF tokenizer at runtime? I'd lean toward vendoring inline with a comment pointing at the upstream source.
  5. Endpoint validation. Should the scorer probe the configured chat_target once on init to confirm it's actually serving LlamaGuard, or trust the user?
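For question 5, one possible shape of an init-time probe is sketched below. It assumes a hypothetical async send callable (formatted prompt in, raw text out) rather than the real PromptChatTarget interface, and it only checks that the reply has LlamaGuard's output shape.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical transport: takes a formatted prompt, returns the raw reply text.
SendFn = Callable[[str], Awaitable[str]]


async def looks_like_llamaguard(send: SendFn, probe_prompt: str) -> bool:
    """Return True if the endpoint answers a probe in LlamaGuard's output shape."""
    raw = await send(probe_prompt)
    stripped = raw.strip()
    if not stripped:
        return False
    first_line = stripped.splitlines()[0].strip().lower()
    return first_line in ("safe", "unsafe")


async def _demo() -> None:
    async def fake_guard(prompt: str) -> str:
        return "safe"

    async def fake_chat_model(prompt: str) -> str:
        return "Sure! Here is a poem about spring."

    print(await looks_like_llamaguard(fake_guard, "benign probe"))
    print(await looks_like_llamaguard(fake_chat_model, "benign probe"))


if __name__ == "__main__":
    asyncio.run(_demo())
```

A probe like this catches an endpoint that is clearly serving a general chat model, but it cannot prove the endpoint is LlamaGuard, and it costs one extra request per scorer instance. That trade-off is part of what the question is asking.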

Implementation plan

  • This PR: Llama-Guard-3-8B and LlamaGuard-7B (legacy) text-only, via any PromptChatTarget. Unit tests with mocked target.
  • Follow-on PR (separate issue): ShieldGemma scorer using the same pattern.
  • Follow-on PR (separate issue): multimodal Llama-Guard-3-11B-Vision.

Happy to split this PR differently if that's too much for one review.
