FEAT: Realtime streaming session support and server-side barge-in attack#1766
adrian-gavrila wants to merge 37 commits
I'm a little concerned with how this attack deviates from the other attacks in that it doesn't use the `send_prompt_async` workflow. The attack is doing a lot of plumbing that other attacks delegate to `send_async`: audio normalization (which bypasses the pipeline), the swap + response trigger, message construction, and memory persistence. That makes it inconsistent with the standard "build prompt → `send_prompt_async` → response" workflow and tightly couples it to `RealtimeTarget` internals. Could we push the streaming session work into `RealtimeTarget.send_async` so the attack just does: Connect + subscribe
is None or 0, or when the computed trim would leave nothing.

Raises:
    ValueError: If ``audio_start_ms`` is negative.
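The docstring fragment above describes a trim keyed on `audio_start_ms`. The following is only a rough sketch of what such a trim could look like, not the PR's implementation. The function name `trim_pcm16_start`, the 24 kHz mono PCM16 format, and the choice to return the audio unchanged in the "None or 0" and "nothing left" cases are all assumptions made for illustration.

```python
from typing import Optional


def trim_pcm16_start(
    audio: bytes,
    audio_start_ms: Optional[int],
    sample_rate_hz: int = 24000,  # assumed sample rate, not taken from the PR
) -> bytes:
    """Drop the first ``audio_start_ms`` milliseconds of mono 16-bit PCM.

    Hypothetical helper: returns the input unchanged when there is nothing
    to trim or when trimming would leave no audio at all.
    """
    if audio_start_ms is None:
        return audio
    if audio_start_ms < 0:
        raise ValueError("audio_start_ms must be non-negative")
    if audio_start_ms == 0:
        return audio

    # Whole samples to skip, then 2 bytes per 16-bit sample keeps alignment.
    samples_to_skip = audio_start_ms * sample_rate_hz // 1000
    offset = samples_to_skip * 2
    if offset >= len(audio):
        return audio
    return audio[offset:]


# 100 ms of silence at 24 kHz, 16-bit mono: 2400 samples, 4800 bytes.
silence = bytes(4800)
trimmed = trim_pcm16_start(silence, 50)
```

Computing the offset in whole samples before converting to bytes keeps the slice on a sample boundary, which matters for rates that are not a multiple of 1000.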
Description
Adds persistent streaming session support to `OpenAIRealtimeTarget` and introduces `BargeInAttack`, a streaming attack that leverages server-side VAD to detect and exploit barge-in (interruption) behavior. Previously the target only supported single-turn fire-and-forget audio exchanges; this PR adds the transport primitives needed for multi-turn streaming sessions with incremental audio push, event subscription, and mid-session response requests.

When the server detects new user speech while the assistant is still responding, the in-flight response is automatically interrupted and the conversation history is truncated to match what was actually delivered.
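To make the truncation rule concrete, here is a small self-contained model of it. It is only an illustration: per the description, the real interruption and truncation happen on the server, and `AssistantTurn`, `truncate_on_barge_in`, and the per-chunk durations are hypothetical names and data invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class AssistantTurn:
    """Hypothetical record of one assistant response."""

    # Each entry is (transcript fragment, audio duration in milliseconds).
    chunks: list
    interrupted: bool = False


def truncate_on_barge_in(turn: AssistantTurn, played_ms: int) -> AssistantTurn:
    """Keep only fragments whose audio fully played before the interruption."""
    kept = []
    elapsed_ms = 0
    for text, duration_ms in turn.chunks:
        if elapsed_ms + duration_ms > played_ms:
            break  # this fragment was cut off, so it was never fully delivered
        kept.append((text, duration_ms))
        elapsed_ms += duration_ms
    return AssistantTurn(chunks=kept, interrupted=True)


turn = AssistantTurn(chunks=[("Sure, ", 400), ("here is ", 500), ("the answer.", 700)])
# The user starts speaking 1000 ms into the assistant's audio.
truncated = truncate_on_barge_in(turn, played_ms=1000)
delivered_text = "".join(text for text, _ in truncated.chunks)
```

In this model the third fragment is dropped because its audio would have finished at 1600 ms, after the interruption point.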
Key additions:
- `OpenAIRealtimeTarget` streaming primitives: `connect_async`, `push_audio_chunk_async`, `insert_user_audio_async`, `subscribe_events_async`, `request_response_async`, `send_streaming_session_config_async`. These expose transport-level operations over a persistent WebSocket connection.
- `_RealtimeEventDispatcher`: ABC that owns a realtime connection's event stream, routes provider-specific events to the active turn, and fires an `on_user_audio_committed` callback when server VAD finalizes a turn. Provider-specific routing is isolated to the `_route_event` / `_cancel` abstract methods.
- `BargeInAttack`: streaming attack that pushes audio chunks into a persistent session, applies configured converters on each server-committed turn (convert-on-commit), requests responses, and tracks interruptions. Per-turn `Message` pairs are persisted to `CentralMemory` with `prompt_metadata["interrupted"] = True` on interrupted turns.
- `ServerVadConfig` / `RealtimeTargetResult`: shared types for configuring server VAD and representing turn results (audio, transcripts, interruption flag).
- `PromptNormalizer.convert_audio_async`: applies audio converter configurations to raw PCM bytes for streaming attacks that hold audio mid-turn rather than a `Message`.

The target exposes only transport primitives; all attack logic (buffering, the convert-on-commit dance, interruption signaling) lives in `BargeInAttack`.

Tests and Documentation
- `realtime_audio.py`, 72% on `openai_realtime_target.py` (uncovered lines are pre-existing code paths, not new additions).
- `doc/code/executor/attack/barge_in_attack.py` (jupytext py:percent format) demonstrates the attack against a live OpenAI Realtime API endpoint with server VAD. Ran successfully against `gpt-4o-realtime-preview`; outputs cleared for CI (requires live credentials).
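The split between a provider-neutral dispatcher and attack-side logic can be pictured with a toy model. This is a sketch under stated assumptions, not the PR's code: `DispatcherSketch`, `FakeProviderDispatcher`, the `fake.*` event names, and the `dispatch` method are invented for illustration. Only the `_route_event` / `_cancel` hooks and the `on_user_audio_committed` callback name come from the description, and their real signatures are not shown there.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Awaitable, Callable, Optional, Tuple

CommitCallback = Callable[[bytes], Awaitable[None]]


class DispatcherSketch(ABC):
    """Hypothetical stand-in for the dispatcher role; signatures are assumed."""

    def __init__(self, on_user_audio_committed: CommitCallback) -> None:
        self._on_user_audio_committed = on_user_audio_committed
        self.response_in_flight = False
        self.interrupted_turns = 0

    async def dispatch(self, event: dict) -> None:
        kind, audio = self._route_event(event)
        if kind == "speech_started" and self.response_in_flight:
            # Barge-in: new user speech arrived while a response was playing.
            self._cancel()
            self.response_in_flight = False
            self.interrupted_turns += 1
        elif kind == "committed" and audio is not None:
            await self._on_user_audio_committed(audio)

    @abstractmethod
    def _route_event(self, event: dict) -> Tuple[str, Optional[bytes]]:
        """Map a provider event to a provider-neutral (kind, audio) pair."""

    @abstractmethod
    def _cancel(self) -> None:
        """Cancel the in-flight response in a provider-specific way."""


class FakeProviderDispatcher(DispatcherSketch):
    """Toy provider with made-up event names, used only for this demo."""

    def __init__(self, on_user_audio_committed: CommitCallback) -> None:
        super().__init__(on_user_audio_committed)
        self.cancel_calls = 0

    def _route_event(self, event: dict) -> Tuple[str, Optional[bytes]]:
        if event["type"] == "fake.committed":
            return "committed", event["audio"]
        if event["type"] == "fake.speech_started":
            return "speech_started", None
        return "ignored", None

    def _cancel(self) -> None:
        self.cancel_calls += 1


async def run_demo():
    sent = []
    converters = [lambda pcm: pcm + pcm]  # placeholder converter: repeat audio

    async def convert_on_commit(audio: bytes) -> None:
        # Attack-side logic: convert the committed turn, then "send" it.
        for convert in converters:
            audio = convert(audio)
        sent.append(audio)
        dispatcher.response_in_flight = True  # stands in for requesting a response

    dispatcher = FakeProviderDispatcher(convert_on_commit)
    await dispatcher.dispatch({"type": "fake.committed", "audio": b"\x01\x02"})
    await dispatcher.dispatch({"type": "fake.speech_started"})
    return sent, dispatcher


sent, dispatcher = asyncio.run(run_demo())
```

The point of the split is that a second provider would only need its own `_route_event` and `_cancel`; the convert-on-commit callback stays in the attack and never sees provider event shapes.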