feat(parakeet-cpp): dynamic batching for concurrent transcription requests #10112

Open
localai-bot wants to merge 7 commits into master from feat/parakeet-dynamic-batching

@localai-bot localai-bot commented May 31, 2026

Summary

Adds dynamic batching to the parakeet-cpp backend so concurrent
/v1/audio/transcriptions requests are coalesced into one batched call through
parakeet.cpp's batched encoder/decoder. This is a GPU throughput feature: under
concurrent load the batched path raises utilization. It is off by default
(batch_max_size: 1); raise it to opt in. On CPU it does not help (the GEMMs
already saturate the threads and padding adds work), so leave it at 1 there.

What changed (all under backend/go/parakeet-cpp/):

  • New in-process batcher (batcher.go): handler goroutines submit requests; one
    dispatcher goroutine accumulates them until batch_max_size or
    batch_max_wait_ms, then makes a single batched engine call. The dispatcher is
    the sole caller of the C engine, so engine access stays single-threaded.
  • The backend drops base.SingleThread (which serialized every call) for
    base.Base, so concurrent AudioTranscription handlers actually run and reach
    the batcher. An engineMu keeps the streaming path and batched-unary mutually
    exclusive on the one shared engine context.
  • AudioTranscription decodes the file, submits to the batcher, and shapes the
    per-item JSON exactly as before (text, word/segment timestamps, tokens).
  • Two model options: batch_max_size (default 1 = off) and batch_max_wait_ms
    (default 15). Raise batch_max_size (e.g. 4 to 16) to enable batching on GPU
    under concurrent load.
  • Docs: docs/content/features/audio-to-text.md.

Dependency

Requires the parakeet.cpp side that adds the
parakeet_capi_transcribe_pcm_batch_json C-API (batched transcription with
timestamps). The backend binds that symbol via purego at runtime, so this Go
code builds without it and falls back to per-request transcription if it is
absent. PARAKEET_VERSION is pinned to 8a7c482 (parakeet.cpp master with the
batched decode and the B=1 encoder fast-path), so the backend image ships a
libparakeet.so that has the batched path.
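
The optional-symbol fallback can be illustrated with a small, library-free sketch. It does not use purego: the `exports` map is a stand-in for a dlsym-style probe, and `engine`, `bind`, and `transcribeAll` are hypothetical names. Only the symbol name comes from this PR. The point is the shape of the decision: bind the batched entry point only if the loaded library exports it, otherwise loop over the per-request call.

```go
package main

import "fmt"

// Symbol name taken from the PR description; everything else is illustrative.
const batchSymbol = "parakeet_capi_transcribe_pcm_batch_json"

type singleFunc func(clip []float32) (string, error)
type batchFunc func(clips [][]float32) ([]string, error)

// engine holds the bound entry points. batch stays nil when the loaded
// library does not export the batched symbol.
type engine struct {
	single singleFunc
	batch  batchFunc
}

// bind registers the batched entry point only if the probe finds it.
// exports simulates what a dlsym-style lookup would report.
func bind(exports map[string]bool, single singleFunc, batch batchFunc) *engine {
	e := &engine{single: single}
	if exports[batchSymbol] {
		e.batch = batch
	}
	return e
}

// transcribeAll uses one batched call when available and otherwise falls
// back to one call per clip.
func (e *engine) transcribeAll(clips [][]float32) ([]string, error) {
	if e.batch != nil {
		return e.batch(clips)
	}
	out := make([]string, len(clips))
	for i, c := range clips {
		text, err := e.single(c)
		if err != nil {
			return nil, fmt.Errorf("clip %d: %w", i, err)
		}
		out[i] = text
	}
	return out, nil
}

func main() {
	single := func(c []float32) (string, error) {
		return fmt.Sprintf("clip of %d samples", len(c)), nil
	}
	batch := func(cs [][]float32) ([]string, error) {
		out := make([]string, len(cs))
		for i, c := range cs {
			out[i] = fmt.Sprintf("clip of %d samples (batched)", len(c))
		}
		return out, nil
	}
	clips := [][]float32{make([]float32, 3), make([]float32, 5)}

	older := bind(map[string]bool{}, single, batch) // library without the symbol
	newer := bind(map[string]bool{batchSymbol: true}, single, batch)

	a, _ := older.transcribeAll(clips)
	b, _ := newer.transcribeAll(clips)
	fmt.Println(a)
	fmt.Println(b)
}
```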

Test plan

  • Pure-Go batcher unit tests pass under -race: go test ./backend/go/parakeet-cpp/ -run TestBatcher -race (coalescing, size trigger, window trigger, size-1 bypass).

  • go build / go vet clean (one pre-existing unrelated unsafe.Pointer warning).

  • End-to-end on GPU (dgx, NVIDIA GB10, parakeet-tdt_ctc-110m f16). Built this branch's backend (CUDA, parakeet.cpp 8a7c482) and drove the real parakeet-cpp-grpc backend with 16 concurrent clients issuing 96 AudioTranscription requests of a ~7s clip, varying batch_max_size:

    | batch_max_size | throughput | vs batch=1 | failures |
    |---|---|---|---|
    | 1 (sequential) | 43.1 req/s | - | 0/96 |
    | 4 | 71.4 req/s | 1.66x | 0/96 |
    | 8 | 79.2 req/s | 1.84x | 0/96 |

    All requests succeeded at every batch size, confirming that the full path
    (batcher -> batch C-API parakeet_capi_transcribe_pcm_batch_json -> batched
    decode) is correct end to end. Throughput rises ~1.84x at batch_max_size: 8
    from changing that option alone, under concurrent load. This is below the
    decode-only microbench (~10-12x on the same GPU via parakeet-cli
    bench-decode) because the end-to-end path also pays for the encoder
    (compute-bound, no batching win), WAV decode, and gRPC/JSON overhead per
    request. Encoder-only batching gave no end-to-end win; the decode batching
    is what turns concurrency into throughput.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

@mudler force-pushed the feat/parakeet-dynamic-batching branch 2 times, most recently from 795d2ed to 27d7d0d on June 1, 2026 12:56
mudler added 7 commits June 1, 2026 21:35
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ed JSON C-API

Drop SingleThread; route unary transcription through the in-process batcher
which coalesces concurrent requests into one batched engine call. Streaming
stays mutually exclusive via engineMu. Adds batch_max_size / batch_max_wait_ms
options (size=1 disables; recommended on CPU).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…eallocate; clarify stream lock

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… with per-request fallback

The batched JSON C-API symbol exists only in newer libparakeet.so (ABI >= 2);
probe it with Dlsym and register optionally so the backend still loads against
an older library, falling back to per-request transcription. Rewrites the
batcher unit tests as Ginkgo/Gomega specs (forbidigo bans t.Fatal in tests).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Dynamic batching now defaults off (batch_max_size:1, one request at a
time). Raise batch_max_size to opt in: it is a large throughput win on
GPU under concurrent load, but on CPU and low-concurrency setups it only
adds latency, so off is the safer default. The startup log now states
whether batching is on or off, and the audio-to-text docs are updated to
match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…=1 fast-path)

parakeet.cpp PR #1 merged the batched encoder/decode and the B=1 encoder
fast-path to master. Point PARAKEET_VERSION at that commit so the backend
builds the batched C-API (parakeet_capi_transcribe_pcm_batch_json) that the
dynamic batcher calls; the prior pin (30a3075) predated it, so only the
per-request fallback path was exercised. Verified the shared lib builds with
the backend's CMake flags and exports the batch symbol.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler force-pushed the feat/parakeet-dynamic-batching branch from 7b6414b to 14cd9b2 on June 1, 2026 21:37