feat(parakeet-cpp): dynamic batching for concurrent transcription requests by localai-bot · Pull Request #10112 · mudler/LocalAI

localai-bot · 2026-05-31T19:29:16Z

Summary

Adds dynamic batching to the parakeet-cpp backend so concurrent
/v1/audio/transcriptions requests are coalesced into one batched call through
parakeet.cpp's batched encoder/decoder. This is a GPU throughput feature: under
concurrent load the batched path raises utilization. It is off by default
(batch_max_size: 1); raise it to opt in. On CPU it does not help (the GEMMs
already saturate the threads and padding adds work), so leave it at 1 there.

What changed (all under backend/go/parakeet-cpp/):

- New in-process batcher (batcher.go): handler goroutines submit requests. A single dispatcher goroutine accumulates them until batch_max_size requests are queued or batch_max_wait_ms elapses. It then makes one batched engine call. The dispatcher is the sole caller of the C engine, so engine access stays single-threaded.
- The backend switches from base.SingleThread (which serialized every call) to base.Base. This lets concurrent AudioTranscription handlers actually run concurrently and reach the batcher. An engineMu mutex keeps the streaming path and the batched unary path mutually exclusive on the single shared engine context.
- AudioTranscription decodes the file, submits it to the batcher, and shapes the per-item JSON exactly as before (text, word/segment timestamps, tokens).
- Two model options: batch_max_size (default 1 = off) and batch_max_wait_ms (default 15). Raise batch_max_size (e.g. 4 to 16) to enable batching on GPU under concurrent load.
- Docs: docs/content/features/audio-to-text.md.

Dependency

Requires the parakeet.cpp side that adds the
parakeet_capi_transcribe_pcm_batch_json C-API (batched transcription with
timestamps). The backend binds that symbol via purego at runtime, so this Go
code builds without it and falls back to per-request transcription if it is
absent. PARAKEET_VERSION is pinned to 8a7c482 (parakeet.cpp master with the
batched decode and the B=1 encoder fast-path), so the backend image ships a
libparakeet.so that has the batched path.

Test plan

- Pure-Go batcher unit tests pass under -race: go test ./backend/go/parakeet-cpp/ -run TestBatcher -race (coalescing, size trigger, window trigger, size-1 bypass).
- go build and go vet are clean, apart from one pre-existing, unrelated unsafe.Pointer warning.
- End-to-end on GPU (dgx, NVIDIA GB10, parakeet-tdt_ctc-110m f16): built this branch's backend (CUDA, parakeet.cpp 8a7c482). Drove the real parakeet-cpp-grpc backend with 16 concurrent clients issuing 96 AudioTranscription requests of a ~7 s clip, varying batch_max_size:

| `batch_max_size` | throughput | speedup vs `batch_max_size` 1 | failures |
|---|---|---|---|
| 1 (sequential) | 43.1 req/s | - | 0/96 |
| 4 | 71.4 req/s | 1.66x | 0/96 |
| 8 | 79.2 req/s | 1.84x | 0/96 |
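The speedup column is each row's throughput divided by the batch_max_size 1 baseline. A quick check of the rounding, using only the figures in the table (the speedup helper is a hypothetical name):

```go
package main

import "fmt"

// speedup returns throughput relative to the batch_max_size 1 baseline.
func speedup(baseline, reqPerSec float64) float64 {
	return reqPerSec / baseline
}

func main() {
	const baseline = 43.1 // req/s at batch_max_size 1, from the table
	rows := []struct {
		size int
		rps  float64
	}{{4, 71.4}, {8, 79.2}}
	for _, r := range rows {
		fmt.Printf("batch_max_size %d: %.2fx\n", r.size, speedup(baseline, r.rps))
	}
}
```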

All requests succeeded at every batch size. This confirms that the path from the batcher through the batch C-API (parakeet_capi_transcribe_pcm_batch_json) to the batched decode works end to end. Under concurrent load, throughput rises ~1.84x at batch_max_size: 8 from changing that option alone.

This is below the decode-only microbenchmark (~10-12x on the same GPU via parakeet-cli bench-decode). The end-to-end path also pays, per request, for the encoder (compute-bound, no batching win), WAV decoding, and gRPC/JSON overhead. Batching only the encoder gave no end-to-end win; it is the decode batching that turns concurrency into throughput.
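The gap between the decode-only and end-to-end numbers is what Amdahl's law predicts when only part of each request's time benefits from batching. This is a rough model, not a measurement. It treats the ~10-12x decode microbench as the speedup s of a batchable fraction p of per-request time, and 1.84x as the overall speedup. Inverting the formula then gives the p consistent with both. The function names are hypothetical. The model also ignores that a concurrent-throughput figure and a microbenchmark are not strictly comparable.

```go
package main

import "fmt"

// amdahl returns the overall speedup when a fraction p of per-request time is
// sped up by a factor s and the rest (encoder, WAV decode, gRPC/JSON) is not.
func amdahl(p, s float64) float64 {
	return 1 / ((1 - p) + p/s)
}

// impliedFraction inverts amdahl: given an observed overall speedup and the
// speedup of the batched part, it returns the fraction p consistent with both.
func impliedFraction(overall, s float64) float64 {
	return (1 - 1/overall) / (1 - 1/s)
}

func main() {
	for _, s := range []float64{10, 12} { // decode-only microbench range
		p := impliedFraction(1.84, s) // 1.84x measured at batch_max_size 8
		fmt.Printf("s=%.0f: p=%.2f, round trip=%.2fx\n", s, p, amdahl(p, s))
	}
}
```

With these inputs, p comes out at roughly 0.5. In other words, the numbers are consistent with about half of per-request time sitting outside the batched decode. That fits the encoder/WAV/gRPC explanation.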

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…ed JSON C-API

Drop SingleThread; route unary transcription through the in-process batcher, which coalesces concurrent requests into one batched engine call. Streaming stays mutually exclusive via engineMu. Adds batch_max_size / batch_max_wait_ms options (size=1 disables batching; recommended on CPU).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>


… with per-request fallback

The batched JSON C-API symbol exists only in newer libparakeet.so (ABI >= 2). Probe it with Dlsym and register it optionally, so the backend still loads against an older library and falls back to per-request transcription. Rewrites the batcher unit tests as Ginkgo/Gomega specs (forbidigo bans t.Fatal in tests).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>


Dynamic batching now defaults off (batch_max_size:1, one request at a time). Raise batch_max_size to opt in. It is a large throughput win on GPU under concurrent load, but on CPU and in low-concurrency setups it only adds latency, so off is the safer default. The startup log now states whether batching is on or off, and the audio-to-text docs are updated to match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

…=1 fast-path)

parakeet.cpp PR #1 merged the batched encoder/decode and the B=1 encoder fast-path to master. Point PARAKEET_VERSION at that commit so the backend builds the batched C-API (parakeet_capi_transcribe_pcm_batch_json) that the dynamic batcher calls. The prior pin (30a3075) predated it, so only the per-request fallback path was exercised. Verified the shared lib builds with the backend's CMake flags and exports the batch symbol.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

mudler force-pushed the feat/parakeet-dynamic-batching branch 2 times, most recently from 795d2ed to 27d7d0d on June 1, 2026 12:56

mudler added 7 commits June 1, 2026 21:35

feat(parakeet-cpp): dynamic-batching scheduler (queue + dispatcher)

edfbc9a

Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

fix(parakeet-cpp): tear down dispatcher in Free; log batch config; pr…

fbad6a9

…eallocate; clarify stream lock Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

feat(parakeet-cpp): debug-log coalesced batch size in runBatch

8ab6150

Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler force-pushed the feat/parakeet-dynamic-batching branch from 7b6414b to 14cd9b2 on June 1, 2026 21:37

localai-bot wants to merge 7 commits into master from feat/parakeet-dynamic-batching


