feat(parakeet-cpp): dynamic batching for concurrent transcription requests #10112
Open
localai-bot wants to merge 7 commits into
795d2ed to
27d7d0d
Compare
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ed JSON C-API Drop SingleThread; route unary transcription through the in-process batcher which coalesces concurrent requests into one batched engine call. Streaming stays mutually exclusive via engineMu. Adds batch_max_size / batch_max_wait_ms options (size=1 disables; recommended on CPU). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…eallocate; clarify stream lock Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… with per-request fallback The batched JSON C-API symbol exists only in newer libparakeet.so (ABI >= 2); probe it with Dlsym and register optionally so the backend still loads against an older library, falling back to per-request transcription. Rewrites the batcher unit tests as Ginkgo/Gomega specs (forbidigo bans t.Fatal in tests). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Dynamic batching now defaults off (batch_max_size:1, one request at a time). Raise batch_max_size to opt in: it is a large throughput win on GPU under concurrent load, but on CPU and low-concurrency setups it only adds latency, so off is the safer default. The startup log now states whether batching is on or off, and the audio-to-text docs are updated to match. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…=1 fast-path) parakeet.cpp PR #1 merged the batched encoder/decode and the B=1 encoder fast-path to master. Point PARAKEET_VERSION at that commit so the backend builds the batched C-API (parakeet_capi_transcribe_pcm_batch_json) that the dynamic batcher calls; the prior pin (30a3075) predated it, so only the per-request fallback path was exercised. Verified the shared lib builds with the backend's CMake flags and exports the batch symbol. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
7b6414b to
14cd9b2
Compare
Summary
Adds dynamic batching to the `parakeet-cpp` backend so concurrent `/v1/audio/transcriptions` requests are coalesced into one batched call through parakeet.cpp's batched encoder/decoder. This is a GPU throughput feature: under concurrent load the batched path raises utilization. It is off by default (`batch_max_size: 1`); raise it to opt in. On CPU it does not help (the GEMMs already saturate the threads and padding adds work), so leave it at 1 there.
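The coalescing described in this summary can be sketched in pure Go as follows. This is a minimal illustration, not the code in `batcher.go`: the names `request`, `batcher`, `newBatcher`, `Submit`, and `fakeEngine` are hypothetical, and `fakeEngine` stands in for the batched engine call. A single dispatcher goroutine collects requests until the size limit is reached or the wait window elapses, then makes one batched call.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// request is one pending transcription; reply receives its result.
type request struct {
	pcm   []float32
	reply chan string
}

// batcher coalesces requests from many goroutines into batched calls.
// All identifiers here are illustrative, not the backend's actual names.
type batcher struct {
	in      chan request
	maxSize int
	maxWait time.Duration
	run     func(batch [][]float32) []string // stand-in for the engine call
}

// newBatcher mirrors the batch_max_size / batch_max_wait_ms options.
func newBatcher(maxSize, maxWaitMs int, run func([][]float32) []string) *batcher {
	b := &batcher{
		in:      make(chan request),
		maxSize: maxSize,
		maxWait: time.Duration(maxWaitMs) * time.Millisecond,
		run:     run,
	}
	go b.dispatch()
	return b
}

// Submit blocks until the dispatcher has processed this request.
func (b *batcher) Submit(pcm []float32) string {
	r := request{pcm: pcm, reply: make(chan string, 1)}
	b.in <- r
	return <-r.reply
}

// dispatch is the only goroutine that calls run, so the engine is never
// entered concurrently. The input channel is never closed in this sketch.
func (b *batcher) dispatch() {
	for first := range b.in {
		batch := []request{first}
		timer := time.NewTimer(b.maxWait)
	collect:
		for len(batch) < b.maxSize { // maxSize == 1 skips the wait entirely
			select {
			case r := <-b.in:
				batch = append(batch, r)
			case <-timer.C:
				break collect
			}
		}
		timer.Stop()
		pcms := make([][]float32, len(batch))
		for i, r := range batch {
			pcms[i] = r.pcm
		}
		for i, text := range b.run(pcms) {
			batch[i].reply <- text
		}
	}
}

// fakeEngine is a hypothetical engine that reports each clip's sample count.
func fakeEngine(batch [][]float32) []string {
	out := make([]string, len(batch))
	for i, pcm := range batch {
		out[i] = fmt.Sprintf("%d samples", len(pcm))
	}
	return out
}

func main() {
	b := newBatcher(4, 15, fakeEngine)
	var wg sync.WaitGroup
	for n := 1; n <= 4; n++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			fmt.Println(b.Submit(make([]float32, n)))
		}(n)
	}
	wg.Wait()
}
```

With a size of 1 the collection loop never runs, so each request is dispatched immediately, which matches the "size=1 disables" behavior described in the commits.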
What changed (all under `backend/go/parakeet-cpp/`):

- Batcher (`batcher.go`): handler goroutines submit requests; one dispatcher goroutine accumulates them until `batch_max_size` is reached or `batch_max_wait_ms` elapses, then makes a single batched engine call. The dispatcher is the sole caller of the C engine, so engine access stays single-threaded.
- Swaps `base.SingleThread` (which serialized every call) for `base.Base`, so concurrent `AudioTranscription` handlers actually run and reach the batcher. An `engineMu` keeps the streaming path and batched-unary mutually exclusive on the one shared engine context.
- `AudioTranscription` decodes the file, submits to the batcher, and shapes the per-item JSON exactly as before (text, word/segment timestamps, tokens).
- Options: `batch_max_size` (default 1 = off) and `batch_max_wait_ms` (default 15). Raise `batch_max_size` (e.g. 4 to 16) to enable batching on GPU under concurrent load.
- Docs: `docs/content/features/audio-to-text.md`.

Dependency
Requires the parakeet.cpp side that adds the `parakeet_capi_transcribe_pcm_batch_json` C-API (batched transcription with timestamps). The backend binds that symbol via purego at runtime, so this Go code builds without it and falls back to per-request transcription if it is absent.
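The optional binding and fallback can be illustrated with a small sketch. It deliberately avoids purego so that it runs with the standard library alone: `library` is a hypothetical map standing in for a loaded shared object, `singleSymbol` is an invented name for the per-request entry point, and `bind`, `engine`, and `transcribeAll` are illustrative. Only the batch symbol name comes from this PR.

```go
package main

import "fmt"

type (
	singleFn func(clip []float32) (string, error)
	batchFn  func(clips [][]float32) ([]string, error)
)

// library maps exported symbol names to functions. It is a hypothetical
// stand-in for a shared-library handle probed at runtime.
type library map[string]any

const (
	batchSymbol  = "parakeet_capi_transcribe_pcm_batch_json" // named in this PR
	singleSymbol = "transcribe_pcm_single"                   // hypothetical name
)

type engine struct {
	single singleFn
	batch  batchFn // nil when the library predates the batched C-API
}

// bind requires the per-request symbol and registers the batch symbol only
// if it is present, so an older library still loads.
func bind(lib library) (*engine, error) {
	single, ok := lib[singleSymbol].(singleFn)
	if !ok {
		return nil, fmt.Errorf("required symbol %q not found", singleSymbol)
	}
	e := &engine{single: single}
	if batch, ok := lib[batchSymbol].(batchFn); ok {
		e.batch = batch
	}
	return e, nil
}

// transcribeAll uses the batched entry point when available and otherwise
// falls back to one call per clip.
func (e *engine) transcribeAll(clips [][]float32) ([]string, error) {
	if e.batch != nil {
		return e.batch(clips)
	}
	out := make([]string, len(clips))
	for i, c := range clips {
		text, err := e.single(c)
		if err != nil {
			return nil, err
		}
		out[i] = text
	}
	return out, nil
}

// fakeSingle and fakeBatch tag their output so the chosen path is visible.
func fakeSingle(clip []float32) (string, error) {
	return fmt.Sprintf("single:%d", len(clip)), nil
}

func fakeBatch(clips [][]float32) ([]string, error) {
	out := make([]string, len(clips))
	for i, c := range clips {
		out[i] = fmt.Sprintf("batch:%d", len(c))
	}
	return out, nil
}

func main() {
	clips := [][]float32{{0}, {0, 0}}
	older, _ := bind(library{singleSymbol: singleFn(fakeSingle)})
	newer, _ := bind(library{singleSymbol: singleFn(fakeSingle), batchSymbol: batchFn(fakeBatch)})
	for _, e := range []*engine{older, newer} {
		texts, err := e.transcribeAll(clips)
		fmt.Println(texts, err)
	}
}
```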
`PARAKEET_VERSION` is pinned to `8a7c482` (parakeet.cpp master with the batched decode and the B=1 encoder fast-path), so the backend image ships a `libparakeet.so` that has the batched path.

Test plan
- Pure-Go batcher unit tests pass under `-race`: `go test ./backend/go/parakeet-cpp/ -run TestBatcher -race` (coalescing, size trigger, window trigger, size-1 bypass).
- `go build` / `go vet` clean (one pre-existing unrelated unsafe.Pointer warning).
- End-to-end on GPU (dgx, NVIDIA GB10, `parakeet-tdt_ctc-110m` f16). Built this branch's backend (CUDA, parakeet.cpp `8a7c482`) and drove the real `parakeet-cpp-grpc` backend with 16 concurrent clients issuing 96 `AudioTranscription` requests of a ~7s clip, varying `batch_max_size`.

All requests succeeded at every batch size, confirming that the path from the batcher through the batch C-API (`parakeet_capi_transcribe_pcm_batch_json`) to the batched decode is correct end to end. Throughput rises ~1.84x at `batch_max_size: 8` purely from the option, under concurrent load. This is below the decode-only microbench (~10-12x on the same GPU via `parakeet-cli bench-decode`) because the end-to-end path also pays for the encoder (compute-bound, no batching win), wav decode, and gRPC/JSON overhead per request. Encoder-only batching gave no end-to-end win; the decode batching is what turns concurrency into throughput.
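The load pattern used in the end-to-end run (a fixed number of concurrent clients draining a fixed number of requests) can be approximated with a harness like the one below. It is a sketch under assumptions: `runLoad` is a hypothetical helper, and the function passed to it is a stand-in that sleeps instead of issuing a real `AudioTranscription` gRPC call, so the printed rate says nothing about the backend.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLoad issues total calls to fn from clients concurrent workers and
// returns how many calls succeeded and the wall-clock time taken.
func runLoad(clients, total int, fn func() error) (int64, time.Duration) {
	// Pre-fill a closed channel so workers drain exactly total jobs.
	jobs := make(chan struct{}, total)
	for i := 0; i < total; i++ {
		jobs <- struct{}{}
	}
	close(jobs)

	var succeeded int64
	var wg sync.WaitGroup
	start := time.Now()
	for c := 0; c < clients; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				if fn() == nil {
					atomic.AddInt64(&succeeded, 1)
				}
			}
		}()
	}
	wg.Wait()
	return succeeded, time.Since(start)
}

func main() {
	// Stand-in for one transcription request; a real harness would call the
	// backend here instead of sleeping.
	fakeRequest := func() error {
		time.Sleep(5 * time.Millisecond)
		return nil
	}
	ok, elapsed := runLoad(16, 96, fakeRequest)
	fmt.Printf("%d/96 succeeded in %v (%.1f req/s)\n",
		ok, elapsed, float64(ok)/elapsed.Seconds())
}
```

Running the same harness once per `batch_max_size` value and comparing the requests-per-second figures gives the kind of relative throughput comparison reported above.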
Assisted-by: Claude:claude-opus-4-8 [Claude Code]