Pure-Dart Mandarin Chinese tone detection (tones 1–4) from raw PCM audio.
No native code, no FFI - only dart:typed_data and dart:math.

    PCM audio
      -> VAD (RMS gate)
      -> pre-emphasis (optional, off by default: y[n] = x[n] − 0.97·x[n−1])
      -> Hann framing (frame 1024, hop 256)
      -> pYIN pitch tracking (CMNDF + Beta(2,18) prior + Viterbi)
      -> voiced F0 contour (octave/spike/edge cleanup)
      -> semitone-relative contour (resample to N=20) + shape/duration features
      -> learned MLP classifier (default - ~93.7% LOSO-CV, every tone ≥90%)
         or rule-based + KNN k=5 (opt-in, transparent - ~82%)
      -> optional DTW reference comparison
      -> ToneResult

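The first stages of the pipeline above can be sketched in a few lines of plain Dart. This is only an illustration of the stated formulas (RMS level, first-order pre-emphasis with coefficient 0.97, Hann windows of 1024 samples with a hop of 256). The function names `rms`, `preEmphasis` and `hannFrames` are hypothetical and are not taken from the package's API.

```dart
import 'dart:math' as math;
import 'dart:typed_data';

/// Root-mean-square level of a buffer. An energy gate would compare this
/// against a fixed threshold chosen by the caller.
double rms(Float32List x) {
  if (x.isEmpty) return 0.0;
  var sum = 0.0;
  for (final v in x) {
    sum += v * v;
  }
  return math.sqrt(sum / x.length);
}

/// First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1], with y[0] = x[0].
Float32List preEmphasis(Float32List x, {double alpha = 0.97}) {
  final y = Float32List(x.length);
  if (x.isEmpty) return y;
  y[0] = x[0];
  for (var n = 1; n < x.length; n++) {
    y[n] = x[n] - alpha * x[n - 1];
  }
  return y;
}

/// Splits [x] into overlapping frames and applies a Hann window to each.
/// Trailing samples that do not fill a whole frame are dropped.
List<Float32List> hannFrames(Float32List x, {int frame = 1024, int hop = 256}) {
  final window = Float32List(frame);
  for (var i = 0; i < frame; i++) {
    window[i] = 0.5 - 0.5 * math.cos(2 * math.pi * i / (frame - 1));
  }
  final frames = <Float32List>[];
  for (var start = 0; start + frame <= x.length; start += hop) {
    final f = Float32List(frame);
    for (var i = 0; i < frame; i++) {
      f[i] = x[start + i] * window[i];
    }
    frames.add(f);
  }
  return frames;
}
```

With these defaults, a 2048-sample buffer yields five frames, since frame starts fall at 0, 256, 512, 768 and 1024.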
Add the package to your `pubspec.yaml`:

    dependencies:
      tonara: ^0.1.0

Then import it and run the analyzer:

    import 'dart:typed_data';
    import 'package:tonara/tonara.dart';

    final analyzer = TonaraAnalyzer(sampleRate: 16000);

    // Single syllable.
    final ToneResult result = analyzer.analyze(samples); // Float32List in [-1, 1]
    if (result.error == null) {
      print('Tone ${result.tone} (confidence ${result.confidence})');
      print(result.feedback);
    }

    // Compare against a reference recording.
    final scored = analyzer.analyzeWithReference(
      samples,
      reference: nativeSpeakerSamples,
      expectedTone: 2,
    );
    print('similarity: ${scored.similarityScore}');

    // Real-time streaming - one ToneFrame per detected syllable.
    await for (final ToneFrame frame in analyzer.stream(micChunks)) {
      print('syllable ${frame.syllableIndex}: tone ${frame.result.tone}');
    }

| Feature | Meaning |
|---|---|
| `linearSlope` | overall least-squares slope |
| `quadraticCoeff` | x² coefficient (positive ⇒ U-shape ⇒ tone 3) |
| `midpointDip` | midpoint minus endpoint mean (negative ⇒ dip) |
| `pitchRange` | max − min of the raw Hz contour |
| `startToMidSlope` | slope of the first half |
| `midToEndSlope` | slope of the second half |
| `normalizedVariance` | variance of the z-scored contour |

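The two fit-based features in the table can be illustrated with a small least-squares sketch. It assumes an evenly spaced contour whose sample index is mapped onto x in [-1, 1], which is what keeps the coefficients comparable across contour lengths. `ShapeFit` and `fitShape` are hypothetical names, and the package's own extraction code may differ in detail.

```dart
/// Result of a quadratic least-squares fit y ≈ a + slope·x + quad·x².
class ShapeFit {
  const ShapeFit(this.slope, this.quad);
  final double slope;
  final double quad;
}

/// Fits an evenly spaced contour on a normalized x-axis in [-1, 1].
/// Because the grid is symmetric about 0, x and (x² - mean(x²)) are
/// orthogonal, so each coefficient is a simple ratio of sums.
ShapeFit fitShape(List<double> y) {
  final n = y.length;
  if (n < 3) {
    throw ArgumentError('need at least 3 points, got $n');
  }
  final x = List<double>.generate(n, (i) => -1.0 + 2.0 * i / (n - 1));

  // Mean of x², used to centre the quadratic basis function.
  var m2 = 0.0;
  for (final v in x) {
    m2 += v * v;
  }
  m2 /= n;

  var sxy = 0.0, sxx = 0.0, sqy = 0.0, sqq = 0.0;
  for (var i = 0; i < n; i++) {
    final q = x[i] * x[i] - m2;
    sxy += x[i] * y[i];
    sxx += x[i] * x[i];
    sqy += q * y[i];
    sqq += q * q;
  }
  return ShapeFit(sxy / sxx, sqy / sqq);
}
```

Sampling the same curve at 5 or at 20 points gives the same `slope` and `quad`, so a threshold such as `quadraticCoeff > 0.4` does not need rescaling with contour length.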
Two classifiers are available; TonaraAnalyzer(useModel: ...) selects between
them (default true):
- Learned model (`tone_model.dart`) - a two-hidden-layer MLP (32 -> 48 -> 24 -> 4) over the semitone-relative contour plus shape/duration summary features. ~93.7% leave-one-speaker-out on a corpus of 2500+ labeled Mandarin clips, every tone ≥90%. Default.
- Rule-based + KNN (`classify`) - a transparent decision tree on 7 shape features with a k = 5 KNN fallback. ~82%. Use it when you want interpretable decisions or no embedded weights.

Slope/curvature features are measured on a normalized [-1, 1] x-axis, so the thresholds are independent of the contour length. The cut points were tuned against the training corpus (see below).

- `pitchRange < 5` -> tone 0 (neutral / unvoiced)
- `pitchRange < 22 || normalizedVariance < 0.08` -> tone 1 (level - a level tone has the least movement; `pitchRange` in Hz is its only robust cue, since z-scoring inflates a flat contour's slope)
- `startToMidSlope < -0.2 && midToEndSlope > 0.4 && quadraticCoeff > 0.4` -> tone 3 (dip: does not rise in the first half, then rises)
- `linearSlope < -0.4` -> tone 4 (falling)
- `linearSlope > 1.0 && startToMidSlope > -0.1` -> tone 2 (rising throughout)
- otherwise -> KNN (k = 5) over 40 hand-tuned prototypes

These differ from a naive reading of the original design in ways the data forced: (a) the x-axis is normalized so the `linearSlope`/`quadraticCoeff` thresholds are reachable at all; (b) tone 3 is separated from tone 2 by the first-half slope (a citation third tone also ends higher than it starts, so overall slope can't tell them apart); (c) a small pitch range means level tone 1, not tone 0.
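Read top to bottom, the rule cascade translates almost directly into code. The sketch below copies the thresholds from the list above. The function name `ruleTone` and the convention of returning `null` to mean "defer to the KNN fallback" are assumptions made for illustration, not the package's actual signature.

```dart
/// Applies the rule cascade in order and returns the first matching tone,
/// or null when no rule fires and the KNN fallback should decide.
int? ruleTone({
  required double pitchRange, // Hz, max - min of the raw contour
  required double normalizedVariance,
  required double linearSlope,
  required double quadraticCoeff,
  required double startToMidSlope,
  required double midToEndSlope,
}) {
  // Almost no pitch movement at all: neutral / unvoiced.
  if (pitchRange < 5) return 0;
  // Small movement: level tone.
  if (pitchRange < 22 || normalizedVariance < 0.08) return 1;
  // Falls in the first half, rises in the second, U-shaped overall: dip.
  if (startToMidSlope < -0.2 && midToEndSlope > 0.4 && quadraticCoeff > 0.4) {
    return 3;
  }
  // Clear overall fall.
  if (linearSlope < -0.4) return 4;
  // Clear overall rise with no initial fall.
  if (linearSlope > 1.0 && startToMidSlope > -0.1) return 2;
  // Ambiguous shape: hand over to the nearest-prototype vote.
  return null;
}
```

Order matters here: the tone-3 test runs before the tone-2 test because a dipping contour can also have a positive overall slope.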
The learned model was trained and validated on a corpus of 2500+ labeled
single-syllable Mandarin recordings (multiple native speakers; the tone and
speaker are encoded in each filename). The audio itself is not distributed -
only the trained weights ship, in lib/src/tone_model.dart. Drop your own
labeled .wav clips into audio/train/ to retrain:

    dart run tool/train_model.dart   # prints LOSO-CV, regenerates tone_model.dart

`train_model.dart` reports honest accuracy via leave-one-speaker-out
cross-validation (each speaker is classified by a model trained only on the
others), then ships weights trained on every speaker.
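For readers unfamiliar with the protocol, leave-one-speaker-out cross-validation can be sketched generically as below. `Clip`, `losoAccuracy`, and the `train`/`predict` callbacks are hypothetical names, and this is not the code in `tool/train_model.dart`. It pools correct predictions over all clips, which is one reasonable way to report a single figure.

```dart
/// A labeled clip: who spoke it, which tone it is, and its feature vector.
class Clip {
  const Clip(this.speaker, this.tone, this.features);
  final String speaker;
  final int tone;
  final List<double> features;
}

/// Leave-one-speaker-out accuracy. For each speaker, a model is trained on
/// every other speaker's clips and evaluated on the held-out speaker only.
double losoAccuracy<M>(
  List<Clip> clips,
  M Function(List<Clip> trainSet) train,
  int Function(M model, Clip clip) predict,
) {
  if (clips.isEmpty) return 0.0;
  final speakers = clips.map((c) => c.speaker).toSet();
  var correct = 0;
  for (final held in speakers) {
    final trainSet = clips.where((c) => c.speaker != held).toList();
    final model = train(trainSet);
    for (final c in clips.where((c) => c.speaker == held)) {
      if (predict(model, c) == c.tone) correct++;
    }
  }
  return correct / clips.length;
}
```

The key property is that no clip from the held-out speaker ever reaches `train`, so the score reflects performance on an unseen voice rather than on unseen clips from a familiar one.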
Learned model - 93.7% LOSO-CV with every tone above 90%:

| | t1 | t2 | t3 | t4 |
|---|---|---|---|---|
| accuracy | 98% | 90% | 91% | 96% |

The model is a two-hidden-layer MLP (32 -> 48 -> 24 -> 4) over the semitone-relative contour plus shape/duration summary features.
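A forward pass through a network of this shape needs nothing beyond `dart:math`. The sketch below is an assumption-laden illustration: the README does not state the activation functions, so ReLU hidden layers and a softmax output are assumed here, and `dense`, `relu`, `softmax` and `forward` are hypothetical names rather than the contents of `tone_model.dart`.

```dart
import 'dart:math' as math;

/// One fully connected layer; [w] holds one row of weights per output unit.
List<double> dense(List<double> x, List<List<double>> w, List<double> b) {
  return List<double>.generate(w.length, (j) {
    var s = b[j];
    for (var i = 0; i < x.length; i++) {
      s += w[j][i] * x[i];
    }
    return s;
  });
}

/// Assumed hidden activation (not confirmed by the source).
List<double> relu(List<double> v) => [for (final a in v) a > 0 ? a : 0.0];

/// Numerically stable softmax: subtract the maximum before exponentiating.
List<double> softmax(List<double> z) {
  final m = z.reduce(math.max);
  final e = [for (final v in z) math.exp(v - m)];
  final s = e.reduce((a, b) => a + b);
  return [for (final v in e) v / s];
}

/// Runs the input through every layer: ReLU after hidden layers,
/// softmax after the last one. For 32 -> 48 -> 24 -> 4, [weights] has
/// three entries of shapes 48x32, 24x48 and 4x24.
List<double> forward(
  List<double> x,
  List<List<List<double>>> weights,
  List<List<double>> biases,
) {
  var h = x;
  for (var l = 0; l < weights.length; l++) {
    h = dense(h, weights[l], biases[l]);
    if (l < weights.length - 1) h = relu(h);
  }
  return softmax(h);
}
```

The output is a four-element probability vector, one entry per tone, with index 0 corresponding to tone 1.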
Tones 2 and 3 are the hard pair. Tone-3 citation recordings include both full
dipping (˅) and reduced realizations - a low fall (no final rise, looks like
tone 4) or a low rise (no initial fall, looks like tone 2). These
"half-third-tones" are acoustically ambiguous from F0 alone, so the raw model
makes confident errors on the fuzzy tone-2/3 boundary that no amount of extra
features, network depth, or loss weighting could fix (all plateaued tone 3 at
~88%). Because tones 1 and 4 carry large margins (98% / 96%), the classifier
applies a per-class decision bias (decisionBias in train_model.dart)
that favours tones 2 and 3 at the boundary, pulling slack from tones 1/4 so all
four clear 90%. This is a deliberate balance choice, not a raw accuracy gain;
overall sits at ~93.7%.
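One common way to apply a per-class decision bias is to add a fixed offset to each class score before taking the argmax. Whether `decisionBias` is applied to logits, log-probabilities, or in some other way is not stated here, so the sketch below is only a guess at the general mechanism. The name `biasedArgmax` and the bias values in the usage note are illustrative, not the shipped ones.

```dart
/// Returns the index of the largest (score + bias). With an all-zero bias
/// this is a plain argmax; a positive bias on a class lets it win close
/// calls it would otherwise narrowly lose.
int biasedArgmax(List<double> scores, List<double> bias) {
  if (scores.isEmpty || scores.length != bias.length) {
    throw ArgumentError('scores and bias must be non-empty and equal length');
  }
  var best = 0;
  var bestValue = scores[0] + bias[0];
  for (var i = 1; i < scores.length; i++) {
    final v = scores[i] + bias[i];
    if (v > bestValue) {
      bestValue = v;
      best = i;
    }
  }
  return best;
}
```

For example, with scores `[0.0, -0.1, -0.2, -3.0]` a zero bias picks index 0 (tone 1), while a bias of `[0.0, 0.2, 0.0, 0.0]` flips the decision to index 1 (tone 2). Confident predictions, where one score leads by more than the bias, are unaffected.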
The rule-based fallback (useModel: false) reaches ~82%. Its main
confusions come from z-score normalization erasing the level-tone flatness cue.
The learned model avoids this by classifying the semitone-relative contour,
which preserves both shape and the small magnitude of a level tone.
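A semitone-relative contour can be sketched as follows. The reference pitch is an assumption: this version subtracts the contour's own mean in the log domain, and the package may use a different reference. `semitoneContour` is a hypothetical name. Resampling to a fixed length uses plain linear interpolation.

```dart
import 'dart:math' as math;

/// Converts a voiced F0 contour in Hz to semitones relative to its own
/// log-domain mean, then linearly resamples it to [n] points.
List<double> semitoneContour(List<double> f0Hz, {int n = 20}) {
  if (f0Hz.isEmpty || n < 2) {
    throw ArgumentError('need a non-empty contour and n >= 2');
  }
  // 12 * log2(f): one unit per semitone.
  final st = [for (final f in f0Hz) 12 * math.log(f) / math.ln2];
  final mean = st.reduce((a, b) => a + b) / st.length;
  final rel = [for (final s in st) s - mean];
  if (rel.length == 1) return List<double>.filled(n, 0.0);

  return List<double>.generate(n, (i) {
    // Fractional position of output point i in the input contour.
    final pos = i * (rel.length - 1) / (n - 1);
    final lo = pos.floor();
    final hi = math.min(lo + 1, rel.length - 1);
    final t = pos - lo;
    return rel[lo] * (1 - t) + rel[hi] * t;
  });
}
```

Unlike z-scoring, this keeps magnitude: a contour drifting from 200 Hz to 210 Hz spans under one semitone and stays visibly flat, whereas a full octave rise from 100 Hz to 200 Hz spans twelve.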
- Pre-emphasis is off by default (`applyPreEmphasis: false`). It is a high-pass that attenuates the fundamental and roughly halves voiced-frame detection, so the pitch path runs on the clean signal. Enable it only for spectral experiments.
- The raw F0 contour is cleaned before feature extraction: octave-error repair, a 3-point median filter, and a one-frame edge trim (`refineF0`).
- Real recordings vary widely in level; peak-normalise input before `analyze` (the harness does this) so the fixed RMS gate behaves consistently.
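Two of the steps in these notes are simple enough to sketch directly: peak normalisation of the input and a 3-point median filter over an F0 contour. `peakNormalize` and `median3` are hypothetical helpers written for illustration. They are not the package's `refineF0`, which also performs octave repair and edge trimming.

```dart
import 'dart:math' as math;
import 'dart:typed_data';

/// Scales [x] so its largest absolute sample equals [target].
/// An all-zero buffer is returned unchanged (as a copy).
Float32List peakNormalize(Float32List x, {double target = 1.0}) {
  var peak = 0.0;
  for (final v in x) {
    final a = v.abs();
    if (a > peak) peak = a;
  }
  if (peak == 0.0) return Float32List.fromList(x);
  final gain = target / peak;
  return Float32List.fromList([for (final v in x) v * gain]);
}

/// 3-point median filter. Each interior value is replaced by the median of
/// itself and its two neighbours in the original contour; the endpoints
/// are left as they are.
List<double> median3(List<double> f0) {
  final out = List<double>.of(f0);
  for (var i = 1; i + 1 < f0.length; i++) {
    final a = f0[i - 1], b = f0[i], c = f0[i + 1];
    out[i] = math.max(math.min(a, b), math.min(math.max(a, b), c));
  }
  return out;
}
```

A single-frame spike such as `[100, 300, 102, 104]` becomes `[100, 102, 104, 104]`, which removes the outlier without smoothing the rest of the contour much.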
To fetch dependencies, run the static checks and tests, and try the example:

    dart pub get
    dart analyze
    dart test
    dart run example/main.dart

- The KNN prototypes in `lib/src/reference_data.dart` are hand-tuned from the phonetics literature, not trained on a corpus; classification is heuristic.
- pYIN frequency resolution is sharpened by parabolic interpolation of the CMNDF minimum (~ sub-Hertz on a clean tone).
- Pre-emphasis, when enabled, is applied only in the full pipeline; the single-frame `pyinFrame` entry point operates on whatever frame you pass it.
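The parabolic interpolation mentioned above fits a parabola through the difference-function minimum and its two neighbours, then takes the vertex as the refined lag. The sketch below shows the standard three-point formula. `refineLag` is a hypothetical name and the surrounding pYIN logic is omitted.

```dart
/// Refines an integer lag [tau] at a local minimum of [d] by fitting a
/// parabola through d[tau-1], d[tau], d[tau+1] and returning its vertex.
/// Falls back to the integer lag at the array edges or when the three
/// points do not curve upward.
double refineLag(List<double> d, int tau) {
  if (tau <= 0 || tau >= d.length - 1) return tau.toDouble();
  final a = d[tau - 1], b = d[tau], c = d[tau + 1];
  final curvature = a - 2 * b + c;
  if (curvature <= 0) return tau.toDouble();
  return tau + 0.5 * (a - c) / curvature;
}
```

The fundamental is then `sampleRate / refinedLag`. At 16 kHz, the step from lag 80 to lag 81 alone is about 2.5 Hz near 200 Hz, which is why sub-sample refinement is needed to reach sub-Hertz resolution.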