Tonara

Pure-Dart Mandarin Chinese tone detection (tones 1–4) from raw PCM audio. No native code, no FFI - only dart:typed_data and dart:math.

Pipeline

PCM audio
  -> VAD (RMS gate)
  -> pre-emphasis (y[n] = x[n] − 0.97·x[n−1])
  -> Hann framing (frame 1024, hop 256)
  -> pYIN pitch tracking (CMNDF + Beta(2,18) prior + Viterbi)
  -> voiced F0 contour (octave/spike/edge cleanup)
  -> semitone-relative contour (resample to N=20) + shape/duration features
  -> learned MLP classifier   (default - ~93.7% LOSO-CV, every tone ≥90%)
     or rule-based + KNN k=5  (opt-in, transparent - ~82%)
  -> optional DTW reference comparison
  -> ToneResult

Install

dependencies:
  tonara: ^0.1.0

Usage

import 'dart:typed_data';
import 'package:tonara/tonara.dart';

final analyzer = TonaraAnalyzer(sampleRate: 16000);

// Single syllable.
final ToneResult result = analyzer.analyze(samples); // Float32List in [-1, 1]
if (result.error == null) {
  print('Tone ${result.tone} (confidence ${result.confidence})');
  print(result.feedback);
}

// Compare against a reference recording.
final scored = analyzer.analyzeWithReference(
  samples,
  reference: nativeSpeakerSamples,
  expectedTone: 2,
);
print('similarity: ${scored.similarityScore}');

// Real-time streaming - one ToneFrame per detected syllable.
await for (final ToneFrame frame in analyzer.stream(micChunks)) {
  print('syllable ${frame.syllableIndex}: tone ${frame.result.tone}');
}

The seven features

Feature	Meaning
`linearSlope`	overall least-squares slope
`quadraticCoeff`	x² coefficient (positive ⇒ U-shape ⇒ tone 3)
`midpointDip`	midpoint minus endpoint mean (negative ⇒ dip)
`pitchRange`	max − min of the raw Hz contour
`startToMidSlope`	slope of the first half
`midToEndSlope`	slope of the second half
`normalizedVariance`	variance of the z-scored contour

Classification

Two classifiers are available; TonaraAnalyzer(useModel: ...) selects between them (default true):

Learned model (tone_model.dart) - a two-layer MLP (32 -> 48 -> 24 -> 4) over the semitone-relative contour plus shape/duration summary features. ~93.7% leave-one-speaker-out on a corpus of 2500+ labeled Mandarin clips, every tone ≥90%. Default.
Rule-based + KNN (classify) - a transparent decision tree on 7 shape features with a k = 5 KNN fallback. ~82%. Use it when you want interpretable decisions or no embedded weights.

Rule-based decision tree (`useModel: false`)

Slope/curvature features are measured on a normalized [-1, 1] x-axis, so the thresholds are independent of the contour length. The cut points were tuned against the training corpus (see below).

pitchRange < 5 -> tone 0 (neutral / unvoiced)
pitchRange < 22 || normalizedVariance < 0.08 -> tone 1 (level - a level tone has the least movement; pitchRange in Hz is its only robust cue, since z-scoring inflates a flat contour's slope)
startToMidSlope < −0.2 && midToEndSlope > 0.4 && quadraticCoeff > 0.4 -> tone 3 (dip: does not rise in the first half, then rises)
linearSlope < −0.4 -> tone 4 (falling)
linearSlope > 1.0 && startToMidSlope > −0.1 -> tone 2 (rising throughout)
otherwise -> KNN (k = 5) over 40 hand-tuned prototypes

These differ from a naive reading of the original design in ways the data forced: (a) the x-axis is normalized so the linearSlope/quadraticCoeff thresholds are reachable at all; (b) tone 3 is separated from tone 2 by the first-half slope (a citation third tone also ends higher than it starts, so overall slope can't tell them apart); (c) a small pitch range means level tone 1, not tone 0.

Validation on real audio

The learned model was trained and validated on a corpus of 2500+ labeled single-syllable Mandarin recordings (multiple native speakers; the tone and speaker are encoded in each filename). The audio itself is not distributed - only the trained weights ship, in lib/src/tone_model.dart. Drop your own labeled .wav clips into audio/train/ to retrain:

dart run tool/train_model.dart   # prints LOSO-CV, regenerates tone_model.dart

train_model.dart reports honest accuracy via leave-one-speaker-out cross-validation (each speaker is classified by a model trained only on the others), then ships weights trained on every speaker.

Learned model - 93.7% LOSO-CV with every tone above 90%:

	t1	t2	t3	t4
accuracy	98%	90%	91%	96%

The model is a two-hidden-layer MLP (32 -> 48 -> 24 -> 4) over the semitone-relative contour plus shape/duration summary features.

Tones 2 and 3 are the hard pair. Tone-3 citation recordings include both full dipping (˅) and reduced realizations - a low fall (no final rise, looks like tone 4) or a low rise (no initial fall, looks like tone 2). These "half-third-tones" are acoustically ambiguous from F0 alone, so the raw model makes confident errors on the fuzzy tone-2/3 boundary that no amount of extra features, network depth, or loss weighting could fix (all plateaued tone 3 at ~88%). Because tones 1 and 4 carry large margins (98% / 96%), the classifier applies a per-class decision bias (decisionBias in train_model.dart) that favours tones 2 and 3 at the boundary, pulling slack from tones 1/4 so all four clear 90%. This is a deliberate balance choice, not a raw accuracy gain; overall sits at ~93.7%.

The rule-based fallback (useModel: false) reaches ~82%. Its main confusions come from z-score normalization erasing the level-tone flatness cue. The learned model avoids this by classifying the semitone-relative contour, which preserves both shape and the small magnitude of a level tone.

Pitch & preprocessing notes

Pre-emphasis is off by default (applyPreEmphasis: false). It is a high-pass that attenuates the fundamental and roughly halves voiced-frame detection, so the pitch path runs on the clean signal. Enable it only for spectral experiments.
The raw F0 contour is cleaned before feature extraction: octave-error repair, a 3-point median filter, and a one-frame edge trim (refineF0).
Real recordings vary widely in level; peak-normalise input before analyze (the harness does this) so the fixed RMS gate behaves consistently.

Development

dart pub get
dart analyze
dart test
dart run example/main.dart

Notes & limitations

The KNN prototypes in lib/src/reference_data.dart are hand-tuned from the phonetics literature, not trained on a corpus; classification is heuristic.
pYIN frequency resolution is sharpened by parabolic interpolation of the CMNDF minimum (~ sub-Hertz on a clean tone).
Pre-emphasis is applied in the full pipeline; the single-frame pyinFrame entry point operates on whatever frame you pass it.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
example		example
lib		lib
test		test
tool		tool
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
analysis_options.yaml		analysis_options.yaml
pubspec.yaml		pubspec.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tonara

Pipeline

Install

Usage

The seven features

Classification

Rule-based decision tree (`useModel: false`)

Validation on real audio

Pitch & preprocessing notes

Development

Notes & limitations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Tonara

Pipeline

Install

Usage

The seven features

Classification

Rule-based decision tree (useModel: false)

Validation on real audio

Pitch & preprocessing notes

Development

Notes & limitations

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Rule-based decision tree (`useModel: false`)

Packages