Skip to content

dima-xd/tonara

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Tonara

Pure-Dart Mandarin Chinese tone detection (tones 1–4) from raw PCM audio. No native code, no FFI - only dart:typed_data and dart:math.

Pipeline

PCM audio
  -> VAD (RMS gate)
  -> pre-emphasis (y[n] = x[n] − 0.97·x[n−1])
  -> Hann framing (frame 1024, hop 256)
  -> pYIN pitch tracking (CMNDF + Beta(2,18) prior + Viterbi)
  -> voiced F0 contour (octave/spike/edge cleanup)
  -> semitone-relative contour (resample to N=20) + shape/duration features
  -> learned MLP classifier   (default - ~93.7% LOSO-CV, every tone ≥90%)
     or rule-based + KNN k=5  (opt-in, transparent - ~82%)
  -> optional DTW reference comparison
  -> ToneResult

Install

dependencies:
  tonara: ^0.1.0

Usage

import 'dart:typed_data';
import 'package:tonara/tonara.dart';

final analyzer = TonaraAnalyzer(sampleRate: 16000);

// Single syllable.
final ToneResult result = analyzer.analyze(samples); // Float32List in [-1, 1]
if (result.error == null) {
  print('Tone ${result.tone} (confidence ${result.confidence})');
  print(result.feedback);
}

// Compare against a reference recording.
final scored = analyzer.analyzeWithReference(
  samples,
  reference: nativeSpeakerSamples,
  expectedTone: 2,
);
print('similarity: ${scored.similarityScore}');

// Real-time streaming - one ToneFrame per detected syllable.
await for (final ToneFrame frame in analyzer.stream(micChunks)) {
  print('syllable ${frame.syllableIndex}: tone ${frame.result.tone}');
}

The seven features

Feature Meaning
linearSlope overall least-squares slope
quadraticCoeff x² coefficient (positive ⇒ U-shape ⇒ tone 3)
midpointDip midpoint minus endpoint mean (negative ⇒ dip)
pitchRange max − min of the raw Hz contour
startToMidSlope slope of the first half
midToEndSlope slope of the second half
normalizedVariance variance of the z-scored contour

Classification

Two classifiers are available; TonaraAnalyzer(useModel: ...) selects between them (default true):

  • Learned model (tone_model.dart) - a two-layer MLP (32 -> 48 -> 24 -> 4) over the semitone-relative contour plus shape/duration summary features. ~93.7% leave-one-speaker-out on a corpus of 2500+ labeled Mandarin clips, every tone ≥90%. Default.
  • Rule-based + KNN (classify) - a transparent decision tree on 7 shape features with a k = 5 KNN fallback. ~82%. Use it when you want interpretable decisions or no embedded weights.

Rule-based decision tree (useModel: false)

Slope/curvature features are measured on a normalized [-1, 1] x-axis, so the thresholds are independent of the contour length. The cut points were tuned against the training corpus (see below).

  1. pitchRange < 5 -> tone 0 (neutral / unvoiced)
  2. pitchRange < 22 || normalizedVariance < 0.08 -> tone 1 (level - a level tone has the least movement; pitchRange in Hz is its only robust cue, since z-scoring inflates a flat contour's slope)
  3. startToMidSlope < −0.2 && midToEndSlope > 0.4 && quadraticCoeff > 0.4 -> tone 3 (dip: does not rise in the first half, then rises)
  4. linearSlope < −0.4 -> tone 4 (falling)
  5. linearSlope > 1.0 && startToMidSlope > −0.1 -> tone 2 (rising throughout)
  6. otherwise -> KNN (k = 5) over 40 hand-tuned prototypes

These differ from a naive reading of the original design in ways the data forced: (a) the x-axis is normalized so the linearSlope/quadraticCoeff thresholds are reachable at all; (b) tone 3 is separated from tone 2 by the first-half slope (a citation third tone also ends higher than it starts, so overall slope can't tell them apart); (c) a small pitch range means level tone 1, not tone 0.

Validation on real audio

The learned model was trained and validated on a corpus of 2500+ labeled single-syllable Mandarin recordings (multiple native speakers; the tone and speaker are encoded in each filename). The audio itself is not distributed - only the trained weights ship, in lib/src/tone_model.dart. Drop your own labeled .wav clips into audio/train/ to retrain:

dart run tool/train_model.dart   # prints LOSO-CV, regenerates tone_model.dart

train_model.dart reports honest accuracy via leave-one-speaker-out cross-validation (each speaker is classified by a model trained only on the others), then ships weights trained on every speaker.

Learned model - 93.7% LOSO-CV with every tone above 90%:

t1 t2 t3 t4
accuracy 98% 90% 91% 96%

The model is a two-hidden-layer MLP (32 -> 48 -> 24 -> 4) over the semitone-relative contour plus shape/duration summary features.

Tones 2 and 3 are the hard pair. Tone-3 citation recordings include both full dipping (˅) and reduced realizations - a low fall (no final rise, looks like tone 4) or a low rise (no initial fall, looks like tone 2). These "half-third-tones" are acoustically ambiguous from F0 alone, so the raw model makes confident errors on the fuzzy tone-2/3 boundary that no amount of extra features, network depth, or loss weighting could fix (all plateaued tone 3 at ~88%). Because tones 1 and 4 carry large margins (98% / 96%), the classifier applies a per-class decision bias (decisionBias in train_model.dart) that favours tones 2 and 3 at the boundary, pulling slack from tones 1/4 so all four clear 90%. This is a deliberate balance choice, not a raw accuracy gain; overall sits at ~93.7%.

The rule-based fallback (useModel: false) reaches ~82%. Its main confusions come from z-score normalization erasing the level-tone flatness cue. The learned model avoids this by classifying the semitone-relative contour, which preserves both shape and the small magnitude of a level tone.

Pitch & preprocessing notes

  • Pre-emphasis is off by default (applyPreEmphasis: false). It is a high-pass that attenuates the fundamental and roughly halves voiced-frame detection, so the pitch path runs on the clean signal. Enable it only for spectral experiments.
  • The raw F0 contour is cleaned before feature extraction: octave-error repair, a 3-point median filter, and a one-frame edge trim (refineF0).
  • Real recordings vary widely in level; peak-normalise input before analyze (the harness does this) so the fixed RMS gate behaves consistently.

Development

dart pub get
dart analyze
dart test
dart run example/main.dart

Notes & limitations

  • The KNN prototypes in lib/src/reference_data.dart are hand-tuned from the phonetics literature, not trained on a corpus; classification is heuristic.
  • pYIN frequency resolution is sharpened by parabolic interpolation of the CMNDF minimum (~ sub-Hertz on a clean tone).
  • Pre-emphasis is applied in the full pipeline; the single-frame pyinFrame entry point operates on whatever frame you pass it.

About

Pure-Dart Mandarin tone detection (tones 1–4) from raw PCM - pYIN pitch tracking + an MLP classifier hitting ~93.7% accuracy. No native code, no FFI.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages