ML Architecture — Iteration 01

Automated Guitar Scale
Performance Feedback

// A CNN + LSTM pipeline for real-time note-level diagnosis and holistic performance analysis

STUDENT Marcello de la Maz
SCOPE C Major Scale (generalizes to all major scales)
STACK PyTorch · librosa · Streamlit
TIMELINE 5 Weeks
FULL PIPELINE OVERVIEW
01 —
Audio Input
Raw .wav recording of C major scale
librosa
02 —
Mel Spectrogram
2D time-frequency representation
preprocess.py
03 —
Segmentation
8 note patches via onset detection
CNN
04 —
Classification
Timing · Tuning · Timbre per note
3× CNN
05 —
Holistic Feedback
Pattern analysis across full scale
LSTM
06 —
App Output
Actionable diagnosis via Streamlit
app.py
01 / 04
AUDIO REPRESENTATION

Mel Spectrogram

Raw guitar audio is converted into a Mel spectrogram — a 2D image where the X-axis represents time, the Y-axis represents frequency (warped to human pitch perception), and pixel brightness represents energy.


This single representation encodes all three performance features simultaneously: timing (horizontal position of onsets), tuning (vertical frequency alignment), and timbre (spectral texture and harmonic clarity).


Library: librosa · Format: .npy array

DECISION
Mel Spectrogram over raw FFT or waveform — captures time + frequency in one representation, compatible with CNN input
Encodes timing, pitch, timbre in one structure
Industry standard for audio ML tasks
Pitch-relative training makes model scale-agnostic
Dataset: 150 C major recordings + pitch-shift augmentation
02 / 04
SEGMENTATION

Onset Detection

Each scale recording contains 8 consecutive notes. Segmentation detects where each note starts and ends by identifying sudden bursts of spectral energy — note onsets — in the spectrogram.


A two-phase approach is used: rule-based onset detection generates initial labels automatically, then a lightweight CNN onset detector is trained on those labels for robustness.


Output: 8 spectrogram patches — one per note — passed to the classifier.

DECISION
Two-phase: librosa rule-based for auto-labeling → CNN onset detector trained on those labels
Phase 1: librosa auto-generates 150 recordings worth of labels
Phase 2: CNN refines and improves on rule-based results
Recording requirements: metronome, quiet room, consistent equipment
03 / 04
PER-NOTE CLASSIFICATION

Three CNN Classifiers

Each of the 8 note patches is independently classified across three features using separate CNN models — one per feature. Each CNN takes a spectrogram patch as input and outputs a binary classification.


Keeping classifiers separate allows each CNN to specialize in its own visual pattern, enables independent debugging, and makes the system extensible.
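The three-classifier layout might look like the following PyTorch sketch. The layer sizes, patch dimensions, and class name are assumptions for illustration; only the overall shape (three independent binary CNNs, spectrogram patch in, single logit out) comes from the design above.

```python
import torch
import torch.nn as nn

class NoteClassifier(nn.Module):
    """One binary CNN per feature: spectrogram patch in, single logit out."""
    def __init__(self, n_mels=128, n_frames=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            # Two 2x poolings shrink each spatial dim by 4
            nn.Linear(32 * (n_mels // 4) * (n_frames // 4), 1),
        )

    def forward(self, x):  # x: (batch, 1, n_mels, n_frames)
        return self.head(self.features(x))

# Three independent instances, one per feature, debugged separately
classifiers = {name: NoteClassifier() for name in ("timing", "tuning", "timbre")}
```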


Auto-labeling: tuning via librosa.pyin (cents deviation), timing via onset offset from metronome grid, timbre via spectral noisiness measure.

CLASSIFIER 01 — TIMING
On time / Off time — onset position relative to metronome beat (threshold: ±50 ms)
CLASSIFIER 02 — TUNING
In tune / Out of tune — dominant frequency vs. target pitch (threshold: ±20 cents)
CLASSIFIER 03 — TIMBRE
Clear / Unclear — spectral noisiness, harmonic clarity, buzz detection
04 / 04
HOLISTIC FEEDBACK

LSTM Pattern Analysis

After classification, each note produces a feature vector [timing, tuning, timbre]. These 8 vectors are fed sequentially into an LSTM that maintains memory across the full scale — detecting patterns that are invisible at the note level.


Simple weighted scoring was explicitly rejected: two students with identical average scores can have completely different error patterns that require different interventions. The LSTM learns to distinguish them.


Dataset strategy: 150 recordings split across 5 intentional error categories for balanced LSTM training.

DECISION
LSTM over weighted scoring — sequences matter. Progressive errors, positional errors, and global errors require different feedback.
Category 1: Perfect / near-perfect — control group, what good looks like
Category 2: Timing consistent errors throughout — metronome practice needed
Category 3: Timing errors only in upper half — endurance / focus issue
Category 4: Global tuning issues — general intonation problem
Category 5: Positional tuning or timbre issues — specific finger placement or pressure
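The holistic stage described above can be sketched as a small PyTorch module: 8 per-note `[timing, tuning, timbre]` vectors go in sequentially, one of the 5 category logits comes out. The hidden size and class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ScaleFeedbackLSTM(nn.Module):
    """Reads the 8 per-note [timing, tuning, timbre] vectors in order and
    predicts one of the 5 holistic error categories."""
    def __init__(self, n_features=3, hidden=32, n_categories=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_categories)

    def forward(self, x):           # x: (batch, 8, 3)
        _, (h_n, _) = self.lstm(x)  # final hidden state summarizes the scale
        return self.head(h_n[-1])   # (batch, 5) category logits
```

Because the LSTM consumes the notes in order, it can separate, say, Category 3 (errors only in the upper half of the scale) from Category 2 (errors throughout), even when both produce the same average score.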