ML Architecture — Iteration 01

Automated Guitar Scale
Performance Feedback

// A CNN + LSTM pipeline for real-time note-level diagnosis and holistic performance analysis

STUDENT Marcello de la Maz
SCOPE C Major Scale (generalizes to all major scales)
STACK PyTorch · librosa · Streamlit
TIMELINE 5 Weeks
FULL PIPELINE OVERVIEW
01 —
Audio Input
Raw .wav recording of C major scale
librosa
02 —
Mel Spectrogram
2D time-frequency representation
preprocess.py
03 —
Segmentation
8 note patches via onset detection
CNN
04 —
Classification
Timing · Tuning · Timbre per note
3× CNN
05 —
Holistic Feedback
Pattern analysis across full scale
LSTM
06 —
App Output
Actionable diagnosis via Streamlit
app.py
01 / 04
AUDIO REPRESENTATION

Mel Spectrogram

Raw guitar audio is converted into a Mel spectrogram — a 2D image where the X-axis represents time, the Y-axis represents frequency (warped to human pitch perception), and pixel brightness represents energy.


This single representation encodes all three performance features simultaneously: timing (horizontal position of onsets), tuning (vertical frequency alignment), and timbre (spectral texture and harmonic clarity).


Library: librosa · Format: .npy array

DECISION
Mel Spectrogram over raw FFT or waveform — captures time + frequency in one representation, compatible with CNN input
Encodes timing, pitch, timbre in one structure
Industry standard for audio ML tasks
Pitch-relative training makes model scale-agnostic
Dataset: 150 C major recordings + pitch-shift augmentation
02 / 04
SEGMENTATION

Onset Detection

Each scale recording contains 8 consecutive notes. Segmentation detects where each note starts and ends by identifying sudden bursts of spectral energy — note onsets — in the spectrogram.


A two-phase approach is used: rule-based onset detection generates initial labels automatically, then a lightweight CNN onset detector is trained on those labels for robustness.


Output: 8 spectrogram patches — one per note — passed to the classifier.

DECISION
Two-phase: librosa rule-based for auto-labeling → CNN onset detector trained on those labels
Phase 1: librosa auto-generates 150 recordings worth of labels
Phase 2: CNN refines and improves on rule-based results
Recording requirements: metronome, quiet room, consistent equipment
03 / 04
PER-NOTE CLASSIFICATION

Three CNN Classifiers

Each of the 8 note patches is independently classified across three features using separate CNN models — one per feature. Each CNN takes a spectrogram patch as input and outputs a binary classification.


Keeping classifiers separate allows each CNN to specialize in its own visual pattern, enables independent debugging, and makes the system extensible.
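The three-classifier layout might look like the following PyTorch sketch. The layer sizes, patch dimensions, and class name are assumptions for illustration; only the overall shape (three independent binary CNNs, spectrogram patch in, single logit out) comes from the design above.

```python
import torch
import torch.nn as nn

class NoteClassifier(nn.Module):
    """One binary CNN per feature: spectrogram patch in, single logit out."""
    def __init__(self, n_mels=128, n_frames=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            # Two 2x poolings shrink each spatial dim by 4
            nn.Linear(32 * (n_mels // 4) * (n_frames // 4), 1),
        )

    def forward(self, x):  # x: (batch, 1, n_mels, n_frames)
        return self.head(self.features(x))

# Three independent instances, one per feature, debugged separately
classifiers = {name: NoteClassifier() for name in ("timing", "tuning", "timbre")}
```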


Auto-labeling: tuning via librosa.pyin (cents deviation), timing via onset offset from metronome grid, timbre via spectral noisiness measure.

CLASSIFIER 01 — TIMING
On time / Off time — onset position relative to metronome beat (threshold: ±50 ms)
CLASSIFIER 02 — TUNING
In tune / Out of tune — dominant frequency vs. target pitch (threshold: ±20 cents)
CLASSIFIER 03 — TIMBRE
Clear / Unclear — spectral noisiness, harmonic clarity, buzz detection
04 / 04
HOLISTIC FEEDBACK

LSTM Pattern Analysis

After classification, each note produces a feature vector [timing, tuning, timbre]. These 8 vectors are fed sequentially into an LSTM that maintains memory across the full scale — detecting patterns that are invisible at the note level.


Simple weighted scoring was explicitly rejected: two students with identical average scores can have completely different error patterns that require different interventions. The LSTM learns to distinguish them.


Dataset strategy: 150 recordings split across 5 intentional error categories for balanced LSTM training.

DECISION
LSTM over weighted scoring — sequences matter. Progressive errors, positional errors, and global errors require different feedback.
Category 1: Perfect / near-perfect — control group, what good looks like
Category 2: Timing consistent errors throughout — metronome practice needed
Category 3: Timing errors only in upper half — endurance / focus issue
Category 4: Global tuning issues — general intonation problem
Category 5: Positional tuning or timbre issues — specific finger placement or pressure
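The holistic stage described above can be sketched as a small PyTorch module: 8 per-note `[timing, tuning, timbre]` vectors go in sequentially, one of the 5 category logits comes out. The hidden size and class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ScaleFeedbackLSTM(nn.Module):
    """Reads the 8 per-note [timing, tuning, timbre] vectors in order and
    predicts one of the 5 holistic error categories."""
    def __init__(self, n_features=3, hidden=32, n_categories=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_categories)

    def forward(self, x):           # x: (batch, 8, 3)
        _, (h_n, _) = self.lstm(x)  # final hidden state summarizes the scale
        return self.head(h_n[-1])   # (batch, 5) category logits
```

Because the LSTM consumes the notes in order, it can separate, say, Category 3 (errors only in the upper half of the scale) from Category 2 (errors throughout), even when both produce the same average score.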