Music Classification with Deep Learning
IEEE-format research paper with novel noise-augmentation for robust music identification
Overview
An IEEE-format research paper presenting a novel approach to music identification using CNNs, with a full end-to-end implementation. The core contribution is a noise-augmentation training strategy that overlays 5-20 ambient noise sources per sample for real-world robustness, combined with a per-song classification formulation (57 songs = 57 classes) that is fundamentally harder than genre classification. Three architectures (fully-connected baseline, AlexNet, ResNet-50) are compared and validated in a smartphone-based practical evaluation achieving 86.4% top-1 and 96% top-5 accuracy on live recordings.
Architecture
The system is a full ML pipeline: raw audio is fragmented into 5-second clips, augmented with 5-20 randomly-selected ambient noise layers, converted to mel-spectrograms (128x256 via a precisely computed hop_length=431), and fed into CNN classifiers. Inference uses a sliding-window ensemble that aggregates probability distributions across overlapping time windows, creating a multi-vote ensemble from a single model. The Django API accepts smartphone recordings via ngrok and returns top-5 predictions.
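The 128x256 spectrogram shape follows directly from the hop length. A quick sketch (assuming librosa's default 22050 Hz sample rate and its default center=True padding, under which the frame count is 1 + samples // hop) of why hop_length=431 yields exactly 256 frames for a 5-second clip:

```python
SR = 22050                   # librosa's default sampling rate (assumption)
DURATION_S = 5
N_SAMPLES = SR * DURATION_S  # 110250 samples per 5-second clip
HOP_LENGTH = 431

# With librosa's default center=True padding, frames = 1 + samples // hop.
n_frames = 1 + N_SAMPLES // HOP_LENGTH
print(n_frames)  # 256 -> each clip becomes a 128x256 mel-spectrogram
```

Any larger hop would drop below 256 frames, which is why 431 is described as precisely computed.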
Key Concepts
Multi-Noise Augmentation
Training strategy that layers 5-20 randomly selected ambient noise segments (city traffic, rain, pub, wind, etc.) on each sample with random amplitude shifts. The song fragment gets a +4 dB boost before the noise is added, and the noises are composited iteratively to build a realistic ambient profile.
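The compositing idea can be sketched in NumPy (the actual implementation uses pydub's overlay, shown under Code Highlights; the dB-to-gain helper and the -6..0 dB per-noise shift range here are illustrative assumptions, not values from the paper):

```python
import numpy as np

def db_to_gain(db):
    # Amplitude ratio corresponding to a dB change
    return 10 ** (db / 20)

def mix(song, noises, song_boost_db=4, rng=None):
    """Composite 5-20 random noise layers onto a song fragment (sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    out = song * db_to_gain(song_boost_db)   # +4 dB boost for the song
    k = rng.integers(5, 21)                  # 5-20 layers, inclusive
    for i in rng.choice(len(noises), size=k):
        shift_db = rng.uniform(-6, 0)        # random amplitude shift (assumed range)
        out = out + noises[i] * db_to_gain(shift_db)
    return out
```

Summing layers one at a time mirrors the iterative overlay in the real pipeline: each pass adds one independently attenuated ambient source.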
CNN Architecture Comparison
Systematic comparison of three architectures: a fully-connected baseline (78.3%), AlexNet with LeakyReLU/BatchNorm (92.9% in 32 epochs), and ResNet-50 built from scratch with a novel dropout layer (94.4% in 198 epochs), making the accuracy-vs-efficiency tradeoff explicit.
Code Highlights
def create_dataset(songs, noises, sample_duration=5000,
                   examples_per_song=50, noises_per_song_range=(5, 20)):
    for song_name, song in songs:
        for idx in range(examples_per_song):
            # Boost the song fragment by +4 dB before layering noise
            song_fragment = random_fragment(song, duration=sample_duration) + 4
            # +1 so the upper bound (20 noises) is actually reachable
            random_noises_count = np.random.randint(
                noises_per_song_range[0], noises_per_song_range[1] + 1)
            noise_fragment = None
            for idx2 in range(random_noises_count):
                random_noise_index = np.random.randint(len(noises))
                noise_name, noise = noises[random_noise_index]
                new_noise_fragment = random_fragment(noise, sample_duration)
                if noise_fragment is not None:
                    # Composite noises iteratively into one ambient layer
                    noise_fragment = noise_fragment.overlay(new_noise_fragment)
                else:
                    noise_fragment = new_noise_fragment
            song_fragment = song_fragment.overlay(noise_fragment)

def file_to_spectrograms(self, wave_file, n_mels=128, target_len=256,
                         hop_length=431):
    y, sr = librosa.load(wave_file)
    mel_spec = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    step = 51
    # +1 so the final full window is not dropped
    for i in range(int((mel_spec_db.shape[1] - target_len) / step) + 1):
        yield mel_spec_db[:, i * step:target_len + i * step]
# Aggregate: sum probabilities across all sliding windows
results = alex_net.predict(normalized_spectrograms)
result = [0] * len(classifier_dictionary3)
for r in results:
    for (idx, prob) in enumerate(r):
        result[idx] += prob
Performance
AlexNet achieves 92.9% accuracy with 96.5% precision and 99.3% AUC in 32 epochs. ResNet-50 reaches 94.4% in 198 epochs — 6x more training for only a 1.5-point improvement. Dataset: 57 songs x 200 augmented fragments = 11,400 spectrograms (>3GB). Practical evaluation on 125 live smartphone recordings: 86.4% top-1, 96% top-5 accuracy.
Highlights
- Published IEEE-format research paper presenting a novel noise-augmentation technique for robust music identification — training for real-world degradation rather than clean-sample accuracy
- Novel problem formulation: N classes = N songs (not genre classification), requiring acoustic fingerprint-like feature learning with honest scalability analysis
- Two-tier validation methodology probing robustness boundaries: V1 confirms generalization (92.9%), V2 identifies the failure frontier — rigorous experimental design with honest negative results
- ResNet-50 built from scratch on custom mel-spectrogram inputs with a novel dropout modification empirically shown to stabilize training
- End-to-end system: iOS app → Django API → sliding-window ensemble inference achieves 86.4% top-1 accuracy on 125 real-world smartphone recordings
Related Projects
Expanding-Shrinking Network Model
A novel random graph model combining growth and shrinkage dynamics to produce networks with real-world properties
Novel ES model: at each step, either merge two random nodes (shrink) or split a random node inheriting all edges (expand) — controlled by a single α parameter
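One ES step can be sketched in plain Python under stated assumptions (undirected graph as an adjacency dict of node -> neighbour set; integer node labels; whether the split pair keeps an edge between the two halves is left out, since the summary does not specify it):

```python
import random

def es_step(adj, alpha, rng=random):
    """One step of the ES model (illustrative sketch).

    With probability alpha, split a random node (expand): the new node
    inherits all of the old node's edges. Otherwise, merge two random
    nodes into one (shrink).
    """
    nodes = list(adj)
    if rng.random() < alpha and nodes:
        # Expand: split node u; new node v inherits all of u's edges.
        u = rng.choice(nodes)
        v = max(adj) + 1              # fresh integer label (assumption)
        adj[v] = set(adj[u])
        for w in adj[u]:
            adj[w].add(v)
    elif len(nodes) >= 2:
        # Shrink: merge node v into node u.
        u, v = rng.sample(nodes, 2)
        for w in adj.pop(v):
            adj[w].discard(v)
            if w != u:
                adj[u].add(w)
                adj[w].add(u)
        adj[u].discard(v)             # drop any u-v edge
        adj[u].discard(u)             # no self-loops
    return adj
```

Both branches preserve the symmetry of the adjacency structure, so repeated steps always yield a valid undirected graph whose size drifts up or down depending on alpha.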
SwifyPy Numerical Mathematics Library
Protocol-oriented numerical computing in Swift with BLAS-accelerated matrix operations
Protocol-oriented numerical computing: six matrix types sharing a MatrixProtocol interface with compile-time BLAS dispatch through protocol witness tables