2021 · Python · TensorFlow/Keras · LaTeX · Swift · Django

Music Classification with Deep Learning

IEEE-format research paper with novel noise-augmentation for robust music identification

Overview

An IEEE-format research paper presenting a novel approach to music identification using CNNs, with a full end-to-end implementation. The core contribution is a noise-augmentation training strategy that overlays 5-20 ambient noise sources per sample for real-world robustness, combined with a per-song classification formulation (57 songs = 57 classes) that is fundamentally harder than genre classification. Three architectures are compared (a fully-connected baseline, AlexNet, and ResNet-50), and the approach is validated with a smartphone-based practical evaluation achieving 86.4% top-1 and 96% top-5 accuracy on live recordings.

Architecture

[Figure: end-to-end system architecture diagram]

The system is a full ML pipeline: raw audio is fragmented into 5-second clips, augmented with 5-20 randomly-selected ambient noise layers, converted to 128×256 mel-spectrograms (hop_length=431, chosen so that a 5-second clip yields exactly 256 frames), and fed into CNN classifiers. Inference uses a sliding-window ensemble that aggregates probability distributions across overlapping time windows, creating a multi-vote ensemble from a single model. The Django API accepts smartphone recordings via an ngrok tunnel and returns top-5 predictions. The hop-length arithmetic is sketched below.
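
A minimal sketch of that hop-length arithmetic, assuming librosa's default 22,050 Hz sample rate (consistent with the inference code below, which calls librosa.load without an explicit sr):

SAMPLE_RATE = 22050   # librosa's default sample rate
CLIP_SECONDS = 5
TARGET_FRAMES = 256

# librosa emits 1 + floor(n_samples / hop_length) frames (center=True),
# so this hop maps a 5-second clip onto 256 spectrogram columns:
hop_length = round(CLIP_SECONDS * SAMPLE_RATE / TARGET_FRAMES)
assert hop_length == 431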

Key Concepts

Multi-Noise Augmentation

Training strategy that layers 5-20 randomly-selected ambient noise segments (city traffic, rain, pub, wind, etc.) onto each sample with random amplitude shifts. The song fragment receives a +4 dB boost before the noise is mixed in, and the noises are composited iteratively to build a realistic ambient profile; a sketch of one amplitude-shifted overlay step follows.
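
A minimal sketch of the amplitude-shifted overlay, assuming pydub's AudioSegment API; the gain range here is illustrative, not taken from the paper:

import random
from pydub import AudioSegment

def overlay_with_random_gain(base: AudioSegment, noise: AudioSegment,
                             gain_range_db=(-10.0, 0.0)) -> AudioSegment:
    # pydub applies gain via the + operator (AudioSegment + dB), so shift
    # the noise by a random number of decibels before compositing it
    shifted = noise + random.uniform(*gain_range_db)
    return base.overlay(shifted)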

CNN Architecture Comparison

Systematic comparison of three architectures: a fully-connected baseline (78.3%), AlexNet with LeakyReLU/BatchNorm (92.9% in 32 epochs), and a ResNet-50 built from scratch with a novel dropout layer (94.4% in 198 epochs). The comparison makes the accuracy-vs-efficiency tradeoff explicit; one possible form of the dropout-modified residual block is sketched below.
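
A minimal Keras sketch of a residual block with dropout inserted between its convolutions; the exact placement used in the paper's ResNet-50 is not specified here, so this layout is an assumption:

from tensorflow.keras import layers

def residual_block_with_dropout(x, filters, dropout_rate=0.3):
    # Two-conv residual block; assumes the input already has `filters`
    # channels so the identity shortcut matches. The Dropout placement
    # (after the first activation) is illustrative, not the paper's spec.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Dropout(dropout_rate)(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.add([shortcut, y])
    return layers.Activation("relu")(y)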

Code Highlights

Noise-Layering Augmentation Algorithm
import numpy as np
# songs and noises are lists of (name, AudioSegment) pairs; random_fragment
# (defined elsewhere in the repo) returns a random slice of the requested
# duration in milliseconds.

def create_dataset(songs, noises, sample_duration=5000,
                   examples_per_song=50, noises_per_song_range=(5, 20)):
    for song_name, song in songs:
        for idx in range(examples_per_song):
            # +4 dB keeps the song audible above the composited noise floor
            song_fragment = random_fragment(song, duration=sample_duration) + 4
            # +1 because np.random.randint excludes the upper bound; without
            # it the advertised 20-layer maximum is never reached
            random_noises_count = np.random.randint(
                noises_per_song_range[0], noises_per_song_range[1] + 1)
            noise_fragment = None
            for idx2 in range(random_noises_count):
                random_noise_index = np.random.randint(len(noises))
                noise_name, noise = noises[random_noise_index]
                new_noise_fragment = random_fragment(noise, sample_duration)
                if noise_fragment is not None:
                    noise_fragment = noise_fragment.overlay(new_noise_fragment)
                else:
                    noise_fragment = new_noise_fragment
            song_fragment = song_fragment.overlay(noise_fragment)
            yield song_name, song_fragment  # emit the augmented clip
Sliding-Window Ensemble Inference
import librosa
import numpy as np

def file_to_spectrograms(self, wave_file, n_mels=128, win_len=256,
                         hop_length=431):
    # librosa's default 22,050 Hz; hop_length=431 gives ~256 frames / 5 s
    y, sr = librosa.load(wave_file)
    mel_spec = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    step = 51  # ~1-second stride between overlapping windows
    # win_len replaces the original parameter name `len`, which shadowed
    # the builtin; +1 keeps the final fully-contained window from dropping
    for i in range((mel_spec_db.shape[1] - win_len) // step + 1):
        yield mel_spec_db[:, i * step : i * step + win_len]

# Aggregate: sum class probabilities across all sliding windows, so a
# single model casts one vote per window (a multi-vote ensemble)
results = alex_net.predict(normalized_spectrograms)
result = [0.0] * len(classifier_dictionary3)
for r in results:
    for idx, prob in enumerate(r):
        result[idx] += prob
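
To turn the summed scores into the API's top-5 response, a sketch like the following would suffice; classifier_dictionary3 is assumed here to map class indices to song names:

import numpy as np

# Indices of the five highest summed scores, best first
top5 = np.argsort(result)[::-1][:5]
top5_songs = [classifier_dictionary3[idx] for idx in top5]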

Performance

[Figure: per-architecture performance chart]

AlexNet achieves 92.9% accuracy with 96.5% precision and 99.3% AUC in 32 epochs. ResNet-50 reaches 94.4% in 198 epochs: roughly 6× the training for a 1.5-percentage-point gain. Dataset: 57 songs × 200 augmented fragments = 11,400 spectrograms (>3 GB). Practical evaluation on 125 live smartphone recordings: 86.4% top-1 and 96% top-5 accuracy.

Highlights

  • Published IEEE-format research paper presenting a novel noise-augmentation technique for robust music identification — training for real-world degradation rather than clean-sample accuracy
  • Novel problem formulation: N classes = N songs (not genre classification), requiring acoustic fingerprint-like feature learning with honest scalability analysis
  • Two-tier validation methodology probing robustness boundaries: V1 confirms generalization (92.9%), V2 identifies the failure frontier — rigorous experimental design with honest negative results
  • ResNet-50 built from scratch on custom mel-spectrogram inputs with a novel dropout modification empirically shown to stabilize training
  • End-to-end system: iOS app → Django API → sliding-window ensemble inference achieves 86.4% top-1 accuracy on 125 real-world smartphone recordings