This paper predicts hit songs from musical features extracted from MIDI files. The task is modeled as a binary classification problem optimized for precision, with Billboard rankings as labels. The Million Song Dataset (MSD) is inspected audibly, visually, and with a logistic regression model, and its features are found to be too noisy for the task. MIDI files, which encode pitch and duration in separate instrument tracks, are chosen over MSD. Fine-grained instrument, melody, and beat features are extracted. N-gram language models are used to transform the raw musical features into word-document frequency matrices. Logistic regression is chosen as the classifier, with a raised probability cutoff to optimize for precision. An ensemble method that uses both instrument/melody and beat features achieves a peak precision of 0.882 at a probability cutoff of 0.998 (recall 0.279). Alternative models and applications are discussed.
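
The two mechanisms named above, n-gram frequency matrices and a raised probability cutoff, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the MIDI pitch sequences, the labels, and the predicted probabilities are all hypothetical, and the classifier itself is omitted (the probabilities stand in for the output of a fitted logistic regression).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_matrix(documents, n):
    """Build a document-by-n-gram count matrix plus its column vocabulary."""
    counts = [Counter(ngrams(doc, n)) for doc in documents]
    vocab = sorted(set().union(*counts))
    matrix = [[c[gram] for gram in vocab] for c in counts]
    return matrix, vocab


def precision_recall(probs, labels, cutoff):
    """Precision and recall when predicting 'hit' for probs >= cutoff."""
    predicted = [p >= cutoff for p in probs]
    tp = sum(1 for hit, y in zip(predicted, labels) if hit and y == 1)
    fp = sum(1 for hit, y in zip(predicted, labels) if hit and y == 0)
    fn = sum(1 for hit, y in zip(predicted, labels) if not hit and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical melodies as MIDI pitch numbers (assumed data, one list per song).
songs = [[60, 62, 64, 62, 60], [60, 62, 64, 65, 67]]
matrix, vocab = count_matrix(songs, n=2)

# Hypothetical classifier outputs and true labels (1 = hit, 0 = non-hit).
probs = [0.99, 0.95, 0.60, 0.40, 0.20]
labels = [1, 0, 1, 1, 0]

default_pr = precision_recall(probs, labels, cutoff=0.5)
strict_pr = precision_recall(probs, labels, cutoff=0.97)
print(vocab)
print(matrix)
print(default_pr, strict_pr)
```

On this toy data, raising the cutoff from 0.5 to 0.97 removes the one false positive and so increases precision, but it also drops a true hit and lowers recall, which is the same trade-off as the reported 0.882 precision at 0.279 recall.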

Paper: Predicting Hit Songs with MIDI Musical Features