Predicting which songs from Triple J's Unearthed DAB+ radio station will be played on Triple J.
Triple J Unearthed is the Australian music discovery initiative of Triple J. It has kicked off the careers of thousands of musicians and hosts well over 120,000 tracks on its website. Triple J Unearthed also runs a full-time digital radio station playing music selected from the website. One of the most important steps an Australian musician can take is moving from airplay on Triple J Unearthed’s digital station to the ‘main’ Triple J station. This model is designed to predict which songs are going to make that transition.
I’ve written some code to automatically collect playlist data each week and gather as much information as I can about the artists included in that playlist. The result is a database including information like the time of day the song was played, the genre of the artist, the length of the song, the hometown of the artist, the artist's Facebook and Twitter audience, and more.
Each week, the newly collected data is added to the existing dataset. A machine learning algorithm is then trained on validated data from up to one month ago (during the collection step I also check which songs did end up being played on Triple J). With more validated data each week, the tool should get smarter over time.
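As a rough illustration of that weekly split, here is a minimal sketch using only the standard library. It assumes that a song's outcome is treated as settled once about 30 days have passed since its first play. The window length, the field names and the example rows are all hypothetical, not taken from the real database.

```python
from datetime import date, timedelta

# Assumption: a row counts as "validated" once ~30 days have passed
# since the song's first play on the Unearthed station.
VALIDATION_WINDOW = timedelta(days=30)


def split_by_age(rows, today, window=VALIDATION_WINDOW):
    """Split rows into (training, pending) by the age of their first play."""
    cutoff = today - window
    training = [r for r in rows if r["first_played"] <= cutoff]
    pending = [r for r in rows if r["first_played"] > cutoff]
    return training, pending


# Hypothetical example rows; the real dataset has many more columns.
rows = [
    {"track": "Song A", "first_played": date(2020, 5, 1), "played_on_triple_j": True},
    {"track": "Song B", "first_played": date(2020, 5, 20), "played_on_triple_j": False},
    {"track": "Song C", "first_played": date(2020, 6, 25), "played_on_triple_j": None},
]

training, pending = split_by_age(rows, today=date(2020, 7, 1))
print([r["track"] for r in training])  # rows old enough to train on
print([r["track"] for r in pending])   # recent rows, outcome not yet known
```

Rows in `training` would feed the model, while rows in `pending` are the recent first plays whose outcome is still unknown.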
Once the model is trained, it selects songs from the Triple J Unearthed digital station's playlist that received their first play recently and predicts which of those will be played on Triple J. It then sends those predictions to this site. Here are the most recent of those predictions:
As well as predictions, the model also tells us how accurate it was during its training. Here is the latest report of its accuracy:
While building this weekly loop of collection and prediction was incredibly interesting to work through, it's the lessons from the various models' behaviour in testing that are the most rewarding. I was able to examine which data points the models learned to rely on. This information could be applied to other tools, useful for record label A&R, in the future.
The code runs on a stock Raspberry Pi 2 Model B sitting on a shelf in my home. It has a 900MHz quad-core ARM Cortex-A7 CPU and 1GB RAM. This credit card-sized computer is able to handle the process of gathering data, cleaning it, training the model and making predictions. The process is initiated by Cron.
The project was written entirely in Python. Here is a selection of the libraries used:
• scikit-learn for building the machine learning model.
• PyCaret for initial machine learning model comparisons.
• BeautifulSoup / Selenium / Requests for web scraping.
• Pandas for data frame management.
• FTPLib for transferring files to this web server.
• NumPy / SciPy for crunching numbers.
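As one example from the list above, the upload step could look something like the sketch below. It uses `ftplib` from the standard library. The helper name `upload_file`, the host, the credentials and the file names are all placeholders, not the project's real values.

```python
from ftplib import FTP


def upload_file(ftp, local_path, remote_name):
    """Upload one file in binary mode over an already-open FTP connection."""
    with open(local_path, "rb") as fh:
        # storbinary sends the file contents using the given STOR command
        # and returns the server's response string.
        return ftp.storbinary(f"STOR {remote_name}", fh)


# Usage (host, credentials and file names are placeholders):
#
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     upload_file(ftp, "predictions.csv", "predictions.csv")
```

Passing the connection in as an argument keeps the helper easy to test without a real server.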
The model used in the current build of this project is a Histogram-based Gradient Boosting Classification Tree. It is a faster implementation of a Gradient Boosting Classifier: it first groups continuous feature values into a small number of bins, which greatly reduces the number of split points the trees need to evaluate. For further reading on this, check the references at the bottom of this page.
Below are some graphics generated while building and training initial models during the early stages of this project.
I’ve collected a few resources for further reading:
• Triple J Unearthed on Wikipedia.
• Cron on Wikipedia.
• MachineLearningMastery article on Gradient Boosting methods.
• scikit-learn documentation for HistGradientBoostingClassifier.
• Raspberry Pi 2 Model B product page.
• Article by Rahul Nayak about web scraping with Selenium and BeautifulSoup.
• PyCaret documentation regarding comparing classification models.
• TowardsDataScience article about understanding models with SHAP values.
© Copyright 2020 Tom Gilmore. All Rights Reserved.