I love music. I really do. I spend a ton of time listening to music on Spotify and YouTube (even Amazon when I have to). I listen to music at home, at work, at the gym, everywhere. One of my favourite things is discovering new music, and one of my few claims to fame is having had a ton of 8tracks followers back in the day when 8tracks was still a thing - for real, check my old account out. Like my old 8tracks playlists, Spotify does a pretty good job of creating playlists with music I like, and YouTube's algorithm every so often will start pushing a song in my recommendations that I end up really liking. However, sometimes I'm just in a mood and I want to listen to a particular type of music, and that's where Spotify and YouTube kinda fail. That got me thinking: what if I could take a whole bunch of songs and use their audio features to group them into playlists of similar-sounding songs? If I could do that, I could take the playlists and filter on the audio features I want. Luckily for everyone who cares, using a k-means clustering algorithm and song features downloaded through the Spotify API, I was able to do just that. Throughout this article I'll be talking about playlists and clusters somewhat interchangeably, since by the end each cluster will represent a unique playlist.
The first step was to decide on a list of songs to group into playlists. There is a massive universe of songs, so I restricted myself to the weekly Billboard top 100 hits from 1998 to the end of 2018. Once I had my list of songs, I used the Spotify API to get each song's features. I wrote a function that took the track name and artist name, searched Spotify for the song's unique ID, and then used that ID to pull the audio features. I immediately noticed that this function was failing a lot of the time. Upon further investigation, I found almost all of the failures came from songs featuring multiple artists - for example, "Wolves by Marshmello featuring Selena Gomez" wouldn't get picked up because Spotify lists the song as "Wolves by Marshmello." Once I figured this out, I simply removed any mention of featured artists and cleaned up weird symbols (if you're interested, you can check out the function). After that, Spotify could identify over 90% of the songs on my list and attach a unique identifier to each one, which was crucial for getting the music features.
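If you're curious what that lookup might look like, here's a minimal sketch using the spotipy library. To be clear, this is a reconstruction rather than my actual function (that's linked above): the get_track_id helper and the cleanup patterns are just illustrative assumptions.

import re
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

#assumes SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are set as
#environment variables
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def get_track_id(track_name, artist_name):
    #Spotify usually lists only the primary artist, so strip any
    #mention of featured artists before searching
    artist_name = re.split(r'\s+(featuring|feat\.?|ft\.?)\s+',
                           artist_name, flags=re.IGNORECASE)[0]
    #clean up weird symbols that trip up the search
    track_name = re.sub(r'[^\w\s]', '', track_name)
    results = sp.search(q='track:{} artist:{}'.format(track_name, artist_name),
                        type='track', limit=1)
    items = results['tracks']['items']
    return items[0]['id'] if items else None

#the unique ID is what unlocks the audio features
track_id = get_track_id('Wolves', 'Marshmello featuring Selena Gomez')
features = sp.audio_features([track_id]) if track_id else None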
import pandas as pd
import numpy as np
import qgrid
import matplotlib.pyplot as plt
from IPython.display import IFrame, display
billboard_charts_2000_on = pd.read_csv('https://raw.githubusercontent.com/sampurkiss/song_features/master/song_details.csv')
So now I have a database of all songs that made it onto Billboard's weekly top 100, along with each song's audio features. Now it's time to do some data exploration. First off, I was curious whether we can see any change over time in what songs are making the charts. I converted the key variable, currently just a number representing a key, to the name of the key, and then plotted the relative number of each over time. The chart below shows the percentage of songs in each key over time. It's easy to see there are periods where certain keys are more popular, and that some keys like D# are never that popular. For a bit of context on keys you can search for songs by key here.
#check which key values appear in the data
keys = sorted(billboard_charts_2000_on['key'].dropna().unique())

#map Spotify's numeric key codes to key names
key_dict = {"0":"C","1":"C#","2":"D","3":"D#",
            "4":"E","5":"F","6":"F#","7":"G","8":"G#",
            "9":"A","10":"A#","11":"B"}

temp = billboard_charts_2000_on.copy()
temp['num_of_songs'] = 1
#drop rows with no key before mapping, since int(NaN) fails
temp = temp.dropna(subset=['key'])
temp.loc[:,'key'] = temp.loc[:,'key'].map(lambda x: key_dict[str(int(x))])
#count songs per key per year, then convert the counts to shares
temp = temp[['year','key','num_of_songs']].groupby(by=['year','key']).sum().reset_index()
temp = temp.pivot(index='year', columns='key', values='num_of_songs')
temp = temp.div(temp.sum(axis=1), axis=0)

temp.plot.bar(stacked=True, legend=False)
plt.title('Music Keys Over Time')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
I wanted to look at more than just key changes because, let's be honest, even though the above graph is kind of interesting, it doesn't have a lot of practical meaning if you're not a professional musician. So I also looked at average values of a few different features over time.
The first graph shows energy, danceability, and valence. From Spotify: energy, measured from 0 to 1, represents intensity and activity; high-energy tracks feel loud and noisy, low-energy tracks are the opposite. Danceability measures how suitable a song is for dancing based on several different measures: 1 means highly danceable, 0 means not danceable at all. Valence describes the musical cheerfulness of a song, also measured between 0 and 1; high valence means the song feels positive and happy.
As you can see, valence has been steadily decreasing since 2000, which I guess means that popular music is getting steadily more depressing and angry over time? Perhaps we can blame Adele's Hello and most Billie Eilish songs for that. Oddly enough, danceability and energy seem to move in opposite directions, which would indicate that danceability isn't heavily tied to energy. As an economist, I was kinda hoping to see some movement around recessionary periods: e.g. after 2008, at the peak of the recession, I thought popular music might have gotten more depressing, or valence might have initially decreased and then increased as the economy recovered. That doesn't appear to be happening, but I think it's an interesting point for future research.
temp = billboard_charts_2000_on[['year','energy','danceability','valence']].groupby(by='year').mean()
temp.plot.line()
plt.xlabel('Year')
plt.xticks(np.arange(1998,2019,5))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
This next chart shows a few more features over time, mostly related to musical/lyrical qualities. There are some interesting patterns here which I'd encourage you to look into, particularly the peaks and troughs.
temp = billboard_charts_2000_on[['year','acousticness','instrumentalness','liveness','speechiness']].groupby(by='year').mean()
temp.plot.line()
plt.xlabel('Year')
plt.xticks(np.arange(1998,2019,5))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
Finally, I restricted the dataset to just the features I was interested in. I kept one dataset in pandas form and converted the other to numpy to run the clustering algorithm on. That way, once I've assigned a cluster to each song, I can attach the cluster number to the pandas dataset and easily look through it to see which songs have been clustered together. I included most features except for the key variable: when creating these playlists, it didn't really matter to me whether the songs were in the same key.
charts = billboard_charts_2000_on[['Song','Performer','track_id',
                                   'danceability', 'energy',
                                   'loudness', 'mode', 'speechiness', 'acousticness',
                                   'instrumentalness', 'liveness', 'valence', 'tempo',
                                   'time_signature']].reset_index(drop=True)
#Because songs show up multiple times, must eliminate duplicates, also remove any nas.
charts = charts.drop_duplicates()
charts = charts.dropna()

#keep a pandas copy with song names for labelling clusters later
name_matrix = charts
#numeric features only (key is deliberately left out)
X = charts[['danceability', 'energy',
            'loudness', 'mode', 'speechiness', 'acousticness',
            'instrumentalness', 'liveness', 'valence', 'tempo',
            'time_signature']].to_numpy()
Now that I have my data table, I want to normalize all the features. In this case I'm subtracting each feature's mean (so the features are centred at 0) and dividing by the largest value in each feature's range.
def normalize_values(X, use_sigma=True):
    """
    Normalizes the values of X. Can either normalize
    by dividing each feature vector by its respective standard
    deviation or by the max value in the respective vector.
    Parameters:
        X: a numpy matrix with dataset of interest
        use_sigma: whether to normalize using sigma
            (True) or by the largest number in the feature
            vector (False)
    Returns:
        X_normalized: X values normalized based off the
            decided setting
    """
    mu = X.mean(axis=0)
    if use_sigma:
        divisor = X.std(axis=0)
    else:
        divisor = np.max(X, axis=0)
    X_normalized = np.divide((X - mu), divisor)
    return X_normalized

X = normalize_values(X, use_sigma=False)
Now that my features are normalized, I can run the clustering algorithm. The first step is choosing the number of clusters and initializing them by picking starting points at random. At this point I have over 6,000 songs to pick from, so I went with 300 clusters to make sure I get a good number of playlists that aren't too long (6,000 songs across 300 clusters averages out to about 20 songs per playlist). I can initialize the clusters by picking 300 different songs as starting points. When picking these songs, the most important thing is to pick without replacement (i.e. never pick the same song as a starting point more than once).
def initialize_centroids(X, k):
    """
    Used to select a starting point for the centroids. The
    algorithm samples k distinct rows of X as the initial
    centroids, then allocates every row of X to its nearest
    starting centroid.
    Parameters:
        X: a numpy matrix with the dataset of interest
        k: number of centroids to initialize
    Returns:
        centroids, cluster_index
    """
    m, features = X.shape
    #sample without replacement so no song seeds two clusters
    initial_centroids = np.random.choice(range(0, m), k, replace=False)
    centroids = X[initial_centroids]
    #allocate each song to whichever starting centroid is closest
    cluster_index = np.array([np.square(row - centroids).sum(axis=1).argmin()
                              for row in X])
    return centroids, cluster_index
#define number of clusters and generate starting point:
k = 300
centroids, cluster_allocation = initialize_centroids(X, k)
Now we can actually run the clustering process. I have two functions: one that calculates the next centroid locations and reassigns songs to them, and one that calculates the cost function.
def generate_centroids(X, index_location_of_centroids, k):
    """
    Takes the existing allocation of points to centroids and
    calculates the next centroid values. Specifically, it sets
    each centroid to the average location of all X values
    allocated to it, then reassigns every row of X to its
    nearest new centroid.
    Parameters:
        X: a numpy matrix of feature values
        index_location_of_centroids: an array indicating which
            centroid each row in X belongs to
        k: number of centroids
    Returns:
        cluster_each_x_belongs_to: the updated values
            for index_location_of_centroids
        new_centroids: the new centroid values
    """
    m, features = X.shape
    new_centroids = np.zeros((k, features))
    for i in range(0, k):
        members = X[index_location_of_centroids == i]
        if len(members) > 0:
            #get the average "location" of each cluster
            new_centroids[i, :] = members.mean(axis=0)
        else:
            #a cluster that lost all its songs gets re-seeded
            #with a random song
            new_centroids[i, :] = X[np.random.randint(m)]
    #reassign each song to the closest of the new centroids
    cluster_each_x_belongs_to = np.zeros(m, dtype='int')
    for row_num, row in enumerate(X):
        distance = np.sum(np.square(row - new_centroids), axis=1)
        #the index of the smallest distance is the new cluster
        cluster_each_x_belongs_to[row_num] = distance.argmin()
    return cluster_each_x_belongs_to, new_centroids
def k_means_cost(X, centroid_values, centroid_index_locations):
    """
    Calculates the cost function for the results of the clustering
    algorithm: the average squared distance between each song and
    the centroid it's assigned to.
    """
    m, features = X.shape
    J = 1/m * np.square(X - centroid_values[centroid_index_locations]).sum()
    return J
J_list = []
for i in range(0, 50):
    cluster_allocation, centroids = generate_centroids(X, cluster_allocation, k)
    J = k_means_cost(X, centroids, cluster_allocation)
    J_list.append(J)

#peek at the songs that ended up in one of the clusters
group = name_matrix.loc[cluster_allocation == 1]
#Attach the array containing each song's cluster grouping to the
#original dataset
name_matrix.insert(3, 'cluster_grouping', cluster_allocation)
We can plot the cost function at each iteration to check that it's decreasing. As you can see, the cost decreases consistently, which is a good sign the algorithm is working correctly (k-means should never increase the cost from one iteration to the next).
plt.plot(J_list)
plt.xlabel('Number of iterations')
plt.ylabel('Cost')
And there you have it. Now I have a list of over 6,000 songs and a number indicating which grouping/playlist each belongs to.
To start sorting through the features of interest, I summarized the groupings by their average features (e.g. average danceability). Feel free to sort through the summary table below to get a feel for each playlist.
cluster_summary = name_matrix.groupby('cluster_grouping').mean(numeric_only=True).reset_index()
grid_widget = qgrid.show_grid(cluster_summary,show_toolbar=True)
display(grid_widget)
Now that you can see what each playlist roughly contains, you can filter the main song list by playlist/cluster to see exactly which songs are in each. You can also check out the full data set here.
name_matrix[['Song','Performer','cluster_grouping']]
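And this is where the mood filtering from the start of the post comes in. For example, if I'm in the mood for something danceable, I can grab the cluster with the highest average danceability from the summary table and pull up its songs:

#find the cluster with the highest average danceability
most_danceable = cluster_summary.sort_values('danceability', ascending=False)
best_cluster = most_danceable['cluster_grouping'].iloc[0]
#list the songs in that playlist
name_matrix.loc[name_matrix['cluster_grouping'] == best_cluster,
                ['Song','Performer']]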
Now there's a database with a list of songs and the playlist number each has been allocated to. From here, it's easy to take the unique song IDs and create a playlist on Spotify. In fact, that's just what I've done: below is a Spotify playlist created from one of these clusters.
IFrame(src='https://open.spotify.com/embed/playlist/44XNhDnuvKuw9DIM7GntNP',
       width=700, height=600, allowtransparency="true",
       allow="encrypted-media")
I'm in the process of putting together an interactive app so you can create your own playlists by just inputting a cluster number, so if you're interested in that, keep your eyes peeled. In the meantime, I built a simple function to do just that (with some modification).
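Here's a rough sketch of what that function might look like, using spotipy rather than my exact code; cluster_to_playlist is an illustrative name, and you'd need a user-authorized token with the playlist-modify-public scope for it to run.

import spotipy
from spotipy.oauth2 import SpotifyOAuth

#needs a user-authorized token (client ID, secret, and redirect URI
#are assumed to be set as environment variables)
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope='playlist-modify-public'))

def cluster_to_playlist(cluster_number, playlist_name):
    #pull the Spotify IDs for every song in the chosen cluster
    track_ids = name_matrix.loc[
        name_matrix['cluster_grouping'] == cluster_number, 'track_id'].tolist()
    user_id = sp.current_user()['id']
    playlist = sp.user_playlist_create(user_id, playlist_name)
    #Spotify caps each add request at 100 tracks
    for start in range(0, len(track_ids), 100):
        sp.playlist_add_items(playlist['id'], track_ids[start:start + 100])
    return playlist['id']

cluster_to_playlist(1, 'Cluster 1 playlist')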
Let me know what you think of these song groups. Do you think they make sense?