K-means Clustering to Create Custom Spotify Playlists

I love music. I really do. I spend a ton of time listening to music on Spotify and YouTube (even Amazon when I have to). I listen to music at home, at work, at the gym, everywhere. One of my favourite things is discovering new music, and one of my few claims to fame is having had a ton of 8tracks followers back in the day when 8tracks was still a thing - for real, check my old account out. Just like me on 8tracks, Spotify does a pretty good job of creating playlists with music I like, and YouTube's algorithm every so often will start pushing a song in my recommendations that I end up really liking. However, sometimes I'm just in a mood and I want to listen to a specific type of music. Spotify and YouTube kinda fail there. That got me thinking: what if I could take a whole bunch of songs and their audio features, and then group those songs into playlists alongside other songs with similar features? If I could do that, I could take the playlists and filter on the audio features I want. Luckily for everyone who cares about this, by using a k-means clustering algorithm and the Spotify API to download song features, I was able to do just that. Throughout this article I'll be talking about playlists and clusters somewhat interchangeably, since by the end each cluster will represent a unique playlist.

The first step was to decide on a list of songs to group into playlists. There is a massive universe of songs, so I decided to only use the weekly Billboard top 100 hits from 1998 to the end of 2018. Once I had my list of songs, I used the Spotify API to get each song's audio features. I wrote a function that took the track name and artist name, searched Spotify for the unique song ID, and then used that ID to pull the audio features. I immediately noticed that this function was failing a lot of the time. Upon further investigation, I found almost all of the problems came from songs featuring multiple artists - for example, "Wolves by Marshmello featuring Selena Gomez" wouldn't get picked up because Spotify had the song as "Wolves by Marshmello." Once I figured this out, I simply removed any mention of featured artists and cleaned up weird symbols (if you're interested, you can check out the function). After that, Spotify could identify over 90% of the songs on my list and attach a unique identifier code to them. Getting Spotify's unique identifier was crucial for getting the music features.
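
For a rough idea of what that lookup can look like, here's a minimal sketch using the spotipy client library. The credentials and the cleaning regex are placeholders, and get_track_id is a simplified stand-in for my actual function rather than the real thing:

import re
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

#Placeholder credentials - swap in your own Spotify developer keys
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
        client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET'))

def get_track_id(track_name, artist_name):
    #Drop any "featuring ..." text so Spotify's search can find a match
    track_name = re.sub(r'(?i)\s*\b(feat\.?|featuring)\s.*', '', track_name)
    artist_name = re.sub(r'(?i)\s*\b(feat\.?|featuring)\s.*', '', artist_name)
    results = sp.search(q='track:{} artist:{}'.format(track_name, artist_name),
                        type='track', limit=1)
    items = results['tracks']['items']
    return items[0]['id'] if items else None

#The returned ID can then be passed to the audio features endpoint
track_id = get_track_id('Wolves', 'Marshmello Featuring Selena Gomez')
features = sp.audio_features([track_id])[0] if track_id else None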

In [4]:
import pandas as pd
import numpy as np
import qgrid
import matplotlib.pyplot as plt
from IPython.display import IFrame, display
billboard_charts_2000_on = pd.read_csv('https://raw.githubusercontent.com/sampurkiss/song_features/master/song_details.csv')

So now I have a database of every song that made it onto Billboard's weekly top 100, along with each song's audio features. Now it's time for some data exploration. First off, I was curious whether we can see any change over time in what songs are making the charts. I converted the key variable, currently just a number representing a key, to the name of the key and then plotted the relative number of songs in each key over time. The chart below shows the percentage of songs in each key by year. It's easy to see that there are periods where certain keys are more popular and that some keys, like D#, are never that popular. For a bit of context on keys, you can search for songs by key here.

In [5]:
# Check which key values appear in the data (Spotify encodes keys as integers 0-11)
keys = sorted(billboard_charts_2000_on['key'].dropna().unique())

# Map Spotify's integer key codes to note names
key_dict = {"0":"C","1":"C#","2":"D","3":"D#",
        "4":"E","5":"F","6":"F#","7":"G","8":"G#",
        "9":"A","10":"A#","11":"B"}
temp = billboard_charts_2000_on.copy()
temp['num_of_songs'] = 1
temp.loc[:,'key'] = temp.loc[:,'key'].map(lambda x: key_dict[str(int(x))])
# Count songs per key per year, then convert the counts to shares of each year's total
temp = temp[['year','key','num_of_songs']].groupby(by=['year','key']).sum().reset_index()
temp = temp.pivot(index='year', columns='key', values='num_of_songs')
temp = temp.div(temp.sum(axis=1), axis=0)
temp.plot.bar(stacked=True, legend=False)
plt.title('Music Keys Over Time')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
Out[5]:
<matplotlib.legend.Legend at 0x20acaaf0cf8>

I wanted to look at more than just keys because, let's be honest, even though the above graph is kind of interesting, it doesn't have a lot of practical meaning if you're not a professional musician. So I also looked at the average values of a few different features over time.

The first graph shows energy, danceability and valence. From Spotify: energy is a measure from 0 to 1 of intensity and activity - high-energy tracks feel loud and noisy, while low-energy tracks are the opposite. Danceability measures how suitable a song is for dancing, based on several underlying measures: 1 means highly danceable, 0 means not danceable at all. Valence describes the musical cheerfulness of a song, measured between 0 and 1 - high valence means the song feels positive and happy.

As you can see, valence has been steadily decreasing since 2000, which I guess means popular music is getting steadily more depressing and angry over time? Perhaps we can blame Adele's Hello and most Billie Eilish songs for that. Oddly enough, danceability and energy seem to move in opposite directions, which would indicate that danceability isn't heavily tied to energy. As an economist, I was kinda hoping to see some movement around recessionary periods - e.g. after 2008, in the depths of the recession, I thought popular music might have gotten more depressing, with valence initially decreasing and then increasing as the economy recovered. That doesn't appear to have happened, but I think it's an interesting point for future research.

In [6]:
temp = billboard_charts_2000_on[['year','energy','danceability','valence']].groupby(by='year').mean()
temp.plot.line()
plt.xlabel('Year')
plt.xticks(np.arange(1998,2019,5))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
Out[6]:
<matplotlib.legend.Legend at 0x20acc10df28>

This next chart shows a few more features over time, mostly related to musical and lyrical qualities. There are some interesting patterns here that I'd encourage you to look into - the peaks and troughs are particularly interesting.

In [7]:
temp = billboard_charts_2000_on[['year','acousticness','instrumentalness','liveness','speechiness']].groupby(by='year').mean()
temp.plot.line()
plt.xlabel('Year')
plt.xticks(np.arange(1998,2019,5))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
Out[7]:
<matplotlib.legend.Legend at 0x20acadb5978>

Finally, I restricted the dataset to just the features I was interested in. I kept one dataset in pandas form and converted one to a numpy array to run the clustering algorithm on. That way, once I've assigned a cluster to each song, I can attach the cluster number to the pandas dataset and easily look through the data to see which songs have been clustered together. I included most features except for the key variable - when creating these playlists, it didn't really matter to me whether the songs were in the same key.

In [14]:
charts = billboard_charts_2000_on[['Song','Performer','track_id',
                              'danceability', 'energy',
                              'key', 'loudness', 'mode', 'speechiness', 'acousticness',
                              'instrumentalness', 'liveness', 'valence', 'tempo', 
                              'time_signature']].reset_index(drop=True)

#Because songs show up multiple times, must eliminate duplicates, also remove any nas.
charts = charts.drop_duplicates()
charts = charts.dropna()
name_matrix = charts

#Note: key is left out of the clustering features since matching keys
#doesn't matter for these playlists
X = charts[['danceability', 'energy',
                              'loudness', 'mode', 'speechiness', 'acousticness',
                              'instrumentalness', 'liveness', 'valence', 'tempo', 
                              'time_signature']].copy()
X = np.matrix(X)

Now that I have my data table, I want to normalize all the features. In this case I'm centering each feature on a mean of 0 and then dividing by the largest value in that feature's range.

In [15]:
def normalize_values(X,use_sigma=True):
    """
    Normalizes the values of X. Can either normalize
    by dividing each feature vector by its respective standard 
    deviation or by the max value in the respective vector.
    Parameters:        
        X: a numpy matrix with dataset of interest
        use_sigma: whether to normalize using sigma
        (True) or by the largest number in the feature 
        vector (False)
    Returns:
        X_normalized: X values normalized based off the 
        decided setting
    """
    m,features = X.shape
    #Mean of each feature column
    mu = sum(X)/m
    if use_sigma:
        #Standard deviation of each feature column
        divisor = np.sqrt(sum(np.square(X-mu))/m)
    else:
        #Largest value in each feature column
        divisor = np.max(X,axis=0)
    X_normalized = np.divide((X-mu),divisor)
    return X_normalized

X = normalize_values(X,use_sigma=False)

Now that my features are normalized, I can run the clustering algorithm. The first step is choosing the number of clusters and initializing them by picking random data points to start from. At this point I have over 6,000 songs, so I decided to go with 300 clusters to make sure I get a good number of playlists that aren't too long. I initialize the clusters by picking 300 different songs to use as starting points. When picking these songs, one of the most important things is to make sure you pick without replacement (i.e. you don't pick the same song as a starting point more than once).

In [16]:
def initialize_centroids(X, k):
    """
    Used to select a starting point for centroids. Algorithm
    selects k data points from X and returns the centroid values 
    and location in the index
    Parameters:
        X: a numpy matrix with the dataset of interest
        k: number of centroids to initialize
    Returns:
        centroids, cluster_index
    """
    m,features = X.shape
    initial_centroids = np.random.choice(range(0,m),k,replace=False)
    centroids = X[initial_centroids]
    cluster_index = np.array(range(0,k))
    
    return centroids, cluster_index

#define number of clusters and generate starting point:
k = 300
centroids, cluster_allocation = initialize_centroids(X, k)  

Now we can actually run the clustering process. I have two functions, one that calculates the next cluster location and one that calculates the cost function.

In [17]:
def generate_centroids(X, index_location_of_centroids, k):
    """
    Takes existing allocation of centroids and calculates the
    next centroid value. Specifically, it calculates the average
    location of all X values allocated to each centroid, and 
    returns that as the new centroid.
    Parameters:
        X: a numpy matrix of feature values
        index_location_of_centroids: an index of values that 
        indicates which centroid each row in X belongs to
        k: number of centroids
        
    Returns:
        cluster_each_x_belongs_to: the updated values 
        for index_location_of_centroids
        new_centroids: the new centroid values
        
    Note: clusters are assumed to be numbered 0 to k-1, so that
    centroid i always corresponds to cluster i (e.g. cluster 0's
    centroid is row 0 of the centroid array)
    """    
    #Generate variables:
    [m,features] = X.shape
    cluster_each_x_belongs_to = np.array([],dtype='int')
    new_centroids = np.zeros((k,features))
    
    for i in range(0,k):
        #get average "location" of each cluster
        new_centroids[i,:] = X[np.where(index_location_of_centroids==i),:].mean(axis=1)

    #Generate index value corresponding to the new centroid value
    for row in X:
        distance = np.sum(np.square(row-new_centroids),axis =1)
        #Get the index value of that closest value
        cluster = distance.argmin()
        #Take min distance and assign X value to that cluster
        cluster_each_x_belongs_to = np.append(cluster_each_x_belongs_to,cluster)
        
    return cluster_each_x_belongs_to, new_centroids

def k_means_cost(X,centroid_values,centroid_index_locations):
    """
    Calculates the cost function for the clustering results: the average
    squared distance between each song and its assigned centroid.
    """
    m,features = X.shape
    J = 1/m*np.square(X-centroid_values[centroid_index_locations]).sum()
    
    return J

J_list=[]    

#Run 50 iterations of the centroid update and reassignment steps,
#tracking the cost after each one
for i in range(0,50):
    cluster_allocation,centroids = generate_centroids(X,cluster_allocation,k)
    J = k_means_cost(X,centroids,cluster_allocation)
    J_list.append(J)
    
#Example: pull out the songs that ended up in cluster 1
group = name_matrix.loc[cluster_allocation==1]
#Attach the array containing each song's cluster grouping to the 
#original dataset 
name_matrix.insert(3, 'cluster_grouping', cluster_allocation)

We can plot how the cost function changed over each iteration to check that it's decreasing. As you can see, the cost falls consistently, which is a good sign the algorithm is working correctly.

In [18]:
plt.plot(J_list)
plt.xlabel('Number of iterations')
plt.ylabel('Cost')
Out[18]:
Text(0, 0.5, 'Cost')

And there you have it. Now I have a list of over 6,000 songs and a number indicating which grouping/playlist each one belongs to.

To start sorting through the features of interest, I summarized the groupings by their average features (e.g. average danceability, etc.). Feel free to sort through the summary table below to see what each playlist roughly looks like.

In [19]:
cluster_summary = name_matrix.groupby('cluster_grouping').mean().reset_index()
grid_widget = qgrid.show_grid(cluster_summary,show_toolbar=True)
display(grid_widget)
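
For example, if I'm in the mood for something danceable, the summary makes it easy to pick out the cluster with the highest average danceability and pull up its songs. A quick illustrative query using the tables above:

#Find the playlist whose songs are, on average, the most danceable
most_danceable = cluster_summary.sort_values('danceability', ascending=False).iloc[0]
dance_playlist = name_matrix[name_matrix['cluster_grouping'] == most_danceable['cluster_grouping']]
dance_playlist[['Song','Performer']].head(10)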

Now that you can see what each playlist roughly contains, you can sort through the main song list by playlist/cluster to see what songs are contained in each. You can also check out the full data set here.

In [20]:
name_matrix[['Song','Performer','cluster_grouping']]
Out[20]:
Song Performer cluster_grouping
0 Don't You Worry Child Swedish House Mafia Featuring John Martin 39
23 Don't Billy Currington 1
42 Don't Bryson Tiller 2
75 Don't Ed Sheeran 3
110 Donald Trump Mac Miller 4
113 Done For Me Charlie Puth Featuring Kehlani 5
121 DONE. The Band Perry 6
140 dontchange Musiq 7
165 Doo Wop (That Thing) Lauryn Hill 8
185 Dope Lady Gaga 9
186 Dope Tyga Featuring Rick Ross 10
195 Double Vision 3OH!3 11
197 Sure Be Cool If You Did Blake Shelton 12
198 Call On Me Janet & Nelly 13
199 The Baby Blake Shelton 14
200 Blackout Breathe Carolina 15
201 Born To Fly Sara Evans 69
202 Better Days Goo Goo Dolls 17
203 Already Gone Sugarland 249
204 Small Town USA Justin Moore 56
205 Rewind Rascal Flatts 20
206 Aw Naw Chris Young 83
207 All Over The Road Easton Corbin 22
208 Point At You Justin Moore 164
209 Anaconda Nicki Minaj 24
210 Perfect One Direction 25
211 The Jump Off Lil' Kim Featuring Mr. Cheeks 288
212 Sometimes Britney Spears 27
213 Smooth Criminal Alien Ant Farm 28
214 Right Thru Me Nicki Minaj 29
... ... ... ...
83892 You're Welcome Dwayne Johnson 45
83901 You've Got A Way Shania Twain 125
83913 You Chris Young 207
83932 You Jacquees 26
83935 You Jesse Powell 16
83954 You Lloyd Featuring Lil Wayne 123
83977 Young & Crazy Frankie Ballard 294
83991 Young & Gettin' It Meek Mill Featuring Kirko Bangz 137
83999 Young And Beautiful Lana Del Rey 8
84018 Young Dumb & Broke Khalid 85
84052 Young Girls Bruno Mars 4
84065 Young, Wild & Free Snoop Dogg & Wiz Khalifa Featuring Bruno Mars 143
84096 Young'n (Holla Back) Fabolous 242
84115 Youngblood 5 Seconds Of Summer 292
84143 Young Kenny Chesney 252
84162 Your Body Is A Wonderland John Mayer 256
84190 Your Body Christina Aguilera 191
84198 Your Body Pretty Ricky 113
84217 Your Everything Keith Urban 214
84232 Your Love Is My Drug Ke$ha 20
84257 Your Love Nicki Minaj 175
84276 Your Man Josh Turner 163
84294 Yours If You Want It Rascal Flatts 48
84303 Yours Russell Dickerson 56
84321 Youth Of The Nation P.O.D. 87
84339 Youth Troye Sivan 197
84355 Zack And Codeine Post Malone 292
84357 Zero Chris Brown 282
84358 ZEZE Kodak Black Featuring Travis Scott & Offset 37
84367 Zombie Bad Wolves 151

6275 rows × 3 columns

Now there's a database with a list of songs and the playlist number each has been allocated to. From here, it's easy to take the unique song IDs and create a playlist on Spotify. In fact, that's just what I've done. Below is a rough sketch of how the upload step can work, followed by a Spotify playlist created from one of these clusters.
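
This sketch uses the spotipy library; the credentials, redirect URI, playlist name and cluster number are placeholders rather than my actual setup, and it assumes you've authorized an account with the playlist-modify-public scope:

import spotipy
from spotipy.oauth2 import SpotifyOAuth

#User-level authorization is needed to create and modify playlists
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
        client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET',
        redirect_uri='http://localhost:8888/callback',
        scope='playlist-modify-public'))

#Grab the track IDs for one cluster and push them into a new playlist
cluster_number = 39  #example cluster
track_ids = name_matrix.loc[name_matrix['cluster_grouping'] == cluster_number,
                            'track_id'].tolist()
user_id = sp.current_user()['id']
playlist = sp.user_playlist_create(user_id, 'Cluster {}'.format(cluster_number))
#Spotify caps playlist additions at 100 tracks per request, so add in chunks
for start in range(0, len(track_ids), 100):
    sp.playlist_add_items(playlist['id'], track_ids[start:start + 100])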

In [17]:
IFrame(src='https://open.spotify.com/embed/playlist/44XNhDnuvKuw9DIM7GntNP', 
       width=700, height=600, allowtransparency="true", 
       allow="encrypted-media")
Out[17]:

I'm in the process of putting together an interactive app so you can create your own playlists just by inputting the code. If you're interested in that, keep your eyes peeled. I also built a simple function that does just that (with some modification).

Let me know what you think of these song groups. Do you think they make sense?
