Webscraping Made Easy With Pandas

I work a ton with data. It's fun. But sometimes I need data that's just not readily available. Sure, maybe it exists on a website somewhere, but who has the time to sit and download data over and over. I started developing webscraping scripts a few months back when I realized I could collect all the data I needed with minimal manual work. As I've learned more and more ways to download increasingly complicated data, I've also learned how to make the process much faster.

One of the best methods I've discovered for quick and easy scraping is using Pandas DataFrames. I used to write code that used regex to loop through HTML and build tables, but that process can take forever. In this blog I'll show you how I scraped the Spotify Top 200 Charts in order to give you a quick run through of how you can use Pandas to speed up your webscraping.

I'll be walking through a script I wrote while working on a project where a music streaming company was suing a musician. In order to calculate damages we wanted to gather data on the artists playcounts so we could better quantify the value of being on the service.

You're only going to need three packages: Pandas, Requests, and Datetime. In case you're not familiar, Requests is really useful anytime you need to use Python to request information from a website. (See here for a useful guide).

In [1]:

import pandas as pd
import requests
from datetime import timedelta
pd.set_option('display.max_rows', 20)

We can start by opening the link to the spotify charts for October 13, 2019 with the Requests package and getting an idea of what the HTML code looks like. Once you've opened the link, you can use the responses text attribute to view the HTML.

In [2]:

LINK = 'https://spotifycharts.com/regional/us/daily/2019-10-13'
response = requests.get(LINK)
response.text[0:2000]

Out[2]:

'<!doctype html>\n<html>\n    <head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n        <title>Spotify Charts</title>\n\n        <meta property="og:title" content="Spotify Charts">\n        <meta property="og:type" content="website">\n        <meta property="og:url" content="https://www.spotifycharts.com/">\n        <meta property="og:image" content="http://d2c87l0yth4zbw-2.global.ssl.fastly.net/i/_global/open-graph-default.png">\n        <meta property="og:description" content="Spotify is all the music you’ll ever need.">\n\n        <meta name="description" content="Spotify is all the music you’ll ever need.">\n        <meta name="viewport" content="width=device-width, initial-scale=1.0">\n        <link rel="apple-touch-icon-precomposed" sizes="144x144" href="/images/touch-icon-144.png">\n        <link rel="apple-touch-icon-precomposed" sizes="114x114" href="/images/touch-icon-114.png">\n        <link rel="apple-touch-icon-precomposed" sizes="72x72" href="/images/touch-icon-72.png">\n        <link rel="apple-touch-icon-precomposed" href="/images/touch-icon-57.png">\n        <link rel="stylesheet" type="text/css" href="/css/app.css">\n        <script>\n            window.log = function () {\n                log.history = log.history || [];\n                log.history.push(arguments);\n                if (this.console) {\n                    console.log(Array.prototype.slice.call(arguments))\n                }\n            };\n        </script>\n        <script>\n            var ROUTE_PARAMS = {type: \'regional\', country: \'us\', recurrence: \'daily\', date: \'2019-10-13\'};\n        </script>\n    </head>\n    <body>\n        <!--[if lt IE 9]>\n          <p class="browsehappy">You are using an <strong>outdated</strong> browser. Please <a href="http://browsehappy.com/">upgrade your browser</a> to improve your experience.</p>\n        <![endif]-->\n\n            <div id="content">\n        <div>\n            <header class="header">\n                <div '

The HTML alone isn't really that useful, except maybe to show that we're able to download the text. What we want to be able to do is download the table embedded in the HTML that has artist, song, and play count information that we can see below. You can check the HTML code on the website to confirm existence of the table tag. Pandas will look for that tag and will pull the HTML code underneath it into table form.

Luckily, now that we have the HTML, it's really easy to pull this into a DataFrame. All we need to do is run the HTML text through the Pandas read_html function and we will have a table with all the information we need. It's important to convert the response output (or whichever program you're using to scrape the webpage) to readable HTML first.

In [3]:

df = pd.read_html(response.text)
df

Out[3]:

[     Unnamed: 0  Unnamed: 1  Unnamed: 2  \
 0           NaN           1         NaN   
 1           NaN           2         NaN   
 2           NaN           3         NaN   
 3           NaN           4         NaN   
 4           NaN           5         NaN   
 5           NaN           6         NaN   
 6           NaN           7         NaN   
 7           NaN           8         NaN   
 8           NaN           9         NaN   
 9           NaN          10         NaN   
 ..          ...         ...         ...   
 190         NaN         191         NaN   
 191         NaN         192         NaN   
 192         NaN         193         NaN   
 193         NaN         194         NaN   
 194         NaN         195         NaN   
 195         NaN         196         NaN   
 196         NaN         197         NaN   
 197         NaN         198         NaN   
 198         NaN         199         NaN   
 199         NaN         200         NaN   
 
                                                  Track  Streams  
 0                 HIGHEST IN THE ROOM  by Travis Scott  1611989  
 1                              Circles  by Post Malone  1189990  
 2    Bandit (with YoungBoy Never Broke Again)  by J...  1174811  
 3                           Lights Up  by Harry Styles  1089354  
 4     10,000 Hours (with Justin Bieber)  by Dan + Shay   918608  
 5                                 Ransom  by Lil Tecca   859449  
 6                223's (feat. 9lokknine)  by YNW Melly   857946  
 7                                Truth Hurts  by Lizzo   846531  
 8                        hot girl bummer  by blackbear   802471  
 9                     Hot (feat. Gunna)  by Young Thug   754197  
 ..                                                 ...      ...  
 190                       ocean eyes  by Billie Eilish   191494  
 191                              I Like It  by Cardi B   190787  
 192  Costa Rica (with Bas & JID feat. Guapdad 4000,...   190266  
 193                      Only Human  by Jonas Brothers   190134  
 194             Post Malone (feat. RANI)  by Sam Feldt   189754  
 195                                Location  by Khalid   189207  
 196  Mixed Personalities (feat. Kanye West)  by YNW...   188963  
 197         Ric Flair Drip (& Metro Boomin)  by Offset   188772  
 198  Meant to Be (feat. Florida Georgia Line)  by B...   188597  
 199     Routine (feat. Rick Ross & Meek Mill)  by Wale   187964  
 
 [200 rows x 5 columns]]

Note that the output is actually in list form. Often it's the case that Pandas will read in more than one table from the HTML code so you need to think about which table you want and then how to pull that table from the list and/or how to save each table to individual variables. Here it's easy, we can just take the 0th value from the list.

In [4]:

df = df[0]

Now that we have the data in a DataFrame we need to clean it up. What I want to do is have track title and artist in seperate columns, drop empty columns, and properly name all remaining columns. We should also add a date column so when we look at the table again we know what date it's for. Fortunately, there's not much more cleaning required than that.

In [5]:

df.dropna(how='all', inplace=True, axis=1)
trackname_artist = df['Track'].str.split(' by ', expand = True)
df['Song'] = trackname_artist[0]
df['Artist'] = trackname_artist[1]
df.rename(columns ={'Unnamed: 1': 'Position',
								'Track': 'Full Name'}, inplace = True)
df['Date'] = '2019-10-13'
df.head()

Out[5]:

	Position	Full Name	Streams	Song	Artist	Date
0	1	HIGHEST IN THE ROOM by Travis Scott	1611989	HIGHEST IN THE ROOM	Travis Scott	2019-10-13
1	2	Circles by Post Malone	1189990	Circles	Post Malone	2019-10-13
2	3	Bandit (with YoungBoy Never Broke Again) by J...	1174811	Bandit (with YoungBoy Never Broke Again)	Juice WRLD	2019-10-13
3	4	Lights Up by Harry Styles	1089354	Lights Up	Harry Styles	2019-10-13
4	5	10,000 Hours (with Justin Bieber) by Dan + Shay	918608	10,000 Hours (with Justin Bieber)	Dan + Shay	2019-10-13

Putting it all together:¶

Having one day of data may not be enough, and in the case of my project, we needed 40 days worth. The easiest way to get 40 days of data (which requires scraping 40 different webpages) is to make a flexible function and loop through it, which I've done below. If you do want to scrape the top 200, this function will work well. You just need to specify the date you need the data and which region you want the data for.

In [6]:

def get_daily_chart_data(date, region ='us'):
	"""
	For a given region and given date, pulls that days top 200 songs 
	and number of plays
	date: date in "yyyy-mm-dd" format
	region: any that shows up in link when changed here: https://spotifycharts.com/regional
	"""
	LINK = 'https://spotifycharts.com/regional/%s/daily/%s' %(region, date)
	response = requests.get(LINK)
	response = response.text
	table_data = pd.read_html(response)[0]
	table_data.dropna(how='all', inplace=True, axis=1)
	trackname_artist = table_data['Track'].str.split(' by ', expand = True)
	table_data['Song'] = trackname_artist[0]
	table_data['Artist'] = trackname_artist[1]
	table_data.rename(columns ={'Unnamed: 1': 'position',
								'Track': 'Full Track Name'}, inplace = True)
	table_data['Date'] = date
	return table_data

Finally, we can define a few variables and run the loop. Looping through dates can be a bit weird because you should be starting with datetime formatted dates so you can easily increase the date value but then you need to convert back to string so you can use the date in the function.

In [7]:

START_DATE = '2018-10-01'
START_DATE = pd.to_datetime(START_DATE)
NUMBER_OF_DAYS = 10

In [8]:

if __name__ == "__main__":
	songs = pd.DataFrame()
	for i in range(0, NUMBER_OF_DAYS):
		date = START_DATE.date().strftime('%Y-%m-%d')
		temp = get_daily_chart_data(date)
		songs = pd.concat([songs, temp])
		START_DATE = START_DATE + timedelta(days=1)
		
	songs.reset_index(drop=True, inplace=True)

We now have 10 days of Spotify top 200 data that we can play around with. For example, we could look at which artist had the most number of singles during that time period.

In [9]:

(songs.groupby(by= ['Artist', 'Song']).size().reset_index()
 .groupby(by='Artist')['Song'].count().reset_index()
 .sort_values('Song', ascending = False)).head(10)

Out[9]:

	Artist	Song
64	Lil Wayne	23
58	Lil Baby	15
101	Twenty One Pilots	14
66	Logic	12
105	XXXTENTACION	10
29	Eminem	8
82	Post Malone	8
24	Drake	7
46	Juice WRLD	6
99	Travis Scott	6

The top artist during that time period was Lil Wayne with 23 (!!!) singles in the top 200 over the selected 10 days. This isn't a huge surprise since Lil Wayne released his album Tha Carter V at the end of September. What is a surprise is how the top 10 artists with the most different songs on the top 200 are hip-hop(ish). The rise of hip-hop is definitely an area I've been interested in and will explore more in the future.

We can also see what songs were the most streamed in this time period.

In [10]:

songs.groupby(by ='Song')['Streams'].sum().reset_index().sort_values(by = 'Streams', ascending=False).head(10)

Out[10]:

	Song	Streams
128	Mona Lisa (feat. Kendrick Lamar)	15891537
182	SICKO MODE	13861692
126	Mo Bamba	12250699
84	I Love It (& Lil Pump)	12091761
61	Falling Down - Bonus Track	11666267
47	Don't Cry (feat. XXXTENTACION)	11386908
118	Lucid Dreams	11324346
72	Happier	11094688
204	Taki Taki (with Selena Gomez, Ozuna & Cardi B)	10696516
226	Uproar	9975512

If there's one thing to take away from this it's that the top 10 most streamed songs during this time period were BANGERS.

We could even make a quick chart to see how many streams the top 200 songs received on any given day. October 5th is a Friday if you're wondering why there's a big spike on that day.

In [13]:

chart = songs.groupby(by ='Date')['Streams'].sum().reset_index().plot.bar(x = 'Date', y='Streams')

Hopefully by now you have a good idea of how you can get started with webscraping. It's much less daunting than I initially thought and hopefully you feel the same way.

Since learning webscraping I've been tasked with scraping websites that really don't want me to scrape. These websites use all sorts of tricks to try and stop me, but with a few lines of code, I've learned to trick these website. I've also learned a lot about how to amp up webscraping capabilities by using browser automation which will be a topic for next time.

There are comments.

blogroll

social