I work a ton with data. It's fun. But sometimes I need data that's just not readily available. Sure, maybe it exists on a website somewhere, but who has the time to sit and download data over and over? I started developing web scraping scripts a few months back when I realized I could collect all the data I needed with minimal manual work. As I've learned more ways to download increasingly complicated data, I've also learned how to make the process much faster.
One of the best methods I've discovered for quick and easy scraping is using Pandas DataFrames. I used to write code that used regex to loop through HTML and build tables, but that process can take forever. In this post I'll show you how I scraped the Spotify Top 200 Charts to give you a quick run-through of how you can use Pandas to speed up your web scraping.
I'll be walking through a script I wrote while working on a project where a music streaming company was suing a musician. In order to calculate damages, we wanted to gather data on the artist's play counts so we could better quantify the value of being on the service.
You're only going to need three packages: Pandas, Requests, and datetime (which ships with Python's standard library). In case you're not familiar, Requests is really useful anytime you need to use Python to request information from a website. (See here for a useful guide.)
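To make the setup concrete, here's a rough sketch of how those three packages might fit together. The helper names (daily_dates, parse_csv, fetch_csv) and the idea of pulling one CSV per day are my own assumptions for illustration, not the exact script from the project, and any URL you pass in is a placeholder you'd have to supply yourself. It also leans on io from the standard library to hand downloaded text to Pandas.

```python
import io
from datetime import date, timedelta

import pandas as pd
import requests


def daily_dates(start, end):
    """Return ISO date strings (YYYY-MM-DD) from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def parse_csv(text):
    """Turn raw CSV text into a DataFrame, with no regex or HTML looping."""
    return pd.read_csv(io.StringIO(text))


def fetch_csv(url, timeout=30):
    """Download CSV text from a URL (hypothetical; supply your own) and parse it."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return parse_csv(response.text)
```

For example, daily_dates(date(2020, 1, 30), date(2020, 2, 1)) gives you one date string per day across the month boundary, which you could then drop into whatever URL pattern the site uses. Keeping the parsing in its own function means you can check it against a small string before you ever hit the network.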