Every data analyst knows that most of our time goes to cleaning and manipulating data rather than building visualizations. My most recent post, Icons - Mario: The Face of Video Games, was no exception. As part of that analysis, I had to extract every video game Mario has appeared in, so today I will show you a quick and easy way to do this using pandas.
The first thing to do is make sure you have the required libraries. This method relies on the lxml, BeautifulSoup, and html5lib libraries to parse the HTML page, so install them if you haven’t done so already.
pip install lxml beautifulsoup4 html5lib
Next, identify a website you want to extract data from. Let’s use the List of video games featuring Mario Wikipedia entry as an example.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_video_games_featuring_Mario'
tables = pd.read_html(url)
print(len(tables))
# CONTINUE YOUR ANALYSIS HERE
And that’s literally it.
The chunk of code above prints how many tables pandas was able to parse from the given URL. pd.read_html returns a list of DataFrames, one per table it finds on the page, so from there you can pick the entry you want by index and continue your analysis.
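To make the "pick the entry you want" step concrete, here is a minimal sketch. It assumes a tiny made-up HTML snippet (not the real Wikipedia page) so it runs offline, and the variable names games and platforms are just for illustration. It shows selecting a table by index, and narrowing the results with the match argument of pd.read_html.

```python
from io import StringIO

import pandas as pd

# Small stand-in page, made up for illustration so the example runs offline.
html = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Donkey Kong</td><td>1981</td></tr>
  <tr><td>Super Mario Bros.</td><td>1985</td></tr>
</table>
<table>
  <tr><th>Platform</th><th>Games</th></tr>
  <tr><td>NES</td><td>2</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it can parse.
tables = pd.read_html(StringIO(html))
print(len(tables))

# Pick a table by its position in the list and inspect it.
games = tables[0]
print(games.head())

# match keeps only the tables whose text matches the given string or regex.
platforms = pd.read_html(StringIO(html), match="Platform")[0]
print(platforms)
```

On a real page with dozens of tables, printing the head of a few entries (or using match with a word you know appears in the table) is usually the fastest way to find the one you need.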
Pretty neat, huh? I like to use this method whenever I’m parsing well-defined and structured data.
Keep in mind that some websites might return a 403 (Forbidden) error. That usually means the site is rejecting automated requests, so it’s probably better to use an API or another method.
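One such "other method", sketched here under assumptions, is to download the page yourself with an explicit User-Agent header and then hand the HTML text to pandas. The helper names fetch_html and read_tables and the User-Agent string are hypothetical, and whether this gets past a 403 depends entirely on the site, so treat it as something to try rather than a guaranteed fix.

```python
from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd


def fetch_html(url, user_agent="Mozilla/5.0 (compatible; table-reader-example)"):
    """Download a page with an explicit User-Agent header and return its text.

    The default user_agent value is a placeholder chosen for this example.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def read_tables(url, fetch=fetch_html):
    """Fetch the page ourselves, then let pandas parse the HTML text.

    fetch is injectable so the parsing step can be tried without a network.
    """
    return pd.read_html(StringIO(fetch(url)))
```

You would call it as read_tables(url) with the same url as before. If the site still refuses the request, an official API is the better route.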
Regardless, here are a couple websites you can try this technique on:
Happy analysis!
Thanks For Reading
What type of analysis are you currently working on? Drop a comment and let’s collaborate!