Following on from last week's post, where we analysed the amount of repeated letters within current New Zealand town names, there was still one part of that analysis that really bugged me: the data set we used contained the European town names rather than the original Maori names. This post is dedicated to introducing web scraping; we will extract the Maori names, run a similar analysis and present an interactive graph.
As before, let's take a look at the interactive graph before getting into how it was created.
```python
from bokeh.resources import CDN
from bokeh.embed import file_html
from IPython.core.display import HTML

# Render the Bokeh figure p as standalone HTML and display it inline
html = file_html(p, CDN, "NZ_City_Letter_Analysis")
HTML(html)
```
As with most of my posts of this nature, we begin by getting the data. Finding a data set with as many Maori town or place names as possible proved to be quite challenging, but luckily, for Maori Language Week, NZhistory.govt.nz posted a table of 1000 Maori place names, their components and their meanings. The data can be found at: https://nzhistory.govt.nz/culture/maori-language-week/1000-maori-place-names.
Unlike last time, however, when our world city names came from Kaggle, this data isn't nicely supplied to us in an Excel format. While it may be possible to copy-paste directly from the website into a spreadsheet, I think this is a great way to ease into web scraping.
What is Web Scraping?
Web scraping (also called web harvesting or web data extraction) is the process of extracting data from websites. There are multiple ways to achieve this in Python (requests + Beautiful Soup, Selenium, etc.), but my personal favourite package is Scrapy. It may be daunting to begin with if you're coming from a non object-oriented background, but you will soon appreciate it once you've begun using it.
The premise of the Scrapy package is that you create 'web spiders'. If we take a look at the structure of the first example on the Scrapy website, we get an understanding of how to structure our own spiders:
Following on, we can see a function for parsing (also specially named parse) containing two loops: the first loops through all titles marked up as headers (specifically h2) and yields a dictionary with the heading text, while the second follows the link to the next page so the crawl continues.
Now that we have created our spider, which works through each row of the table on the webpage (more information on determining the right selectors can be found at: https://docs.scrapy.org/en/latest/intro/tutorial.html), it's time to run it and take a look at the output. To run a spider, go into the project directory from the command line and run 'scrapy crawl \<spider name>'; to store the output at the same time, run 'scrapy crawl \<spider name> -o filename.csv -t csv'.
Now, as in the previous post, we run a similar analysis and plot it with Bokeh!
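A condensed sketch of that analysis, assuming the scraped CSV has a name column (a few hard-coded names stand in for the full data set here, and the repeated-letter metric is one plausible reading of last week's approach):

```python
from collections import Counter

import pandas as pd
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Stand-in sample; in practice this would be the scraped file,
# e.g. df = pd.read_csv('placenames.csv')
df = pd.DataFrame({'name': ['Whanganui', 'Rotorua', 'Taupo', 'Kaikoura']})


def max_letter_repeats(name):
    # Highest number of times any single letter appears in the name
    letters = [ch for ch in name.lower() if ch.isalpha()]
    return max(Counter(letters).values()) if letters else 0


df['repeats'] = df['name'].apply(max_letter_repeats)
counts = df['repeats'].value_counts().sort_index()

p = figure(title="Repeated letters in Maori place names",
           x_axis_label="Most repeats of a single letter",
           y_axis_label="Number of place names")
p.vbar(x=counts.index.tolist(), top=counts.values.tolist(), width=0.8)

# Embed as standalone HTML, as with the interactive graph above
html = file_html(p, CDN, "NZ_City_Letter_Analysis")
```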