Webscraping Day 3 of 3
Day 48 of 100
Webscraping is a powerful and useful tool for Python applications. There is a lot of great data and information on the internet that is not accessible by API or available for download in a usable form. Instead, website data can be gleaned by downloading the html content and then using a tool like Beautiful Soup 4 to parse and organize the data. A little knowledge about the structure of html paired with Beautiful Soup 4 makes gathering data from the web accessible to novice programmers like myself.
For this code challenge, I took the opportunity to make managing a neighborhood sports pool easier. I have been using an elaborate system of Google Sheets, formulas, and mostly unreliable copying and pasting of html tables into sheets to score the pool weekly. My goal is to migrate the scoring of the pool into a Python application and maybe even a web application. Achieving that goal will take more time, but as with every big project, you have to start somewhere. One thing the application will have to do is pull the scores from a website, so it fits nicely into the webscraping topic of the 100 days of code challenge I am trying to complete.
The webscraping function that I wrote for this challenge converts a generic html table to a list and then writes that list to a csv, which proved to be a challenge for this novice programmer. The easiest webscraping operations take place when the portion of the page you are targeting is easily identifiable by css classes or tags that make searching easier. For this scraper function, there is no uniquely identifiable information for the table data I am trying to scrape, but it is the only table on the page.
The first part of the webscraping process is to get the raw site data. This is easy using the requests package. Simply passing the URL to requests.get() will download the site into a response object, which can then be passed to Beautiful Soup 4 for parsing and analysis.
Requests Best Practices
For the script and application I am working on, requesting the data from the site only happens once or twice. Hammering sites with requests to pull data can get you banned from a site and cause headaches for the sites that are providing the data. It is best to put the request into a separate script, run it on a cronjob only when necessary, and save the data so the Beautiful Soup 4 analysis can be run against the local copy.
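As a rough sketch of that fetch-once-and-save approach, something like the script below could work. The file name, URL, and cache path are placeholders I made up for illustration, not the ones from my application.
fetch_results.py
import requests

RESULTS_URL = "https://example.com/results"  # hypothetical results page

def fetch_and_cache(url, cache_path="data/raw_results.html"):
    # assumes the data/ directory already exists
    response = requests.get(url)
    response.raise_for_status()
    # save the raw html so the parsing step can be re-run against the local
    # copy as many times as needed without hitting the site again
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(response.text)

if __name__ == "__main__":
    fetch_and_cache(RESULTS_URL)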
For this script and application, the request is only made once or twice, so I have included it inline and it runs each time the script is run. In this function, after the request is made, .raise_for_status() is called on the result to check for success.
app.py
import requests
...
def pull_site(url):
    raw_site = requests.get(url)
    try:
        # raises an HTTPError if the response status indicates failure
        raw_site.raise_for_status()
    except requests.exceptions.HTTPError:
        print(
            "There was an error trying to retrieve the data from the results page.")
        return None
    return raw_site
...
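A caller might use it along these lines; the URL here is a made-up placeholder, since the real one comes from elsewhere in my application.
raw_site = pull_site("https://example.com/race/12345/results")  # hypothetical URL
if raw_site is None:
    raise SystemExit(1)  # stop if the request failed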
Once the content from the page has been saved using requests, it is passed to a function that looks for a table and returns a list of lists, with each inner list being one row from the table. First, create an empty list to store the rows from the table. Then, pass the raw site data that was received from requests to the bs4 package to create a BeautifulSoup object for analysis. On the site that I am targeting, there is only one table on the page, so I can get away with soup.find_all("tr") to get all the rows from the table. If there were more than one table on the site, I would have needed to target the correct table to get the data I am looking for (a sketch of how that could look follows the code block below). Running for row in table[1:]: loops through the rows in the table so I can grab the text from the <td> tags, which is ultimately what I am after. There is one unique aspect of the table on the site I am scraping: the first row is not the header row but a strange title row, so I hard coded table[1:] into the for loop; it skips the title row, and the first row it processes is the one with the table headers in it. Finally, Beautiful Soup 4 is used again, with row.find_all("td"), to pull the <td> tags into a list. From there, I can finally grab the text from each <td> tag, append it to td_list, and then append td_list to data_list, thus adding each row of data in the table as a list of strings. The function then returns the list of lists that I can use later or write to a csv to save for later.
app.py
import bs4
def scrape(raw_site):
    data_list = []
    soup = bs4.BeautifulSoup(raw_site.text, 'html.parser')
    # get all the rows from the only table on the page
    table = soup.find_all("tr")
    # skip the first row, which is a title row rather than the header row
    for row in table[1:]:
        td_tags = row.find_all("td")
        td_list = []
        for tag in td_tags:
            td_list.append(tag.get_text())
        data_list.append(td_list)
    return data_list
...
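If the page did have more than one table, the right one would need to be isolated before pulling its rows. A minimal sketch of how that could look is below; the "results" class name is an assumption for illustration only, since the table I actually scrape has no identifying class or id.
# rough sketch of targeting one specific table when a page has several
soup = bs4.BeautifulSoup(raw_site.text, 'html.parser')
results_table = soup.find("table", class_="results")  # hypothetical class name
rows = results_table.find_all("tr") if results_table else []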
Finally, I wrote a function to save the list of table data to a csv. That way I have a record of the data for later and can use it to run the scoring scripts against within the larger application. There is a little more to it than just saving the list to a csv: a conditional checks for the presence of a file in the data/ directory that matches the name of the file that will be created for this data. If the file exists, the user is prompted to decide whether they want to quit the operation or overwrite the data. If the file does not exist, there is no prompt to the user, simply a message that the operation was successful.
app.py
import csv
import os
...
def write_results_to_csv(data, id_num):
    # check to see if a results file exists already
    if os.path.isfile(f"data/{id_num}_results.csv"):
        overwrite = None
        while not overwrite:
            overwrite = input(
                f"The file {id_num}_results.csv already exists, do you want to overwrite? [Y/N]: ")
            if overwrite.lower() == 'y':
                with open(f"data/{id_num}_results.csv", "w", newline="") as f:
                    writer = csv.writer(f)
                    writer.writerows(data)
                print(
                    f"Results successfully written to data/{id_num}_results.csv")
            elif overwrite.lower() == 'n':
                print("Please rename the results file or choose a different race and run again.")
                raise SystemExit(0)
            elif overwrite.lower() == 'q':
                print("Good bye.")
                raise SystemExit(0)
            else:
                print(f"[{overwrite}] is not a valid response. Please enter Y or N or [Q]uit to continue: ")
                overwrite = None
    else:
        with open(f"data/{id_num}_results.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(data)
        print(f"Results successfully written to data/{id_num}_results.csv")
Webscraping is a powerful tool, and Beautiful Soup 4 is a package that transforms the task of making the data usable into something even a novice programmer can understand. With an easy way to get the scores and results, I can now proceed with the development of the larger application, making my life even easier in the process.