import requests
from bs4 import BeautifulSoup
import pandas as pd
Intro
As many of my friends and colleagues know, I’m a prospective pet owner who’s hoping to adopt a cat (or two…) in the next few months.
As a result, I frequently check on the cats available through the Animal Welfare League of Arlington, partly because they’re just cute to look at, and partly to see whether any cats I have my eye on are still there, or whether there are new ones to admire.
Being the data nerd I am, I immediately began wondering if I could write a script that notified me whenever a new cat popped up on the site. Then I started wondering if I could take that a step further and analyze the cats posted on the site: how long they were listed, which ages tend to get adopted, and so on.
This post was born out of that idea, as well as the idea that anything can be data!
Walkthrough of scraping this data
Confirm that you’re allowed to scrape the site!
It’s best practice to confirm that you’re allowed to scrape a site before you do so by checking its robots.txt.¹
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
In this case, we’re good to go as long as we aren’t touching anything related to the WordPress admin.
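The standard library can even do this check for us. Here’s a minimal sketch using urllib.robotparser, seeded with the rules shown above; in a live script you’d point it at the site’s robots.txt with set_url() and read() instead:

```python
from urllib import robotparser

# Feed the parser the same rules we saw in the site's robots.txt
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Allow: /wp-admin/admin-ajax.php",
])

# The adoption listings aren't disallowed for generic crawlers
print(rp.can_fetch("*", "https://www.awla.org/adopt/cats/"))  # True
print(rp.can_fetch("*", "https://www.awla.org/wp-admin/"))    # False
```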
Getting started
We’ll need three libraries to get started: requests, bs4, and pandas.
If you look at other bs4 tutorials², they all start by parsing the HTML of the site of interest. So we make a request to the site and then parse the response with BeautifulSoup:
url = "https://www.awla.org/adopt/cats/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
Figuring out what exactly to scrape
This is where having some domain knowledge comes in handy, and I certainly have a lot of it when it comes to cats and what to look for 😄. For example, there are some specific things about each cat that I already know I want. Specifically, their:
- name
- age
- gender
- breed
- whether they’re at the shelter/in foster
- a link to their page
- a link to their picture (if available)
- a description of them
- whether they’re fixed
- color
- whether they’re a part of a bonded pair
Knowing that this is the information we’re looking to collect, we can set up a dictionary to hold it. I’m also going to set a timestamp so we know when we pulled everything:
all_the_cats = {
    "name": [],
    "age": [],
    "gender": [],
    "breed": [],
    "location": [],
    "link": [],
    "image_link": [],
    "description": [],
    "fixed": [],
    "color": [],
    "bonded_pair": [],
    "snapshot_at": []
}

snapshot_timestamp = pd.Timestamp.now()  # so all the data will have the same timestamp from when we run our script
Finding that information on the page
We need to figure out where exactly this information lives in the HTML we’ve parsed. Generally, I think it’s easiest to do this by inspecting the site’s code and determining where the elements of interest live.
From inspecting the HTML of the site, I determined that most of the information about the cats lives in a div with the class pet-grid-item-main.
cats = soup.find_all("div", class_="pet-grid-item-main")
Grab all the information
Within this class, we still have to determine which elements to pull out in order to form a more structured dataset.
We can take a look at the first div we stored in cats to get an idea of what we’re looking at:

cats[0]
<div class="pet-grid-item-main">
<div class="fl-post-image pet-grid-image">
<a href="https://www.awla.org/pet/jake-from-state-farm/" title="Jake From State Farm"><img alt="" class="wp-post-image" decoding="async" height="466" itemprop="image" loading="lazy" sizes="(max-width: 420px) 100vw, 420px" src="https://www.awla.org/wp-content/uploads/2023/03/719d1823-b4da-4ef2-b747-85b15a60bd74.jpg" srcset="https://www.awla.org/wp-content/uploads/2023/03/719d1823-b4da-4ef2-b747-85b15a60bd74.jpg 420w, https://www.awla.org/wp-content/uploads/2023/03/719d1823-b4da-4ef2-b747-85b15a60bd74-270x300.jpg 270w" width="420"/></a>
</div>
<div class="fl-post-content pet-grid-overlay">
<div class="pet-grid-overlay-content">
<h2 class="fl-post-title sleeve"><a href="https://www.awla.org/pet/jake-from-state-farm/" title="Jake From State Farm">Jake From State Farm</a></h2>
<div class="pet-grid-breed">Domestic Shorthair</div>
</div>
<div class="pet-grid-meta-group">
<span class="pet-grid-meta">Young</span>
<span class="pet-grid-meta">Male</span>
</div>
</div>
<ul class="fl-pet-location"><li><span class="in-foster">In Foster</span></li></ul>
<a class="pet-overlay-link" href="https://www.awla.org/pet/jake-from-state-farm/"></a>
</div>
This makes it clear how these objects are structured and what information they contain. For example, the name of the cat is in the title attribute of the <a> element. We can pull this out using find:
print(cats[0].find("a", href = True).get("title"))
Jake From State Farm
We can determine the same for some of the other information of interest. The only tricky piece is that pet-grid-meta requires find_all, since it contains both the cat’s age and gender.
for cat in cats:
    all_the_cats["name"].append(cat.find("a", href=True).get("title"))
    all_the_cats["age"].append(cat.find_all("span", class_="pet-grid-meta")[0].text)
    all_the_cats["gender"].append(cat.find_all("span", class_="pet-grid-meta")[1].text)
    all_the_cats["breed"].append(cat.find("div", class_="pet-grid-breed").text)
    all_the_cats["location"].append(cat.find("ul", class_="fl-pet-location").text)
    all_the_cats["link"].append(cat.find("a", href=True).get("href"))
    all_the_cats["image_link"].append(cat.find("img").get("src"))

all_the_cats
{'name': ['Jake From State Farm', 'Mia', 'Sherry'],
'age': ['Young', 'Adult', 'Adult'],
'gender': ['Male', 'Female', 'Female'],
'breed': ['Domestic Shorthair', 'Domestic Shorthair', 'Domestic Shorthair'],
'location': ['In Foster', 'In Foster', 'In Foster'],
'link': ['https://www.awla.org/pet/jake-from-state-farm/',
'https://www.awla.org/pet/mia/',
'https://www.awla.org/pet/sherry/'],
'image_link': ['https://www.awla.org/wp-content/uploads/2023/03/719d1823-b4da-4ef2-b747-85b15a60bd74.jpg',
'https://www.awla.org/wp-content/uploads/2023/03/6f7216e9-2790-45ad-9c1f-75a679b9bbda.jpg',
'https://www.awla.org/wp-content/uploads/2023/02/fd6bbff3-b65c-4cc4-b133-c55951c6d9e2.jpg'],
'description': [],
'fixed': [],
'color': [],
'bonded_pair': [],
'snapshot_at': []}
Notice that we’re still missing some stuff, like each cat’s description, color, etc. This information is contained on each cat’s individual page, which is found via the link that we’ve pulled above. That means we’ll have to parse out the remaining information from these pages separately.
An alternative way of grabbing information
If I were interested only in the information on the first page, the script could be much simpler. Instead of explicitly defining the dictionary upfront, we could do something like this:
data = []

for cat in cats:
    cats_example_dict = {}
    cats_example_dict["name"] = cat.find("a", href=True).get("title")
    cats_example_dict["age"] = cat.find_all("span", class_="pet-grid-meta")[0].text
    cats_example_dict["gender"] = cat.find_all("span", class_="pet-grid-meta")[1].text
    cats_example_dict["breed"] = cat.find("div", class_="pet-grid-breed").text
    cats_example_dict["location"] = cat.find("ul", class_="fl-pet-location").text
    cats_example_dict["link"] = cat.find("a", href=True).get("href")
    cats_example_dict["image_link"] = cat.find("img").get("src")
    data.append(cats_example_dict)

pd.DataFrame(data)
|   | name | age | gender | breed | location | link | image_link |
|---|------|-----|--------|-------|----------|------|------------|
| 0 | Jake From State Farm | Young | Male | Domestic Shorthair | In Foster | https://www.awla.org/pet/jake-from-state-farm/ | https://www.awla.org/wp-content/uploads/2023/0... |
| 1 | Mia | Adult | Female | Domestic Shorthair | In Foster | https://www.awla.org/pet/mia/ | https://www.awla.org/wp-content/uploads/2023/0... |
| 2 | Sherry | Adult | Female | Domestic Shorthair | In Foster | https://www.awla.org/pet/sherry/ | https://www.awla.org/wp-content/uploads/2023/0... |
Getting the rest of the information we wanted
We can follow the same process as before to grab the additional information from each cat’s page. We can do this programmatically by looping through each cat’s link and creating a list of BeautifulSoup objects.
description_pages = []

for link in all_the_cats["link"]:
    response = requests.get(link)
    description_pages.append(BeautifulSoup(response.content, "html.parser"))
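One courtesy worth building into a loop like this is a pause between requests so we don’t hammer the shelter’s server. A small wrapper can enforce a minimum interval between calls; polite is a name I’m making up here, and it’s just a sketch:

```python
import time

def polite(fetch, min_interval=1.0):
    """Wrap a function so consecutive calls are at least min_interval seconds apart."""
    last_call = [0.0]  # mutable cell so the wrapper can update it

    def wrapper(*args, **kwargs):
        wait = min_interval - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return fetch(*args, **kwargs)

    return wrapper

# usage with the loop above:
# get = polite(requests.get, min_interval=2.0)
# response = get(link)
```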
Now we’ve collected all of the cats’ description pages into an object we can iterate through. We can append these to the dictionary elements that we created earlier.
for page in description_pages:
    all_the_cats["fixed"].append(page.find_all("div", class_="pet-meta-info")[6].text.replace("\n", "").replace("Fixed", ""))
    all_the_cats["color"].append(page.find_all("div", class_="pet-meta-info")[4].text.replace("\n", "").replace("Color", ""))
    ## get the description, use None if it's missing
    if page.find("meta", property="og:description") is None:
        all_the_cats["description"].append(None)
    else:
        all_the_cats["description"].append(page.find("meta", property="og:description").get("content"))
    ## check if the cat is part of a bonded pair
    if page.find("div", class_="fl-col fl-post-column buddy-col") is None:
        all_the_cats["bonded_pair"].append(False)
    else:
        all_the_cats["bonded_pair"].append(True)
    ## add the timestamp
    all_the_cats["snapshot_at"].append(snapshot_timestamp)
Then we can easily create a pandas dataframe from the results of our scraping.
cats_df = pd.DataFrame(all_the_cats)
cats_df
|   | name | age | gender | breed | location | link | image_link | description | fixed | color | bonded_pair | snapshot_at |
|---|------|-----|--------|-------|----------|------|------------|-------------|-------|-------|-------------|-------------|
| 0 | Jake From State Farm | Young | Male | Domestic Shorthair | In Foster | https://www.awla.org/pet/jake-from-state-farm/ | https://www.awla.org/wp-content/uploads/2023/0... | Like a good neighbor, Jake From State Farm is ... | Yes | Brown | False | 2023-03-14 16:08:10.421041 |
| 1 | Mia | Adult | Female | Domestic Shorthair | In Foster | https://www.awla.org/pet/mia/ | https://www.awla.org/wp-content/uploads/2023/0... | Mia is a beautiful dilute tortie whose petite ... | No | Bronze | False | 2023-03-14 16:08:10.421041 |
| 2 | Sherry | Adult | Female | Domestic Shorthair | In Foster | https://www.awla.org/pet/sherry/ | https://www.awla.org/wp-content/uploads/2023/0... | When Sherry isn't finding the coziest hiding s... | Yes | Black | False | 2023-03-14 16:08:10.421041 |
Bonus step - storing our scraped data
In the real world, we rarely scrape data as a one-off report and do nothing with it. Usually we take the results of our work and store them somewhere.
There are a few options for doing this depending on your tech stack. For example, you could save the raw dictionaries to a data lake (like S3) as JSON and parse them later into tables for analysis.
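For example, here’s a rough sketch of the JSON route, writing to a local folder. dump_snapshot is a name I’m making up; for S3 you’d swap the file write for a client like boto3:

```python
import json
from pathlib import Path
from datetime import datetime, timezone

def dump_snapshot(data, out_dir="snapshots"):
    """Write one scrape's raw dictionary to a timestamped JSON file."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out / f"cats_{stamp}.json"
    # default=str handles non-JSON types like pandas Timestamps
    path.write_text(json.dumps(data, default=str, indent=2))
    return path
```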
Or, since in our case we’re dealing with a pretty small amount of data, you could push it straight to a SQL database. I chose to do this since it’s the simplest solution for me.
We can write a pretty simple, extendable function to do this for us using pandas. In my case, since I’m using a local PostgreSQL server, that looks something like this:
from sqlalchemy import create_engine

def write_table(pd_table, table="", user_name="", password=""):
    conn_string = f"postgresql://localhost/postgres?user={user_name}&password={password}"
    conn = create_engine(conn_string)

    pd_table.to_sql(
        table,
        schema="misc",
        con=conn,
        if_exists="append",
        index=False
    )
from os import getenv  # use environment variables for usernames/passwords!

write_table(
    cats_df,
    table="awla_cat_snapshots",
    user_name=getenv("PG_USER"),
    password=getenv("PG_PASS")
)
Since I’d been scraping this page for a few days before this write-up, I can answer questions with SQL like: which cats on the site are new? In other words, which cats have appeared on the site only once since I started collecting data?
WITH rn_table AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY link ORDER BY snapshot_at ASC) AS n_appearances
FROM misc.awla_cat_snapshots acs
)
SELECT
DISTINCT name
FROM rn_table
WHERE n_appearances = 1
AND DATE(snapshot_at) = current_date
Mia
As of the day I wrote this post, Mia is the newest cat on the site.
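If you’d rather answer the same question in pandas, before the data ever reaches the database, the equivalent logic is “cats whose first snapshot is the latest snapshot.” A sketch with made-up data (the links here are shortened placeholders, not the real URLs):

```python
import pandas as pd

# toy snapshots table: two daily scrapes, Mia only appears in the second
snapshots = pd.DataFrame({
    "name": ["Jake From State Farm", "Sherry", "Jake From State Farm", "Sherry", "Mia"],
    "link": ["/pet/jake/", "/pet/sherry/", "/pet/jake/", "/pet/sherry/", "/pet/mia/"],
    "snapshot_at": pd.to_datetime(
        ["2023-03-13", "2023-03-13", "2023-03-14", "2023-03-14", "2023-03-14"]
    ),
})

first_seen = snapshots.groupby("link")["snapshot_at"].min()  # each cat's first appearance
latest = snapshots["snapshot_at"].max()                      # the most recent scrape
new_links = first_seen[first_seen == latest].index           # first seen == latest -> new

new_cats = snapshots.loc[snapshots["link"].isin(new_links), "name"].unique()
print(new_cats)  # ['Mia']
```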
We can get Mia’s description from the object we created earlier:
print(cats_df[cats_df["name"] == "Mia"]["description"].values[0])
Mia is a beautiful dilute tortie whose petite stature doesn't reflect her big personality! She's spent time in a couple of foster homes and they all say the same things…
Wrapping up
That’s it! We’ve successfully written some code that allows us to turn listings of cats into structured data that we can store and update to use for analysis.
If you’ve made it to the end of this post, please consider making a donation to the Animal Welfare League of Arlington, or adopting if you’re local!
Footnotes
1. There are other best practices to follow, as well as general courtesy to observe when web scraping, like not making too many requests.↩︎
2. I generally followed the tutorial on the Real Python site.↩︎