import requests
from bs4 import BeautifulSoup
import pandas as pd
Intro
As many of my friends and colleagues know, I’m a prospective pet owner who’s hoping to adopt a cat (or two…) in the next few months.
As a result, I frequently check on the cats available through the Animal Welfare League of Arlington, partly because they’re just cute to look at, and partly to see whether any cats I have my eye on are still there, or whether there are new ones to admire.
Being the data nerd I am, I immediately began wondering if I could write a script that notified me whenever a new cat popped up on the site. Then I started wondering if I could take that a step further and analyze the cats posted on the site: how long they were listed, which ages tend to get adopted, and so on.
This post was born out of that idea, as well as the idea that anything can be data!
Walkthrough of scraping this data
Confirm that you’re allowed to scrape the site!
It’s best practice to confirm that you’re allowed to scrape a site before you do so by checking its robots.txt.¹
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
In this case, we’re good to go as long as we aren’t touching anything related to the WordPress admin.
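The standard library can even do this check for us. Here’s a minimal sketch using urllib.robotparser, seeded with the rules shown above; in a live script you’d point it at the site’s robots.txt with set_url() and read() instead:

```python
from urllib import robotparser

# Feed the parser the same rules we saw in the site's robots.txt
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Allow: /wp-admin/admin-ajax.php",
])

# The adoption listings aren't disallowed for generic crawlers
print(rp.can_fetch("*", "https://www.awla.org/adopt/cats/"))  # True
print(rp.can_fetch("*", "https://www.awla.org/wp-admin/"))    # False
```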
Getting started
We’ll need three libraries to get started: requests, bs4, and pandas.
If you look at other bs4 tutorials², they all start by parsing the HTML of the site of interest. So we make a request to the site and then parse the response with BeautifulSoup:
url = "https://www.awla.org/adopt/cats/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
Figuring out what exactly to scrape
This is where having some domain knowledge comes in handy, and I certainly have a lot of it when it comes to cats and what to look for 😄. For example, there are some specific things about each cat that I already know I want. Specifically, their:
- name
- age
- gender
- breed
- whether they’re at the shelter/in foster
- a link to their page
- a link to their picture (if available)
- a description of them
- whether they’re fixed
- color
- whether they’re a part of a bonded pair
Knowing that this is the information we’re looking to collect, we can set up a dictionary to hold it. I’m also going to set a timestamp so we know when we pulled everything:
all_the_cats = {
    "name": [],
    "age": [],
    "gender": [],
    "breed": [],
    "location": [],
    "link": [],
    "image_link": [],
    "description": [],
    "fixed": [],
    "color": [],
    "bonded_pair": [],
    "snapshot_at": []
}

snapshot_timestamp = pd.Timestamp.now()  # so all the data will have the same timestamp from when we run our script
Finding that information on the page
We need to figure out where exactly this information lives in the HTML we’ve parsed. Generally, I think it’s easiest to do this by inspecting the site’s code and determining where the elements of interest live.
From inspecting the HTML of the site, I determined that most of the information about the cats lives in a div with the class pet-grid-item-main.
cats = soup.find_all("div", class_="pet-grid-item-main")
Grab all the information
Within this class, we still have to determine which elements to pull out in order to form a more structured dataset.
We can take a look at the first div we stored in cats to get an idea of what we’re looking at:

cats[0]
<div class="pet-grid-item-main">
<div class="fl-post-image pet-grid-image">
<a href="https://www.awla.org/pet/jake-from-state-farm/" title="Jake From State Farm"><img alt="" class="wp-post-image" decoding="async" height="466" itemprop="image" loading="lazy" sizes="(max-width: 420px) 100vw, 420px" src="https://www.awla.org/wp-content/uploads/2023/03/719d1823-b4da-4ef2-b747-85b15a60bd74.jpg" srcset="https://www.awla.org/wp-content/uploads/2023/03/719d1823-b4da-4ef2-b747-85b15a60bd74.jpg 420w, https://www.awla.org/wp-content/uploads/2023/03/719d1823-b4da-4ef2-b747-85b15a60bd74-270x300.jpg 270w" width="420"/></a>
</div>
<div class="fl-post-content pet-grid-overlay">
<div class="pet-grid-overlay-content">
<h2 class="fl-post-title sleeve"><a href="https://www.awla.org/pet/jake-from-state-farm/" title="Jake From State Farm">Jake From State Farm</a></h2>
<div class="pet-grid-breed">Domestic Shorthair</div>
</div>
<div class="pet-grid-meta-group">
<span class="pet-grid-meta">Young</span>
<span class="pet-grid-meta">Male</span>
</div>
</div>
<ul class="fl-pet-location"><li><span class="in-foster">In Foster</span></li></ul>
<a class="pet-overlay-link" href="https://www.awla.org/pet/jake-from-state-farm/"></a>
</div>
This makes it clear how these objects are structured and what information they contain. For example, the name of the cat is in the title attribute of the <a> element. We can pull this out using find:
print(cats[0].find("a", href = True).get("title"))
Jake From State Farm
We can determine the same for some of the other information of interest. The only tricky piece is that pet-grid-meta requires find_all, since it contains both the cat’s age and gender.
for cat in cats:
    all_the_cats["name"].append(cat.find("a", href=True).get("title"))
    all_the_cats["age"].append(cat.find_all("span", class_="pet-grid-meta")[0].text)
    all_the_cats["gender"].append(cat.find_all("span", class_="pet-grid-meta")[1].text)
    all_the_cats["breed"].append(cat.find("div", class_="pet-grid-breed").text)
    all_the_cats["location"].append(cat.find("ul", class_="fl-pet-location").text)
    all_the_cats["link"].append(cat.find("a", href=True).get("href"))
    all_the_cats["image_link"].append(cat.find("img").get("src"))

all_the_cats
{'name': ['Jake From State Farm', 'Mia', 'Sherry'],
'age': ['Young', 'Adult', 'Adult'],
'gender': ['Male', 'Female', 'Female'],
'breed': ['Domestic Shorthair', 'Domestic Shorthair', 'Domestic Shorthair'],
'location': ['In Foster', 'In Foster', 'In Foster'],
'link': ['https://www.awla.org/pet/jake-from-state-farm/',
'https://www.awla.org/pet/mia/',
'https://www.awla.org/pet/sherry/'],
'image_link': ['https://www.awla.org/wp-content/uploads/2023/03/719d1823-b4da-4ef2-b747-85b15a60bd74.jpg',
'https://www.awla.org/wp-content/uploads/2023/03/6f7216e9-2790-45ad-9c1f-75a679b9bbda.jpg',
'https://www.awla.org/wp-content/uploads/2023/02/fd6bbff3-b65c-4cc4-b133-c55951c6d9e2.jpg'],
'description': [],
'fixed': [],
'color': [],
'bonded_pair': [],
'snapshot_at': []}
Notice that we’re still missing some stuff, like each cat’s description, color, etc. This information is contained on each cat’s individual page, which is found via the link that we’ve pulled above. That means we’ll have to parse out the remaining information from these pages separately.
An alternative way of grabbing information
If I were interested only in the information on the first page, the script could be much simpler. Instead of explicitly defining the dictionary upfront, we could do something like this:
data = []

for cat in cats:
    cats_example_dict = {}
    cats_example_dict["name"] = cat.find("a", href=True).get("title")
    cats_example_dict["age"] = cat.find_all("span", class_="pet-grid-meta")[0].text
    cats_example_dict["gender"] = cat.find_all("span", class_="pet-grid-meta")[1].text
    cats_example_dict["breed"] = cat.find("div", class_="pet-grid-breed").text
    cats_example_dict["location"] = cat.find("ul", class_="fl-pet-location").text
    cats_example_dict["link"] = cat.find("a", href=True).get("href")
    cats_example_dict["image_link"] = cat.find("img").get("src")
    data.append(cats_example_dict)

pd.DataFrame(data)
|   | name | age | gender | breed | location | link | image_link |
|---|------|-----|--------|-------|----------|------|------------|
| 0 | Jake From State Farm | Young | Male | Domestic Shorthair | In Foster | https://www.awla.org/pet/jake-from-state-farm/ | https://www.awla.org/wp-content/uploads/2023/0... |
| 1 | Mia | Adult | Female | Domestic Shorthair | In Foster | https://www.awla.org/pet/mia/ | https://www.awla.org/wp-content/uploads/2023/0... |
| 2 | Sherry | Adult | Female | Domestic Shorthair | In Foster | https://www.awla.org/pet/sherry/ | https://www.awla.org/wp-content/uploads/2023/0... |
Getting the rest of the information we wanted
We can follow the same process as before to grab the additional information from each cat’s page. We can do this programmatically by looping through each cat’s link and creating a list of BeautifulSoup objects.
description_pages = []

for link in all_the_cats["link"]:
    response = requests.get(link)
    description_pages.append(BeautifulSoup(response.content, "html.parser"))
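One courtesy worth building into a loop like this is a pause between requests so we don’t hammer the shelter’s server. A small wrapper can enforce a minimum interval between calls; polite is a name I’m making up here, and it’s just a sketch:

```python
import time

def polite(fetch, min_interval=1.0):
    """Wrap a function so consecutive calls are at least min_interval seconds apart."""
    last_call = [0.0]  # mutable cell so the wrapper can update it

    def wrapper(*args, **kwargs):
        wait = min_interval - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return fetch(*args, **kwargs)

    return wrapper

# usage with the loop above:
# get = polite(requests.get, min_interval=2.0)
# response = get(link)
```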
Now we’ve collected all of the cats’ description pages into an object we can iterate through. We can append these to the dictionary elements that we created earlier.
for page in description_pages:
    all_the_cats["fixed"].append(page.find_all("div", class_="pet-meta-info")[6].text.replace("\n", "").replace("Fixed", ""))
    all_the_cats["color"].append(page.find_all("div", class_="pet-meta-info")[4].text.replace("\n", "").replace("Color", ""))
    ## get the description, use None if it's missing
    if page.find("meta", property="og:description") is None:
        all_the_cats["description"].append(None)
    else:
        all_the_cats["description"].append(page.find("meta", property="og:description").get("content"))
    ## check if the cat is part of a bonded pair
    if page.find("div", class_="fl-col fl-post-column buddy-col") is None:
        all_the_cats["bonded_pair"].append(False)
    else:
        all_the_cats["bonded_pair"].append(True)
    ## add the timestamp
    all_the_cats["snapshot_at"].append(snapshot_timestamp)
Then we can easily create a pandas dataframe from the results of our scraping.
cats_df = pd.DataFrame(all_the_cats)
cats_df
|   | name | age | gender | breed | location | link | image_link | description | fixed | color | bonded_pair | snapshot_at |
|---|------|-----|--------|-------|----------|------|------------|-------------|-------|-------|-------------|-------------|
| 0 | Jake From State Farm | Young | Male | Domestic Shorthair | In Foster | https://www.awla.org/pet/jake-from-state-farm/ | https://www.awla.org/wp-content/uploads/2023/0... | Like a good neighbor, Jake From State Farm is ... | Yes | Brown | False | 2023-03-14 16:08:10.421041 |
| 1 | Mia | Adult | Female | Domestic Shorthair | In Foster | https://www.awla.org/pet/mia/ | https://www.awla.org/wp-content/uploads/2023/0... | Mia is a beautiful dilute tortie whose petite ... | No | Bronze | False | 2023-03-14 16:08:10.421041 |
| 2 | Sherry | Adult | Female | Domestic Shorthair | In Foster | https://www.awla.org/pet/sherry/ | https://www.awla.org/wp-content/uploads/2023/0... | When Sherry isn't finding the coziest hiding s... | Yes | Black | False | 2023-03-14 16:08:10.421041 |
Bonus step - storing our scraped data
In the real world, we rarely scrape data as a one-off report and do nothing with it. Usually we take the results of our work and store them somewhere.
There are a few options for doing this depending on your tech stack. For example, you could save the raw dictionaries to a data lake (like S3) as JSON and parse them later into tables for analysis.
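For example, here’s a rough sketch of the JSON route, writing to a local folder. dump_snapshot is a name I’m making up; for S3 you’d swap the file write for a client like boto3:

```python
import json
from pathlib import Path
from datetime import datetime, timezone

def dump_snapshot(data, out_dir="snapshots"):
    """Write one scrape's raw dictionary to a timestamped JSON file."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out / f"cats_{stamp}.json"
    # default=str handles non-JSON types like pandas Timestamps
    path.write_text(json.dumps(data, default=str, indent=2))
    return path
```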
Or, since in our case we’re dealing with a pretty small amount of data, you could push it straight to a SQL database. I chose to do this since it’s the simplest solution for me.
We can write a pretty simple, extendable function to do this for us using pandas. In my case, since I’m using a local PostgreSQL server, that looks something like this:
from sqlalchemy import create_engine

def write_table(pd_table, table="", user_name="", password=""):
    conn_string = f"postgresql://localhost/postgres?user={user_name}&password={password}"
    conn = create_engine(conn_string)

    pd_table.to_sql(
        table,
        schema="misc",
        con=conn,
        if_exists="append",
        index=False
    )
from os import getenv  # use environment variables for usernames/passwords!

write_table(
    cats_df,
    table="awla_cat_snapshots",
    user_name=getenv("PG_USER"),
    password=getenv("PG_PASS")
)
Since I’d been scraping this page for a few days before this write-up, I can answer questions with SQL like: which cats on the site are new? In other words, which cats have appeared on the site only once since I started collecting data?
WITH rn_table AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY link ORDER BY snapshot_at ASC) AS n_appearances
FROM misc.awla_cat_snapshots acs
)
SELECT
DISTINCT name
FROM rn_table
WHERE n_appearances = 1
AND DATE(snapshot_at) = current_date
Mia
As of the day I wrote this post, Mia is the newest cat on the site.
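If you’d rather answer the same question in pandas, before the data ever reaches the database, the equivalent logic is “cats whose first snapshot is the latest snapshot.” A sketch with made-up data (the links here are shortened placeholders, not the real URLs):

```python
import pandas as pd

# toy snapshots table: two daily scrapes, Mia only appears in the second
snapshots = pd.DataFrame({
    "name": ["Jake From State Farm", "Sherry", "Jake From State Farm", "Sherry", "Mia"],
    "link": ["/pet/jake/", "/pet/sherry/", "/pet/jake/", "/pet/sherry/", "/pet/mia/"],
    "snapshot_at": pd.to_datetime(
        ["2023-03-13", "2023-03-13", "2023-03-14", "2023-03-14", "2023-03-14"]
    ),
})

first_seen = snapshots.groupby("link")["snapshot_at"].min()  # each cat's first appearance
latest = snapshots["snapshot_at"].max()                      # the most recent scrape
new_links = first_seen[first_seen == latest].index           # first seen == latest -> new

new_cats = snapshots.loc[snapshots["link"].isin(new_links), "name"].unique()
print(new_cats)  # ['Mia']
```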
We can get Mia’s description from the object we created earlier:
print(cats_df[cats_df["name"] == "Mia"]["description"].values[0])
Mia is a beautiful dilute tortie whose petite stature doesn't reflect her big personality! She's spent time in a couple of foster homes and they all say the same things…
Wrapping up
That’s it! We’ve successfully written some code that allows us to turn listings of cats into structured data that we can store and update to use for analysis.
If you’ve made it to the end of this post, please consider making a donation to the Animal Welfare League of Arlington, or adopting if you’re local!
Footnotes
1. There are other best practices to follow, as well as general courtesy to observe when web scraping, like not making too many requests.↩︎
2. I generally followed the tutorial on the Real Python site.↩︎