[concept] Web & APIs
BeautifulSoup Basics
# theory
BeautifulSoup
BeautifulSoup parses HTML/XML, letting you extract data from web pages:
from bs4 import BeautifulSoup
html = "<html><body><h1>Title</h1><p>Content</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text) # prints: Title
print(soup.p.text) # prints: Content
finding elements
# Find first matching element
soup.find("div")
soup.find("div", class_="container")
soup.find("a", href="/page")
# Find all matching elements
soup.find_all("p")
soup.find_all("a")
soup.find_all("div", class_="item")
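The calls above discard their results, so here is a minimal sketch of what each one hands back. The markup is made up for illustration and is not taken from the lesson's sample page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<div class="container">
  <p>First</p>
  <p>Second</p>
  <a href="/page">Link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")        # a single Tag: the first <p>
all_p = soup.find_all("p")      # a list-like ResultSet with every <p>
missing = soup.find("table")    # None, because no <table> exists
no_rows = soup.find_all("tr")   # an empty result, never None

print(first_p.text)             # First
print([p.text for p in all_p])  # ['First', 'Second']
print(missing)                  # None
print(len(no_rows))             # 0
```

Because find can return None, it is usually worth checking the result before reading .text from it, while a find_all result can be looped over directly even when nothing matched.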
extracting
element = soup.find("a")
element.text # Text content
element.get("href") # Attribute value
element["href"] # Same, but raises KeyError if the attribute is missing
element.get_text(strip=True) # Clean text
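A small sketch of how these accessors behave when an attribute is absent and when the text carries extra whitespace. The two links are hypothetical markup written for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with an href, one without.
html = '<a href="/page">  Home  </a><a>No link</a>'
soup = BeautifulSoup(html, "html.parser")
with_href, without_href = soup.find_all("a")

print(with_href.get("href"))           # /page
print(with_href["href"])               # /page
print(without_href.get("href"))        # None
print(without_href.get("href", "#"))   # falls back to the default: #

# Square-bracket access fails loudly when the attribute is missing.
try:
    without_href["href"]
    raised = False
except KeyError:
    raised = True
print(raised)                          # True

raw_text = with_href.text                       # keeps surrounding spaces
clean_text = with_href.get_text(strip=True)     # whitespace trimmed
print(repr(raw_text), repr(clean_text))
```

One reasonable rule of thumb: use get when the attribute is optional, and square brackets when a missing attribute should stop the script.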
css selectors
soup.select("div.container") # Class selector
soup.select("#main") # ID selector
soup.select("div p") # Descendant
soup.select("div > p") # Direct child
soup.select("a[href]") # Has attribute
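To make the difference between the descendant and direct-child selectors concrete, the sketch below runs them against a small made-up document. It also uses select_one, which returns only the first match (or None) rather than a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<div id="main" class="container">
  <p>Direct child</p>
  <section><p>Nested</p></section>
  <a href="/page">Linked</a>
  <a>Anchor without href</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div p" matches any <p> inside the div, at any depth.
descendants = [p.text for p in soup.select("div p")]
# "div > p" matches only <p> tags whose parent is the div.
children = [p.text for p in soup.select("div > p")]
# "a[href]" skips anchors that have no href attribute.
links = [a.text for a in soup.select("a[href]")]
# select_one returns the first match instead of a list.
main = soup.select_one("#main")

print(descendants)   # ['Direct child', 'Nested']
print(children)      # ['Direct child']
print(links)         # ['Linked']
print(main.name)     # div
```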
scraping a list
items = []
for row in soup.find_all("tr"):
    cols = row.find_all("td")
    if cols:
        items.append({
            "name": cols[0].text.strip(),
            "price": cols[1].text.strip()
        })
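The loop above assumes a soup that already holds a table. Here is a self-contained sketch of the same pattern against a hypothetical two-column table. The header row uses th cells, so it yields no td elements and gets skipped; the length check is an added guard against rows with fewer than two cells.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this illustration.
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td> Laptop </td><td>$999</td></tr>
  <tr><td>Mouse</td><td>$25</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

items = []
for row in soup.find_all("tr"):
    cols = row.find_all("td")
    # The header row has only <th> cells, so cols is empty there.
    if len(cols) >= 2:
        items.append({
            "name": cols[0].get_text(strip=True),
            "price": cols[1].get_text(strip=True),
        })

print(items)
# [{'name': 'Laptop', 'price': '$999'}, {'name': 'Mouse', 'price': '$25'}]
```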
with pyfetch
On this site we fetch HTML over the network with pyfetch (same-origin requests need no CORS setup), then pass the result of response.string() to BeautifulSoup. The parsing code matches a regular Python script; the only difference is that the fetch and the body read must be awaited.
from pyodide.http import pyfetch
from bs4 import BeautifulSoup
response = await pyfetch("/sample-data/electronics-store.html")
html = await response.string()
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1").get_text())
There's a real HTML page bundled with this site at /sample-data/electronics-store.html for the examples below. It has an inventory table, category list, and "featured" section so you can practice every common selector against actual markup, not a synthetic string.
# examples [3]
find returns the first match; find_all returns every match. Selectors land on the same elements.
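As a quick check of that claim, the sketch below compares the two APIs on a hypothetical snippet. It relies on both calls returning Tag objects from the same parsed tree, so the matches can be compared by identity.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<div class="item">A</div>
<div class="item">B</div>
<div class="other">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_find = soup.find_all("div", class_="item")
by_select = soup.select("div.item")

# Both approaches return the very same Tag objects, in document order.
same_elements = len(by_find) == len(by_select) and all(
    a is b for a, b in zip(by_find, by_select)
)
# find pairs with select_one: each returns only the first match.
same_first = soup.find("div", class_="item") is soup.select_one("div.item")

print([d.text for d in by_find])   # ['A', 'B']
print(same_elements, same_first)   # True True
```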
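Real scraping needs the attribute, not just the text. element["attr"] and element.get("attr") both return it; they differ only when the attribute is missing, where the first raises KeyError and the second returns None.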
Lesson 29 uses pd.read_html to pull tables out of HTML. Doing it by hand once shows you what read_html is doing under the hood.
# challenges [2]
# project
# project-challenge
thread: Sales Performance Dashboard · reward: 50 xp
# brief
The legacy system exports sales data as HTML. Extract the sales information from the HTML table structure using regex patterns.
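As a rough illustration of the regex approach the brief asks for, the sketch below pulls cells out of a hypothetical export. The table layout, the column names, and the variable names are all assumptions; the real export may use attributes or nesting that these simple patterns would not handle.

```python
import re

# Hypothetical export, invented for this illustration.
html = """
<table>
  <tr><th>Region</th><th>Sales</th></tr>
  <tr><td>North</td><td>1200</td></tr>
  <tr><td>South</td><td>950</td></tr>
</table>
"""

# Non-greedy groups keep each match inside a single row or cell.
row_pattern = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)
cell_pattern = re.compile(r"<td>(.*?)</td>", re.DOTALL)

sales = []
for row in row_pattern.findall(html):
    cells = cell_pattern.findall(row)
    # The header row has <th> cells only, so it yields no matches here.
    if len(cells) == 2:
        region, amount = cells
        sales.append((region.strip(), int(amount)))

print(sales)   # [('North', 1200), ('South', 950)]
```

Patterns like these work only when the markup is as regular as the sample. Tags with attributes (for example a td with a class) would need looser patterns.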
# task