[practice]Web & APIs
Scraping Tables
# theory
pd.read_html
pd.read_html parses every table on a page into DataFrames. In a regular script you can hand it a URL. In Pyodide, the network goes through pyfetch, so the recipe is:
from pyodide.http import pyfetch
import pandas as pd
resp = await pyfetch("/sample-data/electronics-store.html")
html = await resp.string()
tables = pd.read_html(html) # list of DataFrames, one per <table>
df = tables[0]
The site bundles a real HTML page at /sample-data/electronics-store.html containing an inventory table. The examples below pull from it.
picking the right table
When a page has more than one table, three tools narrow it down:
pd.read_html(html, match="Inventory") # tables containing this text
pd.read_html(html, attrs={"id": "inventory-table"}) # by attribute
pd.read_html(html, header=0) # explicit header row
cleaning
pd.read_html returns strings for every column by default, because HTML doesn't know types. You'll always have a cleanup step.
df = pd.read_html(html, attrs={"id": "inventory-table"})[0]
df["Price"] = df["Price"].astype(float)
df["Stock"] = df["Stock"].astype(int)
Currency or formatted numbers need a regex strip first:
df["Revenue"] = df["Revenue"].str.replace(r"[$,]", "", regex=True).astype(float)
what read_html can't do
- Pages where the table is rendered by JavaScript after page load.
pyfetchgets the raw HTML; if the table isn't in the initial markup, it's not there to parse. - Cells with images instead of text. The image alt attribute is sometimes a good fallback, but
read_htmlignores it; you have to drop to BeautifulSoup. - Tables with merged cells (rowspan/colspan). Pandas tries, but you'll often need to clean the output by hand.
# examples [3]
When you don't know exactly which table you want, read everything and inspect.
Once you have a DataFrame, the rest is normal pandas.
Pass match= to skip tables you don't want. Useful when a page has nav tables, sidebar tables, etc.
# challenges [2]
# project
# project-challenge
thread: Sales Performance Dashboard · reward: 50 xp
# brief
The weekly sales report is distributed as an HTML table. Use pd.read_html to extract the data and clean up the currency formatting for analysis.
# task