pyodide: loading…

[practice]Web & APIs

Scraping Tables

# theory

pd.read_html

pd.read_html parses every table on a page into DataFrames. In a regular script you can hand it a URL. In Pyodide, the network goes through pyfetch, so the recipe is:

from pyodide.http import pyfetch
import pandas as pd

resp = await pyfetch("/sample-data/electronics-store.html")
html = await resp.string()
tables = pd.read_html(html)   # list of DataFrames, one per <table>
df = tables[0]

The site bundles a real HTML page at /sample-data/electronics-store.html containing an inventory table. The examples below pull from it.

picking the right table

When a page has more than one table, three tools narrow it down:

pd.read_html(html, match="Inventory")          # tables containing this text
pd.read_html(html, attrs={"id": "inventory-table"})  # by attribute
pd.read_html(html, header=0)                   # explicit header row

cleaning

pd.read_html returns strings for every column by default, because HTML doesn't know types. You'll always have a cleanup step.

df = pd.read_html(html, attrs={"id": "inventory-table"})[0]
df["Price"] = df["Price"].astype(float)
df["Stock"] = df["Stock"].astype(int)

Currency or formatted numbers need a regex strip first:

df["Revenue"] = df["Revenue"].str.replace(r"[$,]", "", regex=True).astype(float)

what read_html can't do

  • Pages where the table is rendered by JavaScript after page load. pyfetch gets the raw HTML; if the table isn't in the initial markup, it's not there to parse.
  • Cells with images instead of text. The image alt attribute is sometimes a good fallback, but read_html ignores it; you have to drop to BeautifulSoup.
  • Tables with merged cells (rowspan/colspan). Pandas tries, but you'll often need to clean the output by hand.

# examples [3]

# example 01 · all tables on a page

When you don't know exactly which table you want, read everything and inspect.

1
2
3
4
5
6
7
8
9
10
11
🐍
Loading PythonSetting up pandas & numpy...
# example 02 · aggregating after the parse

Once you have a DataFrame, the rest is normal pandas.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
🐍
Loading PythonSetting up pandas & numpy...
# example 03 · filtering during the parse

Pass match= to skip tables you don't want. Useful when a page has nav tables, sidebar tables, etc.

1
2
3
4
5
6
7
8
9
10
🐍
Loading PythonSetting up pandas & numpy...

# challenges [2]

# challenge 01/02todo
Fetch /sample-data/electronics-store.html, parse the inventory table, and print the total stock count across all rows in the format 'total stock: N' where N is an integer.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
🐍
Loading PythonSetting up pandas & numpy...
# challenge 02/02todo
Same page. Find the single category with the highest total inventory value (Price * Stock summed). Print 'top category: NAME' where NAME is the category name.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
🐍
Loading PythonSetting up pandas & numpy...

# project

# project-challenge

thread: Sales Performance Dashboard · reward: 50 xp

# brief

The weekly sales report is distributed as an HTML table. Use pd.read_html to extract the data and clean up the currency formatting for analysis.

# task

Parse HTML Sales Report

# your code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
🐍
Loading PythonSetting up pandas & numpy...