python-mastery

# theory

pd.read_html

pd.read_html parses every <table> on a page and returns a list of DataFrames. In a regular script you can pass it a URL directly. In Pyodide, network requests go through pyfetch, so you fetch the HTML yourself and parse the resulting string:

import io

import pandas as pd
from pyodide.http import pyfetch

resp = await pyfetch("/sample-data/electronics-store.html")
html = await resp.string()
# pandas 2.1+ warns when read_html gets a raw HTML string, so wrap it in StringIO
tables = pd.read_html(io.StringIO(html))   # list of DataFrames, one per <table>
df = tables[0]

The site bundles a real HTML page at /sample-data/electronics-store.html containing an inventory table. The examples below pull from it.

picking the right table

When a page has more than one table, match and attrs narrow down which tables come back, and header tells pandas which row holds the column names:

pd.read_html(html, match="Inventory")          # tables containing this text
pd.read_html(html, attrs={"id": "inventory-table"})  # by attribute
pd.read_html(html, header=0)                   # explicit header row
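
To see attrs and header working together, here is a minimal sketch on an inlined two-table page. The HTML, the promo-table id, and the variable names are made up for illustration; only the inventory-table id echoes the snippet above. The string is wrapped in io.StringIO so the sketch also runs quietly on pandas 2.1+, and no network call is needed.

```python
import io

import pandas as pd

# Hypothetical page: the header row uses <td>, so pandas cannot detect it by itself.
html = """
<table id="promo-table">
  <tr><td>Banner</td><td>Spring sale</td></tr>
</table>
<table id="inventory-table">
  <tr><td>Product</td><td>Price</td></tr>
  <tr><td>Webcam</td><td>49.00</td></tr>
  <tr><td>Headset</td><td>35.50</td></tr>
</table>
"""

# attrs keeps only the table whose id matches; the promo table is never parsed.
no_header = pd.read_html(io.StringIO(html), attrs={"id": "inventory-table"})[0]

# header=0 promotes the first row to column names.
with_header = pd.read_html(
    io.StringIO(html), attrs={"id": "inventory-table"}, header=0
)[0]

print(list(no_header.columns), len(no_header))
print(list(with_header.columns), len(with_header))
```

Without header=0 the column names are positional integers and the label row is counted as data. With it, the labels become the columns and Price can be inferred as a number.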

cleaning

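pd.read_html infers a dtype for each column, so cells that are cleanly numeric usually arrive as int64 or float64. Anything carrying a currency symbol, a unit, or stray text stays a string (object dtype), because HTML itself has no types. Plan on a cleanup step, and make the conversions explicit so an unexpected value fails loudly.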

df = pd.read_html(html, attrs={"id": "inventory-table"})[0]
df["Price"] = df["Price"].astype(float)
df["Stock"] = df["Stock"].astype(int)

Currency or formatted numbers need a regex strip first:

df["Revenue"] = df["Revenue"].str.replace(r"[$,]", "", regex=True).astype(float)
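
astype(float) raises as soon as one cell is still not numeric after the strip. When a column may hold placeholders such as "n/a", pd.to_numeric with errors="coerce" is a gentler option. The sketch below uses a hand-built DataFrame with made-up values in place of a parsed table.

```python
import pandas as pd

# Hypothetical values, shaped like strings that come back from an HTML table.
raw = pd.DataFrame({
    "Product": ["Desk Lamp", "Monitor Arm", "Laptop Stand"],
    "Revenue": ["$749.85", "$1,099.78", "n/a"],
})

# Strip currency symbols and thousands separators first.
stripped = raw["Revenue"].str.replace(r"[$,]", "", regex=True)

# errors="coerce" turns anything still unparseable into NaN instead of raising.
raw["Revenue"] = pd.to_numeric(stripped, errors="coerce")

print(raw)
print(raw["Revenue"].dtype)
print("unparseable rows:", int(raw["Revenue"].isna().sum()))
```

Counting the NaN values afterwards shows how many cells were dropped, which is worth checking before aggregating.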

what read_html can't do

Pages where the table is rendered by JavaScript after page load. pyfetch gets the raw HTML; if the table isn't in the initial markup, it's not there to parse.
Cells with images instead of text. The image alt attribute is sometimes a good fallback, but read_html ignores it; you have to drop to BeautifulSoup.
Tables with merged cells (rowspan/colspan). pandas copies a spanned value into every cell it covers, so the frame parses but you'll often need to clean the repeated values by hand.
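
For the image-cell case, here is a rough sketch of the alt-text fallback. It swaps in the standard library's html.parser instead of BeautifulSoup so it runs with no extra install. The AltTextTable class and the sample HTML are hypothetical, and the parser assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class AltTextTable(HTMLParser):
    """Collect table rows, using an <img> alt attribute as cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            alt = dict(attrs).get("alt")
            if alt:
                self._cell.append(alt)

    def handle_data(self, data):
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical table whose second column holds only icons.
html = """
<table>
  <tr><th>Product</th><th>In stock</th></tr>
  <tr><td>Webcam</td><td><img src="/icons/check.png" alt="yes"></td></tr>
  <tr><td>Headset</td><td><img src="/icons/cross.png" alt="no"></td></tr>
</table>
"""

parser = AltTextTable()
parser.feed(html)
header, *body = parser.rows
print(header)
print(body)
```

From there, pd.DataFrame(body, columns=header) gives a frame you can treat like any other read_html result.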

# examples [3]

# example 01 · all tables on a page

When you don't know exactly which table you want, read everything and inspect.
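
A minimal sketch of that inspect-everything loop, run against an inlined page so it works without pyfetch. The two tables, their ids, and the product rows are invented for illustration and are not the contents of the bundled sample page.

```python
import io

import pandas as pd

# Hypothetical two-table page: a navigation table and an inventory table.
html = """
<table id="nav-table">
  <tr><th>Section</th><th>Link</th></tr>
  <tr><td>Home</td><td>/home</td></tr>
  <tr><td>Deals</td><td>/deals</td></tr>
</table>
<table id="inventory-table">
  <tr><th>Product</th><th>Category</th><th>Price</th><th>Stock</th></tr>
  <tr><td>USB-C Cable</td><td>Accessories</td><td>9.99</td><td>120</td></tr>
  <tr><td>Mechanical Keyboard</td><td>Peripherals</td><td>89.50</td><td>35</td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))
print(f"found {len(tables)} tables")

# Shape, column names, and dtypes are usually enough to spot the table you want.
for i, table in enumerate(tables):
    print(f"\ntable {i}: shape={table.shape}")
    print(f"columns: {list(table.columns)}")
    print(table.dtypes)
```

Once the index of the table you need is known, tables[i] is an ordinary DataFrame.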

# example 02 · aggregating after the parse

Once you have a DataFrame, the rest is normal pandas.

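import io

import pandas as pd
from pyodide.http import pyfetch

resp = await pyfetch("/sample-data/electronics-store.html")
html = await resp.string()
# Wrap in StringIO: pandas 2.1+ warns when read_html gets a raw HTML string
df = pd.read_html(io.StringIO(html), attrs={"id": "inventory-table"})[0]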

df["Price"] = df["Price"].astype(float)
df["Stock"] = df["Stock"].astype(int)

# Stock by category
by_cat = df.groupby("Category")["Stock"].sum().sort_values(ascending=False)
print(by_cat)

# Inventory value by category
df["Value"] = df["Price"] * df["Stock"]
value_by_cat = df.groupby("Category")["Value"].sum().round(2)
print("\ninventory value by category:")
print(value_by_cat)

# example 03 · filtering during the parse

Pass match= to skip tables you don't want. Useful when a page has nav tables, sidebar tables, etc.
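
As a sketch of the idea, the inlined page below has a navigation table and an inventory table, and match="Inventory" keeps only the second. The HTML and the caption text are assumptions made for this example. match is treated as a regular expression and tested against the text inside each table.

```python
import io

import pandas as pd

# Hypothetical page: only the second table contains the word "Inventory".
html = """
<table>
  <tr><th>Section</th><th>Link</th></tr>
  <tr><td>Home</td><td>/home</td></tr>
</table>
<table>
  <caption>Store Inventory</caption>
  <tr><th>Product</th><th>Price</th><th>Stock</th></tr>
  <tr><td>USB-C Cable</td><td>9.99</td><td>120</td></tr>
  <tr><td>Webcam</td><td>49.00</td><td>18</td></tr>
</table>
"""

# No filter: every table on the page comes back.
all_tables = pd.read_html(io.StringIO(html))

# match keeps only tables whose text matches the pattern.
inventory_only = pd.read_html(io.StringIO(html), match="Inventory")

print(f"unfiltered: {len(all_tables)} tables, filtered: {len(inventory_only)}")
df = inventory_only[0]
print(df)
```

The result is still a list even when a single table matches, so the [0] index stays.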

# challenges [2]

# challenge 01/02

Fetch /sample-data/electronics-store.html, parse the inventory table, and print the total stock count across all rows in the format 'total stock: N' where N is an integer.

# Fetch the real inventory page and pull the table into a DataFrame.
import pandas as pd
from pyodide.http import pyfetch

resp = await pyfetch("/sample-data/electronics-store.html")
html = await resp.string()

df = pd.read_html(html, attrs={"id": "inventory-table"})[0]
df["Price"] = df["Price"].astype(float)
df["Stock"] = df["Stock"].astype(int)

print(df.head())
print(f"rows: {len(df)}")
print(f"total stock units across catalog: {df['Stock'].sum()}")


# Fetch /sample-data/electronics-store.html, parse the inventory table, and print the total stock count across all rows in the format 'total stock: N' where N is an integer.
# Your code here:

# challenge 02/02

Same page. Find the single category with the highest total inventory value (Price * Stock summed). Print 'top category: NAME' where NAME is the category name.

# Fetch the real inventory page and pull the table into a DataFrame.
import pandas as pd
from pyodide.http import pyfetch

resp = await pyfetch("/sample-data/electronics-store.html")
html = await resp.string()

df = pd.read_html(html, attrs={"id": "inventory-table"})[0]
df["Price"] = df["Price"].astype(float)
df["Stock"] = df["Stock"].astype(int)

print(df.head())
print(f"rows: {len(df)}")
print(f"total stock units across catalog: {df['Stock'].sum()}")


# Same page. Find the single category with the highest total inventory value (Price * Stock summed). Print 'top category: NAME' where NAME is the category name.
# Your code here:

# project

# project-challenge

thread: Sales Performance Dashboard · reward: 50 xp

# brief

The weekly sales report is distributed as an HTML table. Use pd.read_html to extract the data and clean up the currency formatting for analysis.

# task

Parse HTML Sales Report

# your code

import pandas as pd
import io

# Weekly sales report as HTML table
html_table = """
<table border="1">
    <thead>
        <tr><th>SalesRep</th><th>Region</th><th>Product</th><th>Units</th><th>Revenue</th></tr>
    </thead>
    <tbody>
        <tr><td>Alice Chen</td><td>North</td><td>Widget Pro</td><td>15</td><td>$749.85</td></tr>
        <tr><td>Bob Martinez</td><td>South</td><td>Gadget Plus</td><td>8</td><td>$239.92</td></tr>
        <tr><td>Carol Davis</td><td>East</td><td>Widget Pro</td><td>22</td><td>$1,099.78</td></tr>
        <tr><td>Dan Wilson</td><td>West</td><td>Super Tool</td><td>45</td><td>$899.55</td></tr>
        <tr><td>Eva Brown</td><td>North</td><td>Power Unit</td><td>10</td><td>$899.90</td></tr>
    </tbody>
</table>
"""

# Task:
# 1. Parse the HTML table with pd.read_html
# 2. Clean the Revenue column (remove $ and commas, convert to float)
# 3. Find the top performer by revenue
# 4. Print the DataFrame and the top performer
