python-mastery

# theory

BeautifulSoup

BeautifulSoup parses HTML/XML, letting you extract data from web pages:

from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Content</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)  # "Title"
print(soup.p.text)   # "Content"

finding elements

# Find first matching element
soup.find("div")
soup.find("div", class_="container")
soup.find("a", href="/page")

# Find all matching elements
soup.find_all("p")
soup.find_all("a")
soup.find_all("div", class_="item")
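
The calls above do not show what comes back when nothing matches. The sketch below runs against a small made-up HTML string (the markup and class names are assumptions for illustration only). It shows that find returns None while find_all returns an empty list, so a find result is worth checking before use.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
html = """
<div class="container">
  <p class="item">first</p>
  <p class="item">second</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("p", class_="item")      # a single Tag
all_items = soup.find_all("p", class_="item")   # list of Tags
missing = soup.find("table")                    # None: no <table> in the markup
missing_all = soup.find_all("table")            # empty list

print(first_item.get_text())
print(len(all_items))
print(missing is None)
print(len(missing_all))

# Guard before calling methods on a find() result
heading = soup.find("h1")
heading_text = heading.get_text() if heading is not None else "(no heading)"
print(heading_text)
```

Calling .get_text() directly on a None result raises AttributeError, which is the usual symptom of a selector that did not match.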

extracting

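element = soup.find("a")
element.text           # Text content, whitespace included
element.get("href")    # Attribute value, or None if the attribute is missing
element["href"]        # Same value, but raises KeyError if the attribute is missing
element.get_text(strip=True)  # Text with leading/trailing whitespace stripped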
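
To make the difference between the two attribute lookups concrete, here is a minimal sketch on a made-up two-link string (the ids and hrefs are hypothetical). It compares element["href"] with element.get("href") when the attribute is absent, and .text with get_text(strip=True).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with an href, one without
html = '<a id="home" href="/page">  Home  </a><a id="bare">No link</a>'
soup = BeautifulSoup(html, "html.parser")

with_href = soup.find("a", id="home")
without_href = soup.find("a", id="bare")

href = with_href["href"]                  # attribute present: both styles work
raw = with_href.text                      # keeps the surrounding spaces
clean = with_href.get_text(strip=True)    # spaces stripped

missing = without_href.get("href")        # None when the attribute is absent
fallback = without_href.get("href", "#")  # or a default of your choice

try:
    without_href["href"]                  # subscript access raises instead
    raised = False
except KeyError:
    raised = True

print(href, repr(raw), repr(clean), missing, fallback, raised)
```

In a scraping loop, .get is usually the safer choice because one element without the attribute does not abort the whole run.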

css selectors

soup.select("div.container")    # Class selector
soup.select("#main")            # ID selector
soup.select("div p")            # Descendant
soup.select("div > p")          # Direct child
soup.select("a[href]")          # Has attribute
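
The descendant and direct-child selectors are easy to confuse, so here is a short sketch on invented markup (the ids, classes, and text are assumptions). It also uses select_one, which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
html = """
<div id="main" class="container">
  <p>direct child</p>
  <section><p>nested</p></section>
  <a href="/a">linked</a>
  <a>no href</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div p" matches every <p> anywhere inside a <div>
descendants = [p.get_text() for p in soup.select("div p")]
# "div > p" matches only <p> elements whose parent is the <div>
children = [p.get_text() for p in soup.select("div > p")]
# "a[href]" skips the anchor that has no href attribute
links = [a.get_text() for a in soup.select("a[href]")]

main = soup.select_one("#main")        # first match as a Tag
nothing = soup.select_one("#sidebar")  # None: no such id

print(descendants)
print(children)
print(links)
print(main["class"], nothing)
```

Note that main["class"] is a list, because class is a multi-valued attribute in BeautifulSoup.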

scraping a list

items = []
for row in soup.find_all("tr"):
    cols = row.find_all("td")
    if cols:
        items.append({
            "name": cols[0].text.strip(),
            "price": cols[1].text.strip()
        })
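
The loop above skips the header row only because a row of th cells has no td cells. The sketch below makes that explicit on a small invented table (names and prices are hypothetical) and also converts the price text to a number, assuming every price is a plain "$" followed by digits.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this illustration
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td> Cable </td><td>$4.50</td></tr>
  <tr><td>Mouse</td><td>$19.99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

items = []
for row in soup.find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 2:
        continue  # header row (only <th> cells) or a malformed row
    items.append({
        "name": cols[0].get_text(strip=True),
        # Assumes a leading "$" and no thousands separators
        "price": float(cols[1].get_text(strip=True).lstrip("$")),
    })

total = sum(item["price"] for item in items)
print(items)
print(f"total: {total:.2f}")
```

Checking len(cols) rather than truthiness also protects the cols[1] lookup against rows with a single cell.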

with pyfetch

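On this site we fetch HTML over the network with pyfetch (same-origin requests are not subject to CORS restrictions), then pass the text from await response.string() to BeautifulSoup. The parsing code is the same as in a regular Python script; only the fetch differs, because pyfetch is asynchronous and its calls must be awaited.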

from pyodide.http import pyfetch
from bs4 import BeautifulSoup

response = await pyfetch("/sample-data/electronics-store.html")
html = await response.string()
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").get_text())

There's a real HTML page bundled with this site at /sample-data/electronics-store.html for the examples below. It has an inventory table, a category list, and a "featured" section, so you can practice every common selector against actual markup rather than a synthetic string.

# examples [3]

# example 01 · finding by tag and by class

find returns the first match (or None if nothing matches); find_all returns a list of every match. CSS selectors passed to select reach the same elements.

from pyodide.http import pyfetch
from bs4 import BeautifulSoup

resp = await pyfetch("/sample-data/electronics-store.html")
soup = BeautifulSoup(await resp.string(), "html.parser")

# First h1 on the page
print("page heading:", soup.find("h1").get_text())

# All section headings
for h2 in soup.find_all("h2"):
    print("section:", h2.get_text())

# Featured list items via CSS selector
for li in soup.select("ul.featured-list li"):
    print("featured:", li.get_text(strip=True))


# example 02 · reading attributes off elements

Real scraping often needs an attribute value, not just the text. Both element["attr"] and element.get("attr") read it; the first raises KeyError when the attribute is missing, the second returns None.

from pyodide.http import pyfetch
from bs4 import BeautifulSoup

resp = await pyfetch("/sample-data/electronics-store.html")
soup = BeautifulSoup(await resp.string(), "html.parser")

# Each featured item carries the SKU in a data attribute.
featured_skus = [li.get("data-sku") for li in soup.select("ul.featured-list li")]
print("featured skus:", featured_skus)

# Pull category names from a structured list rather than free-form text
categories = [c.get_text(strip=True) for c in soup.select("li.cat")]
print("count of categories:", len(categories))


# example 03 · walking a table by hand

Lesson 29 uses pd.read_html for this. Doing it by hand once teaches you what read_html is doing under the hood.

import pandas as pd
from pyodide.http import pyfetch
from bs4 import BeautifulSoup

resp = await pyfetch("/sample-data/electronics-store.html")
soup = BeautifulSoup(await resp.string(), "html.parser")

table = soup.find("table", id="inventory-table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(dict(zip(headers, cells)))

df = pd.DataFrame(rows)
df["Price"] = df["Price"].astype(float)
df["Stock"] = df["Stock"].astype(int)
print(df.head())
print(f"total inventory rows: {len(df)}")


# challenges [2]

# challenge 01/02 · todo

Fetch /sample-data/electronics-store.html, parse it, and print the page's h1 text exactly as it appears.

# Fetch the sample HTML and parse it with BeautifulSoup.
from pyodide.http import pyfetch
from bs4 import BeautifulSoup

resp = await pyfetch("/sample-data/electronics-store.html")
html = await resp.string()
soup = BeautifulSoup(html, "html.parser")

# Page title
print("title:", soup.find("h1").get_text())

# Category list
categories = [li.get_text() for li in soup.select("ul#category-list li.cat")]
print("categories:", categories)


# Fetch /sample-data/electronics-store.html, parse it, and print the page's h1 text exactly as it appears.
# Your code here:


# challenge 02/02 · todo

Fetch the same page and print how many distinct categories appear in the category list, in the format 'categories: N' where N is an integer.

# Fetch the sample HTML and parse it with BeautifulSoup.
from pyodide.http import pyfetch
from bs4 import BeautifulSoup

resp = await pyfetch("/sample-data/electronics-store.html")
html = await resp.string()
soup = BeautifulSoup(html, "html.parser")

# Page title
print("title:", soup.find("h1").get_text())

# Category list
categories = [li.get_text() for li in soup.select("ul#category-list li.cat")]
print("categories:", categories)


# Fetch the same page and print how many distinct categories appear in the category list, in the format 'categories: N' where N is an integer.
# Your code here:


# project

# project-challenge

thread: Sales Performance Dashboard · reward: 50 xp

# brief

The legacy system exports sales data as HTML. Extract the sales information from the sale-card markup (one div per sale, one span per field) using regex patterns.
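
As a warm-up for the regex approach named in the brief, here is a minimal sketch on a hypothetical one-card snippet (the rep name and amount are invented and are not part of the project data). It assumes each field is a span with a single class and no nested tags. Regex is workable only because this markup is so regular; for general HTML a parser is the more robust tool.

```python
import re

# Hypothetical one-card snippet; not the project data
snippet = """
<div class="sale-card">
    <span class="rep">Eve Stone</span>
    <span class="revenue">$1,250.00</span>
</div>
"""

# Group 1 captures the class name, group 2 the text between the span tags
pattern = re.compile(r'<span class="(\w+)">([^<]*)</span>')
fields = dict(pattern.findall(snippet))

# Strip the currency symbol and any thousands separator before converting
revenue = float(fields["revenue"].replace("$", "").replace(",", ""))

print(fields)
print(revenue)
```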

# task

Extract Sales from HTML

# your code

import pandas as pd
import io
import re

# HTML sales report from legacy system
html_report = """
<div class="sales-report">
    <h2>Q1 Sales Summary</h2>
    <div class="sale-card">
        <span class="rep">Alice Chen</span>
        <span class="region">North</span>
        <span class="product">Widget Pro</span>
        <span class="revenue">$749.85</span>
    </div>
    <div class="sale-card">
        <span class="rep">Bob Martinez</span>
        <span class="region">South</span>
        <span class="product">Gadget Plus</span>
        <span class="revenue">$239.92</span>
    </div>
    <div class="sale-card">
        <span class="rep">Carol Davis</span>
        <span class="region">East</span>
        <span class="product">Widget Pro</span>
        <span class="revenue">$1099.78</span>
    </div>
    <div class="sale-card">
        <span class="rep">Dan Wilson</span>
        <span class="region">West</span>
        <span class="product">Super Tool</span>
        <span class="revenue">$899.55</span>
    </div>
</div>
"""

# Task:
# 1. Extract rep names, regions, products, and revenues using regex
# 2. Parse revenue strings (remove $ and convert to float)
# 3. Print each sale and the total revenue

