
[concept] Web & APIs

BeautifulSoup Basics

# theory

BeautifulSoup

BeautifulSoup parses HTML/XML, letting you extract data from web pages:

from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Content</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)  # "Title"
print(soup.p.text)   # "Content"

finding elements

# Find first matching element
soup.find("div")
soup.find("div", class_="container")
soup.find("a", href="/page")

# Find all matching elements
soup.find_all("p")
soup.find_all("a")
soup.find_all("div", class_="item")
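
A minimal sketch of how find and find_all behave, run against a small made-up HTML string (the markup and variable names here are assumptions for illustration, not part of the sample page). The point to notice is what each returns when nothing matches: find gives None, while find_all gives an empty result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch.
html = """
<div class="container">
  <p>First</p>
  <p>Second</p>
  <a href="/page">Link</a>
</div>
<div class="item">A</div>
<div class="item">B</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")                 # first <div> in document order
items = soup.find_all("div", class_="item")  # every <div class="item">
missing = soup.find("table")                 # no <table> in this markup

print(first_div["class"])            # class is multi-valued, so this is a list
print([d.text for d in items])       # text of each matching element
print(missing)                       # find returns None when nothing matches
print(len(soup.find_all("table")))   # find_all returns an empty result instead
```

Because find can return None, it is usually worth checking the result before reading .text or an attribute from it.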

extracting

element = soup.find("a")
element.text           # Text content
element.get("href")    # Attribute value
element["href"]        # Same, but raises KeyError if the attribute is missing
element.get_text(strip=True)  # Clean text
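
To make the difference between the two attribute lookups concrete, here is a small sketch on a made-up snippet with one link that has an href and one that does not (the markup is an assumption for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> has no href attribute.
html = '<p><a href="/page">  Read more  </a><a>No link</a></p>'
soup = BeautifulSoup(html, "html.parser")
with_href, without_href = soup.find_all("a")

print(with_href.get("href"))          # attribute value
print(without_href.get("href"))       # None, because the attribute is missing
print(without_href.get("href", "#"))  # optional fallback value

try:
    without_href["href"]              # subscript access on a missing attribute
except KeyError:
    print("no href on this element")

print(repr(with_href.text))               # surrounding whitespace is kept
print(with_href.get_text(strip=True))     # surrounding whitespace is removed
```

In scraping loops, .get is usually the safer choice, since real pages often contain elements that lack the attribute you expect.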

css selectors

soup.select("div.container")    # Class selector
soup.select("#main")            # ID selector
soup.select("div p")            # Descendant
soup.select("div > p")          # Direct child
soup.select("a[href]")          # Has attribute
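
The selectors above can be tried on a small made-up document (the markup is an assumption for illustration). This sketch also uses select_one, which returns the first match or None, much as find does.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch.
html = """
<div id="main" class="container">
  <p>Direct child</p>
  <section><p>Nested</p></section>
  <a href="/page">Linked</a>
  <a>Unlinked</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = [p.text for p in soup.select("div p")]    # any depth under <div>
children = [p.text for p in soup.select("div > p")]     # immediate children only
linked = [a.text for a in soup.select("a[href]")]       # <a> tags that have href

main = soup.select_one("#main")        # first match for the ID selector
no_table = soup.select_one("table")    # None, since there is no <table>

print(descendants)
print(children)
print(linked)
print(main["class"])
print(no_table)
```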

scraping a list

items = []
for row in soup.find_all("tr"):
    cols = row.find_all("td")
    if cols:
        items.append({
            "name": cols[0].text.strip(),
            "price": cols[1].text.strip()
        })
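
Here is the same loop as a self-contained sketch, run against a small made-up table (the product names and prices are invented for illustration). The header row contains only th cells, so find_all("td") returns nothing for it and the `if cols:` check skips it.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this sketch.
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td> Keyboard </td><td>$49.99</td></tr>
  <tr><td>Mouse</td><td> $19.99 </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

items = []
for row in soup.find_all("tr"):
    cols = row.find_all("td")
    if cols:  # the header row has no <td> cells, so it is skipped
        items.append({
            "name": cols[0].text.strip(),
            "price": cols[1].text.strip(),
        })

for item in items:
    print(item["name"], item["price"])
```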

with pyfetch

On this site we fetch HTML over the network with pyfetch (same-origin requests need no CORS setup), then pass the result of response.string() to BeautifulSoup. The parsing code is the same as in a regular Python script; the only difference is that pyfetch and response.string() must be awaited.

from pyodide.http import pyfetch
from bs4 import BeautifulSoup

response = await pyfetch("/sample-data/electronics-store.html")
html = await response.string()
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").get_text())

There's a real HTML page bundled with this site at /sample-data/electronics-store.html for the examples below. It has an inventory table, category list, and "featured" section so you can practice every common selector against actual markup, not a synthetic string.

# examples [3]

# example 01 · finding by tag and by class

find returns the first match; find_all returns every match. The equivalent CSS selectors, passed to select, reach the same elements.

# example 02 · reading attributes off elements

Real scraping often needs an attribute, not just the text. Both element["attr"] and element.get("attr") read it; the first raises KeyError if the attribute is missing, the second returns None.

# example 03 · walking a table by hand

Lesson 29 uses pd.read_html for this. Doing it by hand once teaches you what read_html is doing under the hood.
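
As a rough sketch of the by-hand approach, the snippet below reads the header cells first and then builds one dict per data row. The table is made up for illustration and is not the markup of the bundled sample page; the id "inventory" and the column names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this sketch.
html = """
<table id="inventory">
  <thead><tr><th>Product</th><th>Stock</th></tr></thead>
  <tbody>
    <tr><td>Laptop</td><td>12</td></tr>
    <tr><td>Monitor</td><td>7</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="inventory")

# Column names come from the <th> cells.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip the header row, which has no <td> cells
        rows.append(dict(zip(headers, cells)))

for row in rows:
    print(row)
```

Every value comes out as a string here, so numeric columns would still need converting (for example with int or float) before any arithmetic.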


# challenges [2]

# challenge 01/02 · todo
Fetch /sample-data/electronics-store.html, parse it, and print the page's h1 text exactly as it appears.
# challenge 02/02 · todo
Fetch the same page and print how many distinct categories appear in the category list, in the format 'categories: N' where N is an integer.

# project

# project-challenge

thread: Sales Performance Dashboard · reward: 50 xp

# brief

The legacy system exports sales data as HTML. Extract the sales information from the HTML table structure using regex patterns.

# task

Extract Sales from HTML

# your code