[challenge] Web & APIs

Building a Data Pipeline

# theory

the pipeline

Three steps, three functions, one source of truth: a live JSON API. We'll use the /posts and /users endpoints from jsonplaceholder (https://jsonplaceholder.typicode.com).

  1. Extract with pyfetch. Hit each endpoint, check status, get JSON.
  2. Transform with pandas. Convert lists of dicts to DataFrames, merge on userId, classify.
  3. Load with print/return. Real pipelines write to CSV, a database, or a dashboard; in the browser we render.

Wrapping each phase in a function is what turns a pile of fetch calls into a pipeline. It also makes the pieces testable in isolation.

skeleton

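import pandas as pd
from pyodide.http import pyfetch  # Pyodide's async fetch wrapper (browser only)

async def extract():
    # Fetch both endpoints and fail fast if either one is unavailable.
    posts_resp = await pyfetch("https://jsonplaceholder.typicode.com/posts")
    users_resp = await pyfetch("https://jsonplaceholder.typicode.com/users")
    if posts_resp.status != 200 or users_resp.status != 200:
        raise RuntimeError("upstream API not available")
    return await posts_resp.json(), await users_resp.json()

def transform(posts_raw, users_raw):
    # Lists of dicts become DataFrames; keep only the user columns we need.
    posts = pd.DataFrame(posts_raw)
    users = pd.DataFrame(users_raw)[["id", "name"]]
    # Both frames have an "id" column, so the suffixes keep them apart after the merge.
    return posts.merge(users, left_on="userId", right_on="id", suffixes=("_post", "_user"))

def load(df):
    # Posts per author, most prolific first.
    counts = df.groupby("name").size().sort_values(ascending=False)
    print(counts)
    return counts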
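
One way to wire the three functions together is a small driver that takes them as arguments. This is only a sketch: run_pipeline is a hypothetical name, and it assumes extract is async and returns a tuple that transform accepts as positional arguments, as in the skeleton above.

```python
async def run_pipeline(extract, transform, load):
    """Run extract -> transform -> load and return load's result.

    Hypothetical driver: the three phases are passed in, so each one
    can be swapped for a stand-in when testing.
    """
    raw = await extract()     # async network step; returns a tuple of payloads
    tidy = transform(*raw)    # pure step: raw payloads in, new data out
    return load(tidy)         # output step
```

In a runtime with top-level await, such as a Pyodide cell, this could be called as `counts = await run_pipeline(extract, transform, load)`. Elsewhere, wrap the call in asyncio.run(...).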

logging

The browser console can swallow errors mid-pipeline. A tiny log helper makes it obvious where a run stopped.

from datetime import datetime

def log(step, msg):
    # Timestamped, step-tagged line, e.g. "[14:03:07] [extract] fetched posts"
    print(f"[{datetime.now():%H:%M:%S}] [{step}] {msg}")
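
To make the "where did it stop" part concrete, here is a hedged sketch of a wrapper that logs the start, end, and failure of a step. run_step is a hypothetical helper, not part of the lesson, and it only covers synchronous steps such as transform and load; an async extract would need an awaited variant.

```python
from datetime import datetime

def log(step, msg):
    print(f"[{datetime.now():%H:%M:%S}] [{step}] {msg}")

def run_step(step, fn, *args):
    """Call fn(*args), logging start, success, or failure under the step name."""
    log(step, "start")
    try:
        result = fn(*args)
    except Exception as exc:
        # Log first, then re-raise so the failure is never silently swallowed.
        log(step, f"failed: {exc!r}")
        raise
    log(step, "done")
    return result
```

A run that prints "start" for a step but never "done" tells you which phase to look at.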

idempotence

A pipeline you can re-run safely is worth ten times one you can't. That usually means:

  • Extract is side-effect-free (just fetching, not mutating server state)
  • Transform takes raw data and returns new data, never mutates inputs
  • Load either overwrites the destination or uses an upsert key
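
The last two bullets can be sketched with plain Python. Everything here is illustrative: add_title_len and upsert are hypothetical names, and a dict stands in for the destination table.

```python
def add_title_len(rows):
    """Transform: return new rows with a derived field; input rows are left untouched."""
    return [{**row, "title_len": len(row["title"])} for row in rows]

def upsert(destination, rows, key="id"):
    """Load: write rows into destination (a dict standing in for a table), keyed by `key`."""
    for row in rows:
        # Same key means overwrite, so a second run never creates duplicates.
        destination[row[key]] = dict(row)
    return destination
```

Running `upsert(table, add_title_len(raw))` twice leaves table in the same state as running it once, which is the property that makes a re-run safe.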

# examples [3]

# example 01 · extract-only: pull and store before transforming

Real pipelines often separate extract from transform so the network step can be re-run independently. Cache the raw payload.
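
A minimal sketch of that idea, under stated assumptions: extract_cached is a hypothetical helper, fetch_json is any async callable that takes a URL and returns parsed JSON, and the cache is a plain dict owned by the caller. The fetcher is passed in so the caching logic can be tried without a network.

```python
import json

async def extract_cached(url, fetch_json, cache):
    """Return the raw payload for `url`, calling `fetch_json` only on a cache miss."""
    if url not in cache:
        payload = await fetch_json(url)
        # Store serialized text so later code cannot mutate the cached copy.
        cache[url] = json.dumps(payload)
    # Every call gets a fresh, independent copy of the raw payload.
    return json.loads(cache[url])
```

In Pyodide, fetch_json could be a thin wrapper around pyfetch that checks the status and returns `await resp.json()`. Re-running the transform then reuses the cached payload instead of hitting the API again.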

# example 02 · transform: classify by a derived field

Pull comments, group by the post they belong to, flag posts that attract long comments.
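
A hedged sketch of the transform step only, assuming each comment record carries postId and body fields (as jsonplaceholder's /comments does). The function name flag_long_comment_posts and the 150-character threshold are illustrative choices, not values from the lesson.

```python
import pandas as pd

def flag_long_comment_posts(comments_raw, min_avg_len=150):
    """Summarize comments per post and flag posts whose average comment is long."""
    comments = pd.DataFrame(comments_raw)
    # Derived field: length of each comment body, added to a new frame.
    comments = comments.assign(body_len=comments["body"].str.len())
    per_post = (
        comments.groupby("postId")["body_len"]
        .agg(n_comments="count", avg_len="mean")
        .reset_index()
    )
    # Classification: True when the average comment length reaches the threshold.
    per_post["long_comments"] = per_post["avg_len"] >= min_avg_len
    return per_post
```

The result has one row per post, so it can be merged back onto /posts by postId if titles are needed.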

# example 03 · load: write the summary back to a CSV string

In a browser we can't write a real file, but we can build the same CSV that load() would write to disk.
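
One possible version, assuming the merged DataFrame has a name column as in the skeleton's transform. summary_to_csv is a hypothetical name and the posts column label is an illustrative choice.

```python
import pandas as pd

def summary_to_csv(df):
    """Build the posts-per-author summary and return it as CSV text."""
    counts = df.groupby("name").size().sort_values(ascending=False)
    # Name the count column and turn the index back into a column for a tidy header row.
    summary = counts.rename("posts").reset_index()
    # With no path given, to_csv returns the CSV as a string instead of writing a file.
    return summary.to_csv(index=False)
```

The returned string can be printed, shown in the page, or handed to a download link. Passing a file path to to_csv instead would write the same content to disk.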


# challenges [2]

# challenge 01/02 · todo

Fetch all todos from https://jsonplaceholder.typicode.com/todos. Group by userId and count the completed todos per user. Print the user with the most completed todos in the format 'top user: N has K completed', where N is the userId and K is that user's completed count (both integers).

# challenge 02/02 · todo

Build a full extract/transform/load pipeline that extracts /posts and /users from jsonplaceholder, merges them on userId, and prints 'pipeline ok: N rows for M users' (N is the total number of joined rows, M is the number of unique authors).

# project

# project-challenge

thread: Sales Performance Dashboard · reward: 50 xp

# brief

Create a complete ETL pipeline for the sales dashboard. Extract data from the source, transform it with revenue calculations and categorization, then load a summary report.

# task

Build Sales ETL Pipeline

# your code