python-mastery

# theory

the pipeline

Three steps, three functions, one source of truth: a live JSON API. We'll use the /posts and /users endpoints from jsonplaceholder.

Extract with pyfetch. Hit each endpoint, check status, get JSON.
Transform with pandas. Convert lists of dicts to DataFrames, merge on userId, classify.
Load with print/return. Real pipelines write to CSV, a database, or a dashboard; in the browser we render.

Wrapping each phase in a function is what turns a pile of fetch calls into a pipeline. It also makes the pieces testable in isolation.
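
Because transform is a plain function of its inputs, it can be exercised with a handful of hand-written records and no network call. A minimal sketch, assuming a plain-Python stand-in for the pandas merge; the names `transform_plain` and `count_by_author` and the fixture records are hypothetical, not part of the lesson's pipeline:

```python
from collections import Counter


def transform_plain(posts_raw, users_raw):
    """Join posts to author names on userId (plain-Python stand-in for the pandas merge)."""
    names = {u["id"]: u["name"] for u in users_raw}
    # Inner-join semantics: posts whose userId has no matching user are dropped.
    return [
        {**p, "name": names[p["userId"]]}
        for p in posts_raw
        if p["userId"] in names
    ]


def count_by_author(joined):
    """Count joined rows per author name."""
    return Counter(row["name"] for row in joined)


# Hand-written fixtures stand in for the API payloads.
posts = [
    {"id": 1, "userId": 1, "title": "a"},
    {"id": 2, "userId": 1, "title": "b"},
    {"id": 3, "userId": 2, "title": "c"},
    {"id": 4, "userId": 99, "title": "orphan"},
]
users = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]

joined = transform_plain(posts, users)
counts = count_by_author(joined)
assert len(joined) == 3
assert counts == {"Ann": 2, "Bo": 1}
```

The same fixtures could be passed to the pandas transform; the point is only that no fetch is needed to check the join logic.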

skeleton

import pandas as pd
from pyodide.http import pyfetch

async def extract():
    # Fetch both endpoints and fail fast if either one is unavailable.
    posts_resp = await pyfetch("https://jsonplaceholder.typicode.com/posts")
    users_resp = await pyfetch("https://jsonplaceholder.typicode.com/users")
    if posts_resp.status != 200 or users_resp.status != 200:
        raise RuntimeError("upstream API not available")
    return await posts_resp.json(), await users_resp.json()

def transform(posts_raw, users_raw):
    # Build new frames from the raw lists; the inputs are not modified.
    posts = pd.DataFrame(posts_raw)
    users = pd.DataFrame(users_raw)[["id", "name"]]
    # Both frames have an "id" column, so the suffixes keep them apart.
    return posts.merge(users, left_on="userId", right_on="id", suffixes=("_post", "_user"))

def load(df):
    # Count joined rows per author, largest first.
    counts = df.groupby("name").size().sort_values(ascending=False)
    print(counts)
    return counts

logging

The browser console can swallow errors mid-pipeline. A tiny log helper makes it obvious where a run stopped.

from datetime import datetime
def log(step, msg):
    print(f"[{datetime.now():%H:%M:%S}] [{step}] {msg}")
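
One way to make the stopping point explicit is to wrap each phase so that it logs when it starts, finishes, or fails. A minimal sketch, assuming a hypothetical `run_step` wrapper and a made-up `fake_transform` phase; it reuses the `log` helper above and re-raises the error so the failure is not hidden:

```python
from datetime import datetime


def log(step, msg):
    print(f"[{datetime.now():%H:%M:%S}] [{step}] {msg}")


def run_step(step, fn, *args):
    """Run one pipeline phase, logging start, success, or failure."""
    log(step, "start")
    try:
        result = fn(*args)
    except Exception as exc:
        log(step, f"FAILED: {type(exc).__name__}: {exc}")
        raise
    log(step, "done")
    return result


def fake_transform(rows):
    # Hypothetical phase that fails on empty input.
    if not rows:
        raise ValueError("no rows to transform")
    return [r * 2 for r in rows]


doubled = run_step("TRANSFORM", fake_transform, [1, 2, 3])

try:
    run_step("TRANSFORM", fake_transform, [])
except ValueError:
    pass  # the FAILED log line already shows which step stopped the run
```

This wrapper only handles synchronous phases. An async phase such as extract would need a variant that awaits the call.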

idempotence

A pipeline you can re-run safely is worth ten times one you can't. That usually means:

Extract is side-effect-free (just fetching, not mutating server state)
Transform takes raw data and returns new data, never mutates inputs
Load either overwrites the destination or uses an upsert key
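
The last two points can be sketched with an in-memory destination. This assumes a hypothetical `load_upsert` function and a dict standing in for a real table, keyed on `id`; re-running the load leaves the destination unchanged instead of appending duplicates:

```python
def transform(rows):
    """Return new records; the input rows are left untouched."""
    return [{**r, "title": r["title"].strip().lower()} for r in rows]


def load_upsert(destination, rows, key="id"):
    """Insert or overwrite each row by its key, so repeated runs converge."""
    for row in rows:
        destination[row[key]] = row
    return destination


raw = [{"id": 1, "title": "  Hello "}, {"id": 2, "title": "World"}]
table = {}  # stand-in for a database table or keyed file

load_upsert(table, transform(raw))
first_run = dict(table)
load_upsert(table, transform(raw))  # re-run: same keys, same values

assert table == first_run
assert len(table) == 2
assert raw[0]["title"] == "  Hello "  # transform did not mutate its input
```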

# examples [3]

# example 01 · extract-only: pull and store before transforming

Real pipelines often separate extract from transform so the network step can be re-run independently. Cache the raw payload.

import pandas as pd
from pyodide.http import pyfetch

async def extract():
    resp = await pyfetch("https://jsonplaceholder.typicode.com/comments")
    if resp.status != 200:
        raise RuntimeError(f"comments: status {resp.status}")
    return await resp.json()

raw = await extract()
print(f"raw rows: {len(raw)}")
print("first record keys:", list(raw[0].keys()))

# Hold the raw payload as a frame so transform can be run repeatedly without re-fetching.
raw_df = pd.DataFrame(raw)
print(raw_df.head(2))
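
To make re-runs cheap, the raw payload can also be kept in a small cache keyed by URL. A sketch under assumptions: `extract_cached` and the injected `fetch_json` coroutine are hypothetical names, and a fake fetcher stands in for `pyfetch` so the example runs outside the browser:

```python
import asyncio


async def extract_cached(url, fetch_json, cache):
    """Return the payload for url, fetching only on the first request."""
    if url not in cache:
        cache[url] = await fetch_json(url)
    return cache[url]


calls = []


async def fake_fetch_json(url):
    # Stand-in for a real network call; records each request it receives.
    calls.append(url)
    return [{"id": 1, "body": "example"}]


async def main():
    cache = {}
    url = "https://jsonplaceholder.typicode.com/comments"
    first = await extract_cached(url, fake_fetch_json, cache)
    second = await extract_cached(url, fake_fetch_json, cache)
    return first, second


first, second = asyncio.run(main())
assert first is second
assert len(calls) == 1  # the second request was served from the cache
```

In a runtime with top-level await, `await main()` would likely take the place of `asyncio.run(main())`.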

# example 02 · transform: classify by a derived field

Pull comments, group them by the post they belong to, and bin each post into an engagement tier by comment count. The average comment length per post is computed alongside.

import pandas as pd
from pyodide.http import pyfetch

resp = await pyfetch("https://jsonplaceholder.typicode.com/comments")
comments = pd.DataFrame(await resp.json())

per_post = comments.groupby("postId").agg(
    comment_count=("id", "count"),
    avg_body_chars=("body", lambda s: s.str.len().mean()),
).round(0).astype(int)

per_post["engagement"] = pd.cut(
    per_post["comment_count"],
    bins=[-1, 2, 5, 100],
    labels=["low", "medium", "high"],
)
print(per_post.head(10))
print("\nposts by engagement tier:")
print(per_post["engagement"].value_counts())
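
By default `pd.cut` treats the bin edges as right-closed intervals, so the bins above are (-1, 2], (2, 5], and (5, 100]. A plain-Python sketch of the same rule, using a hypothetical `tier` helper built on `bisect`, shows where boundary values land:

```python
from bisect import bisect_left

# Upper edges of the "low" and "medium" bins; anything above 5 is "high".
EDGES = [2, 5]
LABELS = ["low", "medium", "high"]


def tier(comment_count):
    """Map a count to its tier using right-closed intervals, like pd.cut's default."""
    return LABELS[bisect_left(EDGES, comment_count)]


for n in (0, 2, 3, 5, 6):
    print(n, tier(n))
```

A count of exactly 2 stays in "low" and exactly 5 stays in "medium". Unlike `pd.cut`, this helper does not produce a missing value for counts outside the outermost edges.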

# example 03 · load: write the summary back to a CSV string

In a browser we can't write a real file, but we can build the same CSV that load() would write to disk.

import pandas as pd
import io
from pyodide.http import pyfetch

resp = await pyfetch("https://jsonplaceholder.typicode.com/todos")
todos = pd.DataFrame(await resp.json())

summary = todos.groupby("userId").agg(
    total=("id", "count"),
    completed=("completed", "sum"),
)
summary["completion_pct"] = (summary["completed"] / summary["total"] * 100).round(1)

# What you'd write to a file in a real script:
buf = io.StringIO()
summary.to_csv(buf)
csv_text = buf.getvalue()

print("first 200 chars of the CSV that load() would persist:\n")
print(csv_text[:200])
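
A quick way to check what load() would persist is to parse the CSV text back in. A minimal sketch using the standard library's `csv` module instead of pandas; the summary rows are made-up values, not real output from the todos endpoint:

```python
import csv
import io

# Made-up summary rows standing in for the per-user todo summary.
rows = [
    {"userId": 1, "total": 20, "completed": 11},
    {"userId": 2, "total": 20, "completed": 8},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["userId", "total", "completed"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Parse the text back; csv returns every field as a string.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
assert [int(r["completed"]) for r in parsed] == [11, 8]
assert parsed[0]["userId"] == "1"
```

The round trip loses type information, which is why the values are converted with `int` before comparing.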

# challenges [2]

# challenge 01/02 · todo

Fetch all todos from https://jsonplaceholder.typicode.com/todos. Group by userId and compute the number of completed todos per user. Print the user with the most completed todos in the format 'top user: N has K completed' where N and K are integers.

# Live ETL pipeline: jsonplaceholder posts + users → per-user post counts.
import pandas as pd
from datetime import datetime
from pyodide.http import pyfetch

def log(step, msg):
    print(f"[{datetime.now():%H:%M:%S}] [{step}] {msg}")

async def extract():
    log("EXTRACT", "fetching posts and users...")
    posts_resp = await pyfetch("https://jsonplaceholder.typicode.com/posts")
    users_resp = await pyfetch("https://jsonplaceholder.typicode.com/users")
    if posts_resp.status != 200 or users_resp.status != 200:
        raise RuntimeError("upstream API not available")
    posts = await posts_resp.json()
    users = await users_resp.json()
    log("EXTRACT", f"{len(posts)} posts, {len(users)} users")
    return posts, users

def transform(posts_raw, users_raw):
    log("TRANSFORM", "merging on userId and tagging long posts...")
    posts = pd.DataFrame(posts_raw)
    users = pd.DataFrame(users_raw)[["id", "name"]]
    joined = posts.merge(users, left_on="userId", right_on="id", suffixes=("_post", "_user"))
    joined["body_chars"] = joined["body"].str.len()
    joined["long_post"] = joined["body_chars"] > joined["body_chars"].median()
    log("TRANSFORM", f"{len(joined)} rows after merge")
    return joined

def load(df):
    log("LOAD", "summarizing per user...")
    by_user = df.groupby("name").agg(
        posts=("title", "count"),
        long_posts=("long_post", "sum"),
        avg_body_chars=("body_chars", "mean"),
    ).round(0).astype(int).sort_values("posts", ascending=False)
    print(by_user)
    return by_user

raw_posts, raw_users = await extract()
joined = transform(raw_posts, raw_users)
summary = load(joined)


# Fetch all todos from https://jsonplaceholder.typicode.com/todos. Group by userId and compute the number of completed todos per user. Print the user with the most completed todos in the format 'top user: N has K completed' where N and K are integers.
# Your code here:

# challenge 02/02 · todo

Build a full extract/transform/load pipeline that: extracts /posts and /users from jsonplaceholder, merges them on userId, and prints 'pipeline ok: N rows for M users' (N is total joined rows, M is unique authors).

# Live ETL pipeline: jsonplaceholder posts + users → per-user post counts.
import pandas as pd
from datetime import datetime
from pyodide.http import pyfetch

def log(step, msg):
    print(f"[{datetime.now():%H:%M:%S}] [{step}] {msg}")

async def extract():
    log("EXTRACT", "fetching posts and users...")
    posts_resp = await pyfetch("https://jsonplaceholder.typicode.com/posts")
    users_resp = await pyfetch("https://jsonplaceholder.typicode.com/users")
    if posts_resp.status != 200 or users_resp.status != 200:
        raise RuntimeError("upstream API not available")
    posts = await posts_resp.json()
    users = await users_resp.json()
    log("EXTRACT", f"{len(posts)} posts, {len(users)} users")
    return posts, users

def transform(posts_raw, users_raw):
    log("TRANSFORM", "merging on userId and tagging long posts...")
    posts = pd.DataFrame(posts_raw)
    users = pd.DataFrame(users_raw)[["id", "name"]]
    joined = posts.merge(users, left_on="userId", right_on="id", suffixes=("_post", "_user"))
    joined["body_chars"] = joined["body"].str.len()
    joined["long_post"] = joined["body_chars"] > joined["body_chars"].median()
    log("TRANSFORM", f"{len(joined)} rows after merge")
    return joined

def load(df):
    log("LOAD", "summarizing per user...")
    by_user = df.groupby("name").agg(
        posts=("title", "count"),
        long_posts=("long_post", "sum"),
        avg_body_chars=("body_chars", "mean"),
    ).round(0).astype(int).sort_values("posts", ascending=False)
    print(by_user)
    return by_user

raw_posts, raw_users = await extract()
joined = transform(raw_posts, raw_users)
summary = load(joined)


# Build a full extract/transform/load pipeline that: extracts /posts and /users from jsonplaceholder, merges them on userId, and prints 'pipeline ok: N rows for M users' (N is total joined rows, M is unique authors).
# Your code here:

# project

# project-challenge

thread: Sales Performance Dashboard · reward: 50 xp

# brief

Create a complete ETL pipeline for the sales dashboard. Extract data from the source, transform it with revenue calculations and categorization, then load a summary report.

# task

Build Sales ETL Pipeline

# your code

import pandas as pd
import io

sales_csv = """SaleID,SalesRep,Region,Product,Category,Quantity,UnitPrice,SaleDate,CustomerSegment
S001,Alice Chen,North,Widget Pro,Electronics,15,49.99,2023-01-05,Enterprise
S002,Bob Martinez,South,Gadget Plus,Tools,8,29.99,2023-01-08,SMB
S003,Carol Davis,East,Widget Pro,Electronics,22,49.99,2023-01-10,Enterprise
S004,Dan Wilson,West,Super Tool,Tools,45,19.99,2023-01-12,Consumer
S005,Eva Brown,North,Power Unit,Electronics,10,89.99,2023-01-15,Enterprise
S006,Alice Chen,North,Gadget Plus,Tools,30,29.99,2023-01-18,SMB
S007,Bob Martinez,South,Widget Pro,Electronics,18,49.99,2023-01-20,Consumer
S008,Carol Davis,East,Super Tool,Tools,55,19.99,2023-02-01,SMB
S009,Dan Wilson,West,Power Unit,Electronics,8,89.99,2023-02-05,Enterprise
S010,Eva Brown,North,Widget Basic,Electronics,65,24.99,2023-02-10,Consumer
S011,Alice Chen,North,Super Tool,Tools,40,19.99,2023-02-15,Consumer
S012,Bob Martinez,South,Power Unit,Electronics,12,89.99,2023-02-20,Enterprise
S013,Carol Davis,East,Gadget Plus,Tools,25,29.99,2023-03-01,SMB
S014,Dan Wilson,West,Widget Pro,Electronics,20,49.99,2023-03-05,Enterprise
S015,Eva Brown,North,Widget Basic,Electronics,80,24.99,2023-03-10,Consumer"""

# Task: Build a complete ETL pipeline with 3 functions:
# 1. extract() - Load and return the sales DataFrame
# 2. transform(df) - Add Revenue, SaleTier (High/Medium/Low based on revenue)
# 3. load(df) - Print summary by Region and by SaleTier
# Run the full pipeline and print results

Building a Data Pipeline
