python-mastery

# theory

the capstone

Every lesson up to this one handed you a partial query and asked you to fill in a blank. This one doesn't. You get a real dataset and three questions. You choose the approach.

the dataset

The dataset is the diabetes CSV from Plotly's datasets repository: 768 rows and 9 columns of medical metrics. The starter code fetches it with pyfetch and gives you a DataFrame named df. After that, you're on your own.

columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome
Outcome: 1 = diabetic, 0 = not
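
Before choosing an approach, a quick first look at the frame usually pays off. The following is a minimal sketch on a tiny hand-made frame that borrows three of the column names above. The values are invented for illustration and are not rows from the real CSV, and `toy` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in for df: real column names, made-up values.
toy = pd.DataFrame({
    "Glucose": [100, 120, 140, 160],
    "BMI": [22.0, 25.0, 30.0, 35.0],
    "Outcome": [0, 0, 1, 1],
})

print(toy.shape)                      # (rows, columns)
print(toy.dtypes)                     # confirm the columns are numeric
print(toy["Outcome"].value_counts())  # number of rows per class
print(toy.describe().round(1))        # min / max / mean per column at a glance
```

The same four calls work unchanged on the real df once it is loaded.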

what open-ended means

No starter code beyond loading the data.
No structure-of-the-solution comments.
The validator only checks the printed answer, not how you arrived at it.
A peek-able example solution is hidden below in a collapsed details block. Try it cold first.
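
Because the validator compares printed text, the exact format matters as much as the number. Here is a small sketch of f-string formatting with an arbitrary value. The `label` text and the `ratio` variable are placeholders, not an answer to any challenge.

```python
ratio = 0.256  # arbitrary illustrative value

# :.1f rounds to one decimal place and always prints that decimal.
line = f"label: {ratio * 100:.1f}%"
print(line)  # label: 25.6%

# The trailing zero is kept, which matters when the expected format is X.X
print(f"{0.5 * 100:.1f}%")  # 50.0%
```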

strategy from earlier lessons

Everything from earlier lessons still applies: groupby, apply, vectorized math, and NumPy when speed matters. The point of an open-ended challenge isn't that you need new techniques; it's that nobody is telling you which one to reach for.
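
As a reminder of what those tools look like side by side, here is a minimal sketch on an invented toy frame. The `team` and `points` columns are hypothetical and unrelated to the diabetes data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [10, 20, 30, 40, 50],
})

# 1. groupby: every group aggregated in one pass
by_team = toy.groupby("team")["points"].mean()

# 2. boolean mask: filter one group, then aggregate
mean_b = toy.loc[toy["team"] == "b", "points"].mean()

# 3. vectorized math: whole-column arithmetic, no Python loop
scaled = toy["points"] / toy["points"].max()

# 4. apply: same result as (3), computed element by element (usually slower)
scaled_apply = toy["points"].apply(lambda p: p / 50)

print(by_team["b"], mean_b)               # 40.0 40.0
print(np.allclose(scaled, scaled_apply))  # True
```

None of these is the "right" one in general; which reads best depends on the question being asked.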

a peek at the solution

The reference solution is one short block of code (under 20 lines of actual code). If your draft is creeping past 40 lines, you're probably overbuilding.

<details> <summary><strong>peek the reference solution</strong> (try it without first)</summary>

import io
import pandas as pd
from pyodide.http import pyfetch

URL = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
resp = await pyfetch(URL)
df = pd.read_csv(io.StringIO(await resp.string()))

# Q1: positivity rate
print(f"diabetic rate: {df['Outcome'].mean() * 100:.1f}%")

# Q2: avg Glucose for diabetics vs non-diabetics
mean_by_outcome = df.groupby("Outcome")["Glucose"].mean().round(1)
print(f"diabetic mean Glucose: {mean_by_outcome[1]}")
print(f"non-diabetic mean Glucose: {mean_by_outcome[0]}")

# Q3: highest-BMI age bucket
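# pd.cut bins are right-closed by default: these are (20, 30], (30, 40], ...,
# so an age of exactly 30 is labelled "20s".
df["age_bucket"] = pd.cut(df["Age"], bins=[20, 30, 40, 50, 60, 100],
                         labels=["20s", "30s", "40s", "50s", "60+"])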
top_bucket = df.groupby("age_bucket", observed=True)["BMI"].mean().idxmax()
print(f"highest-BMI age bucket: {top_bucket}")

</details>

# examples [2]

# example 01 · complete pipeline example

Full end-to-end data analysis

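import numpy as np

# `students` is assumed to be preloaded by the lesson environment
# (columns used below: name, subject, score, grade).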

print("=" * 50)
print("STUDENT PERFORMANCE ANALYSIS")
print("=" * 50)

# 1. EXPLORE
print("\n1. DATA OVERVIEW")
print(f"   Total students: {len(students)}")
print(f"   Subjects: {students['subject'].unique().tolist()}")
print(f"   Grades: {students['grade'].unique().tolist()}")

# 2. CLEAN & VALIDATE
print("\n2. DATA QUALITY")
missing = students.isna().sum().sum()
print(f"   Missing values: {missing}")

# 3. ENRICH
students["performance"] = np.where(
    students["score"] >= 90, "Excellent",
    np.where(students["score"] >= 80, "Good", "Average")
)
students["gpa"] = students["grade"].map({"A": 4.0, "B": 3.0, "C": 2.0})

# 4. ANALYZE
print("\n3. KEY METRICS")
print(f"   Average score: {students['score'].mean():.1f}")
print(f"   Average GPA: {students['gpa'].mean():.2f}")

print("\n4. BY SUBJECT")
by_subject = students.groupby("subject").agg(
    count=("name", "count"),
    avg_score=("score", "mean"),
    top_score=("score", "max")
).round(1)
print(by_subject)

print("\n5. TOP PERFORMERS")
top = students[students["performance"] == "Excellent"][["name", "subject", "score"]]
print(top)

print("\n" + "=" * 50)
print("ANALYSIS COMPLETE")
print("=" * 50)
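
The pipeline above expects a `students` DataFrame that is not defined on this page; presumably the lesson environment preloads it. To run the example elsewhere, a hypothetical stand-in with the four columns the code touches (`name`, `subject`, `score`, `grade`) might look like this. Every value is invented.

```python
import pandas as pd

# Hypothetical stand-in for the preloaded frame; all rows are made up.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "subject": ["Math", "Math", "Art", "Art"],
    "score": [95, 82, 78, 91],
    "grade": ["A", "B", "C", "A"],
})

print(students.dtypes)
```

Define this frame first, then run the pipeline code unchanged.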

# example 02 · sales analysis pipeline

Business intelligence from sales data

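import pandas as pd

# `sales` is assumed to be preloaded by the lesson environment
# (columns used below: price, quantity, date, category, product).
print("SALES ANALYSIS REPORT")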
print("=" * 40)

# Prepare data
sales["revenue"] = sales["price"] * sales["quantity"]
sales["date"] = pd.to_datetime(sales["date"])

# Summary stats
print("\nOVERALL METRICS:")
print(f"  Total Revenue: ${sales['revenue'].sum():,.2f}")
print(f"  Total Units: {sales['quantity'].sum():,}")
print(f"  Avg Order Value: ${sales['revenue'].mean():.2f}")

# By category
print("\nBY CATEGORY:")
cat_summary = sales.groupby("category").agg(
    revenue=("revenue", "sum"),
    units=("quantity", "sum"),
    products=("product", "count")
).sort_values("revenue", ascending=False)
print(cat_summary)

# Top products
print("\nTOP 3 PRODUCTS BY REVENUE:")
top_products = (sales.groupby("product")["revenue"]
    .sum()
    .sort_values(ascending=False)
    .head(3))
for product, rev in top_products.items():
    print(f"  {product}: ${rev:,.2f}")
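
The report above reads from a `sales` DataFrame that is not defined on this page. To try it outside the lesson environment, here is a hypothetical stand-in with the columns the code uses (`product`, `category`, `price`, `quantity`, `date`). All rows are invented.

```python
import pandas as pd

# Hypothetical stand-in for the preloaded frame; all rows are made up.
sales = pd.DataFrame({
    "product": ["Pen", "Pen", "Lamp", "Desk"],
    "category": ["Office", "Office", "Home", "Home"],
    "price": [2.0, 2.0, 30.0, 150.0],
    "quantity": [10, 5, 2, 1],
    "date": ["2024-01-05", "2024-01-09", "2024-02-01", "2024-02-14"],
})

# Same first step as the report: derive revenue per row.
sales["revenue"] = sales["price"] * sales["quantity"]
print(sales["revenue"].sum())  # 240.0
```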

# challenges [3]

# challenge 01/03

Using the loaded df, print the share of rows where Outcome == 1 (diabetic) as a percentage with one decimal, in the format 'diabetic rate: X.X%'. The actual answer is around 34.9%. No hint, no starter steps. Pick your own approach.

# Capstone. No scaffolding past the data load.
import io
import pandas as pd
from pyodide.http import pyfetch

URL = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
resp = await pyfetch(URL)
df = pd.read_csv(io.StringIO(await resp.string()))

print(f"loaded: {df.shape[0]} rows x {df.shape[1]} columns")
print(df.head(3))

# Your work below.
# The challenges hold the questions. Pick your own approach.


# Using the loaded df, print the share of rows where Outcome == 1 (diabetic) as a percentage with one decimal, in the format 'diabetic rate: X.X%'. The actual answer is around 34.9%. No hint, no starter steps. Pick your own approach.
# Your code here:

# challenge 02/03

From the same df, compute the mean Glucose for diabetics (Outcome == 1) and for non-diabetics (Outcome == 0). Print 'diabetic mean Glucose: X.X' and 'non-diabetic mean Glucose: Y.Y' (one decimal each). Diabetics should be noticeably higher; that's the whole point of the dataset.

# Capstone. No scaffolding past the data load.
import io
import pandas as pd
from pyodide.http import pyfetch

URL = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
resp = await pyfetch(URL)
df = pd.read_csv(io.StringIO(await resp.string()))

print(f"loaded: {df.shape[0]} rows x {df.shape[1]} columns")
print(df.head(3))

# Your work below.
# The challenges hold the questions. Pick your own approach.


# From the same df, compute the mean Glucose for diabetics (Outcome == 1) and for non-diabetics (Outcome == 0). Print 'diabetic mean Glucose: X.X' and 'non-diabetic mean Glucose: Y.Y' (one decimal each). Diabetics should be noticeably higher; that's the whole point of the dataset.
# Your code here:

# challenge 03/03

Bucket the Age column into 20s, 30s, 40s, 50s, 60+ (use pd.cut). Find which bucket has the highest average BMI. Print 'highest-BMI age bucket: NAME' (one of 20s/30s/40s/50s/60+).
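
One detail worth knowing before reaching for pd.cut: its bins are closed on the right by default. With edges [20, 30, 40], a value of exactly 30 lands in the first bin and 20 itself lands in none. Passing right=False closes the bins on the left instead. Here is a minimal sketch on made-up ages (not taken from the dataset).

```python
import pandas as pd

ages = pd.Series([20, 29, 30, 39])  # invented values
edges = [20, 30, 40]
labels = ["20s", "30s"]

# Default: intervals are (20, 30] and (30, 40]
right_closed = pd.cut(ages, bins=edges, labels=labels)

# right=False: intervals are [20, 30) and [30, 40)
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

print(right_closed.tolist())  # [nan, '20s', '20s', '30s']
print(left_closed.tolist())   # ['20s', '20s', '30s', '30s']
```

Which convention the validator expects is not stated here, so treat the choice of edges as part of the problem.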

# Capstone. No scaffolding past the data load.
import io
import pandas as pd
from pyodide.http import pyfetch

URL = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
resp = await pyfetch(URL)
df = pd.read_csv(io.StringIO(await resp.string()))

print(f"loaded: {df.shape[0]} rows x {df.shape[1]} columns")
print(df.head(3))

# Your work below.
# The challenges hold the questions. Pick your own approach.


# Bucket the Age column into 20s, 30s, 40s, 50s, 60+ (use pd.cut). Find which bucket has the highest average BMI. Print 'highest-BMI age bucket: NAME' (one of 20s/30s/40s/50s/60+).
# Your code here:

# project

# project-challenge

thread: Sales Performance Dashboard · reward: 50 xp

# brief

Build the complete sales performance dashboard combining all techniques: revenue calculation, rep rankings, regional analysis, and category breakdowns. This is your capstone for the sales thread!

# task

Complete Sales Dashboard

# your code

import pandas as pd
import numpy as np
import io

sales_csv = """SaleID,SalesRep,Region,Product,Category,Quantity,UnitPrice,SaleDate,CustomerSegment
S001,Alice Chen,North,Widget Pro,Electronics,15,49.99,2023-01-05,Enterprise
S002,Bob Martinez,South,Gadget Plus,Tools,8,29.99,2023-01-08,SMB
S003,Carol Davis,East,Widget Pro,Electronics,22,49.99,2023-01-10,Enterprise
S004,Dan Wilson,West,Super Tool,Tools,45,19.99,2023-01-12,Consumer
S005,Eva Brown,North,Power Unit,Electronics,10,89.99,2023-01-15,Enterprise
S006,Alice Chen,North,Gadget Plus,Tools,30,29.99,2023-01-18,SMB
S007,Bob Martinez,South,Widget Pro,Electronics,18,49.99,2023-01-20,Consumer
S008,Carol Davis,East,Super Tool,Tools,55,19.99,2023-02-01,SMB
S009,Dan Wilson,West,Power Unit,Electronics,8,89.99,2023-02-05,Enterprise
S010,Eva Brown,North,Widget Basic,Electronics,65,24.99,2023-02-10,Consumer
S011,Alice Chen,North,Super Tool,Tools,40,19.99,2023-02-15,Consumer
S012,Bob Martinez,South,Power Unit,Electronics,12,89.99,2023-02-20,Enterprise
S013,Carol Davis,East,Gadget Plus,Tools,25,29.99,2023-03-01,SMB
S014,Dan Wilson,West,Widget Pro,Electronics,20,49.99,2023-03-05,Enterprise
S015,Eva Brown,North,Widget Basic,Electronics,80,24.99,2023-03-10,Consumer"""

sales = pd.read_csv(io.StringIO(sales_csv))

# Task: Build the complete dashboard with these sections:
# 1. OVERVIEW: Total revenue, total units, avg sale value
# 2. BY REGION: Revenue and unit count per region
# 3. BY REP: Revenue ranking with performance tier (Star/Solid/Developing)
# 4. BY CATEGORY: Revenue and % of total per category
# 5. TOP PRODUCTS: Top 3 products by revenue
# Print a formatted dashboard report
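
The task names three performance tiers but does not say where the cutoffs lie. One possible way to turn per-rep totals into a ranking with tiers is sketched below on made-up numbers. The rep names, the totals, and the 20% / 50% share-of-total cutoffs are all assumptions chosen for illustration, not part of the brief.

```python
import pandas as pd

# Made-up totals; the real ones come from your own revenue calculation.
rep_revenue = pd.Series(
    {"rep_a": 900.0, "rep_b": 500.0, "rep_c": 100.0}, name="revenue"
)

ranked = rep_revenue.sort_values(ascending=False).to_frame()
ranked["rank"] = ranked["revenue"].rank(ascending=False, method="min").astype(int)
ranked["pct_of_total"] = (ranked["revenue"] / ranked["revenue"].sum() * 100).round(1)

# Hypothetical cutoffs: (0, 20] Developing, (20, 50] Solid, (50, 100] Star.
ranked["tier"] = pd.cut(
    ranked["pct_of_total"],
    bins=[0, 20, 50, 100],
    labels=["Developing", "Solid", "Star"],
)
print(ranked)
```

Quantile-based tiers (pd.qcut) or fixed revenue thresholds would be equally defensible; pick one and state it in the report.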
