[challenge] Functions & Apply
Capstone Project
# theory
the capstone
Every lesson up to this one handed you a partial query and asked you to fill in a blank. This one doesn't. You get a real dataset and three questions. You choose the approach.
the dataset
The Plotly diabetes dataset: 768 rows and 9 columns, eight medical metrics plus an Outcome label. The starter code fetches it with pyfetch and gives you a DataFrame named df. After that, you're on your own.
columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome
Outcome: 1 = diabetic, 0 = not
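Before answering anything, it helps to look at what you were handed. The sketch below runs a quick first-look pass on a tiny stand-in frame. The column names match the list above, but the four rows are invented for illustration and the name `sample` is hypothetical. In the challenge you would run the same calls on df.

```python
import pandas as pd

# Hypothetical stand-in: real column names, made-up values (NOT rows from the dataset).
sample = pd.DataFrame({
    "Pregnancies": [1, 0, 3, 2],
    "Glucose": [90, 150, 110, 170],
    "BloodPressure": [70, 80, 64, 76],
    "SkinThickness": [20, 35, 0, 30],
    "Insulin": [80, 0, 0, 120],
    "BMI": [24.5, 33.1, 28.0, 36.2],
    "DiabetesPedigreeFunction": [0.2, 0.6, 0.35, 0.8],
    "Age": [25, 41, 33, 52],
    "Outcome": [0, 1, 0, 1],
})

# How big is it, and what type is each column?
print(sample.shape)
print(sample.dtypes)

# How many rows fall under each Outcome value?
print(sample["Outcome"].value_counts())

# Range of every numeric column, to spot odd values early.
print(sample.describe().loc[["min", "max"]])
```

On the real df, `shape` should agree with the 768 rows and 9 columns described above.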
what open-ended means
- No starter code beyond loading the data.
- No structure-of-the-solution comments.
- The validator only checks the printed answer, not how you arrived at it.
- A peek-able example solution is hidden below in a collapsed details block. Try it cold first.
strategy from earlier lessons
Everything from the earlier lessons still applies: groupby, apply, vectorized math, NumPy when speed matters. The point of an open-ended challenge isn't that you need new techniques; it's that nobody is telling you which one to reach for.
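As a refresher, here is a minimal sketch of those tools side by side. It uses a made-up three-column frame called `toy` and a hypothetical cutoff `THRESHOLD`. Neither comes from the challenge, and the task shown is deliberately not one of the three questions.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Glucose": [90, 150, 110, 170],
    "BMI": [24.5, 33.1, 28.0, 36.2],
    "Outcome": [0, 1, 0, 1],
})
THRESHOLD = 140  # hypothetical cutoff, chosen only for this example

# apply: runs a Python function once per element. Flexible, but slow on big frames.
via_apply = toy["Glucose"].apply(lambda g: "high" if g >= THRESHOLD else "normal")

# NumPy: evaluates the whole column at once. Same labels, no Python-level loop.
via_numpy = pd.Series(
    np.where(toy["Glucose"] >= THRESHOLD, "high", "normal"), index=toy.index
)

# Vectorized math: arithmetic on a whole column in one expression.
bmi_centered = toy["BMI"] - toy["BMI"].mean()

# groupby: one summary row per group.
bmi_by_outcome = toy.groupby("Outcome")["BMI"].agg(["mean", "max"])

print(via_apply.tolist() == via_numpy.tolist())
print(bmi_centered.round(2).tolist())
print(bmi_by_outcome)
```

The apply and NumPy versions produce the same labels. Which one you reach for is a judgment call about readability versus speed, and that is exactly the choice this challenge leaves to you.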
a peek at the solution
The reference solution is short: under 20 lines in total. If your draft is creeping past 40 lines, you're probably overbuilding.
<details> <summary><strong>peek the reference solution</strong> (try it without first)</summary>
import io
import pandas as pd
from pyodide.http import pyfetch
URL = "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
resp = await pyfetch(URL)
df = pd.read_csv(io.StringIO(await resp.string()))
# Q1: positivity rate
print(f"diabetic rate: {df['Outcome'].mean() * 100:.1f}%")
# Q2: avg Glucose for diabetics vs non-diabetics
mean_by_outcome = df.groupby("Outcome")["Glucose"].mean().round(1)
print(f"diabetic mean Glucose: {mean_by_outcome[1]}")
print(f"non-diabetic mean Glucose: {mean_by_outcome[0]}")
# Q3: highest-BMI age bucket
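# pd.cut defaults to right=True, so each bin is (lo, hi]: the "20s" bucket holds ages above 20 up to and including 30
df["age_bucket"] = pd.cut(df["Age"], bins=[20, 30, 40, 50, 60, 100],
                          labels=["20s", "30s", "40s", "50s", "60+"])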
top_bucket = df.groupby("age_bucket", observed=True)["BMI"].mean().idxmax()
print(f"highest-BMI age bucket: {top_bucket}")
</details>
# examples [2]
Full end-to-end data analysis
Business intelligence from sales data
# challenges [3]
# project
# project-challenge
thread: Sales Performance Dashboard · reward: 50 xp
# brief
Build the complete sales performance dashboard combining all techniques: revenue calculation, rep rankings, regional analysis, and category breakdowns. This is your capstone for the sales thread!
# task