[concept]Data Manipulation (WCTC)
Working with Numbers & Types
# theory
checking dtypes
First thing you wanna do with any new dataset is check what types you're working with.
df.dtypes # see all column types
df["col"].dtype # just one column
df.info() # types plus null counts
Common types you'll see:
int64; integersfloat64; decimalsobject; usually strings (the label is misleading)bool; True/Falsedatetime64; dates
astype
When pandas reads a CSV, it guesses types. Sometimes it guesses wrong.
# Convert to different types
df["price"] = df["price"].astype(float)
df["quantity"] = df["quantity"].astype(int)
df["active"] = df["active"].astype(bool)
If the conversion isn't possible (like turning "hello" into an int), it'll throw an error. Which is actually helpful.
pd.to_numeric for messy data
Here's the thing; astype is strict. If you have even one bad value, it fails. pd.to_numeric is more forgiving.
# errors='coerce' turns bad values into NaN instead of crashing
df["price"] = pd.to_numeric(df["price"], errors="coerce")
# errors='ignore' just leaves bad values as-is
df["price"] = pd.to_numeric(df["price"], errors="ignore")
I use errors="coerce" way more than I expected. Real data is messy.
rounding
df["price"].round(2) # round to 2 decimal places
df["price"].round(0) # round to whole number
df["price"].round(-1) # round to nearest 10
You can also use numpy functions:
import numpy as np
np.floor(df["price"]) # always round down
np.ceil(df["price"]) # always round up
formatting via apply
For display formatting, apply is your friend.
# Format as currency
df["price"].apply(lambda x: f"${x:,.2f}")
# Format as percentage
df["rate"].apply(lambda x: f"{x:.1%}")
# Add commas to large numbers
df["population"].apply(lambda x: f"{x:,}")
mixed-type columns
Sometimes a column has both numbers and text. This happens more than you'd think.
# Check if values are numeric
pd.to_numeric(df["col"], errors="coerce").notna()
# Find the non-numeric rows
mask = pd.to_numeric(df["col"], errors="coerce").isna()
print(df[mask]) # shows you the problem rows
This is super useful for debugging data quality issues.
# examples [3]
Handle a column that has some non-numeric values mixed in
Make numbers look nice for display
Sometimes you need multiple conversions to get data right
# challenges [2]