[concept]Data Cleaning
Datetime Handling
# theory
parsing dates
Most CSVs give you dates as strings. pd.to_datetime parses them. It's smart about common formats: 2024-01-15, 01/15/2024, Jan 15 2024 all work.
df["date"] = pd.to_datetime(df["date"])
If parsing fails on a row, you'll get an error. Use errors="coerce" to turn bad rows into NaT (missing) instead:
df["date"] = pd.to_datetime(df["date"], errors="coerce")
If the format is weird and pandas can't guess, tell it:
df["date"] = pd.to_datetime(df["date"], format="%d-%b-%Y")
the .dt accessor
Once a column is datetime, .dt gives you parts:
df["date"].dt.year # 2024
df["date"].dt.month # 1
df["date"].dt.day # 15
df["date"].dt.day_name() # "Monday"
df["date"].dt.weekday # 0 (Monday)
df["date"].dt.is_month_end # True/False
date arithmetic
Subtracting two datetimes gives you a Timedelta:
df["days_since"] = (pd.Timestamp.today() - df["date"]).dt.days
Adding offsets:
df["next_week"] = df["date"] + pd.Timedelta(days=7)
df["next_month"] = df["date"] + pd.DateOffset(months=1)
filtering by date
Once parsed, comparisons just work:
recent = df[df["date"] >= "2024-01-01"]
last_30_days = df[df["date"] >= pd.Timestamp.today() - pd.Timedelta(days=30)]
formatting
strftime formats a datetime as a string:
df["date_str"] = df["date"].dt.strftime("%Y-%m-%d")
df["pretty"] = df["date"].dt.strftime("%B %d, %Y") # "January 15, 2024"
Common format codes: %Y year, %m month, %d day, %H hour (24h), %M minute, %B month name, %A day name.
# examples [3]
errors='coerce' converts unparseable values to NaT instead of raising
Day name, month, weekday flag, all from the same column
Today minus a Timedelta is the cutoff
# challenges [2]