[concept]Data Manipulation (WCTC)
String Methods in Pandas
# theory
the .str accessor
Pandas exposes a str accessor that lets you call string methods across an entire column at once. It's one of the most useful features once you know it's there.
df["name"].str.upper() # uppercase everything
df["name"].str.lower() # lowercase everything
df["name"].str.strip() # remove leading/trailing whitespace
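Without .str you'd have to loop through the rows yourself or reach for apply. With it, each method runs on every value in the column and returns a new Series; the original column stays unchanged unless you assign the result back.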
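To make the loop-versus-accessor comparison concrete, here is a small sketch. The DataFrame and its values are made up for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
df = pd.DataFrame({"name": ["  alice ", "BOB", "Carol  "]})

# Row-by-row version: an explicit Python loop over the values
looped = [value.strip().lower() for value in df["name"]]

# Column-wide version: .str applies each method to every value
vectorized = df["name"].str.strip().str.lower()

# Both approaches produce the same cleaned values
print(looped)
print(vectorized.tolist())

# Assign the result back if you want to keep it
df["name"] = vectorized
```

Both versions give the same result here; the .str version is shorter and also passes missing values through instead of crashing on them.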
common methods
Changing case:
df["text"].str.upper() # ALL CAPS
df["text"].str.lower() # all lowercase
df["text"].str.title() # Title Case
df["text"].str.capitalize() # First character uppercase, everything else lowercase
Cleaning up whitespace:
df["text"].str.strip() # both ends
df["text"].str.lstrip() # left side only
df["text"].str.rstrip() # right side only
Replacing text:
df["text"].str.replace("old", "new") # literal text (regex=False is the default in pandas 2.0+)
df["text"].str.replace(r"\d+", "", regex=True) # remove all digits
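Here is a quick sketch that runs the case, whitespace, and replace methods on a tiny column so you can see what each one returns. The sample strings are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
df = pd.DataFrame({"text": ["  hello WORLD  ", "order 66 shipped", "old news"]})

# Strip the surrounding whitespace first
cleaned = df["text"].str.strip()

# capitalize() also lowercases everything after the first character
capitalized = cleaned.str.capitalize()

# title() uppercases the first letter of every word
titled = cleaned.str.title()

# Literal replacement: regex=False makes the intent explicit
literal = cleaned.str.replace("old", "new", regex=False)

# Regex replacement: remove every run of digits
no_digits = cleaned.str.replace(r"\d+", "", regex=True)

for label, series in [("capitalized", capitalized), ("titled", titled),
                      ("literal", literal), ("no_digits", no_digits)]:
    print(label, series.tolist())
```

Note that removing the digits leaves the two surrounding spaces behind, so a second replace or a strip is often needed afterwards.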
searching with contains
This one's super handy for filtering. It returns True/False for each row.
# Find rows where name contains "son"
df[df["name"].str.contains("son")]
# Case insensitive
df[df["name"].str.contains("bob", case=False)]
# Use regex (regex=True is already the default for contains; shown here to be explicit)
df[df["email"].str.contains(r"@gmail\.com$", regex=True)]
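Watch out though: when the column has missing values, contains can return NaN for those rows instead of True/False, and filtering with that mask raises an error. Use na=False to treat missing values as non-matches: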
df[df["name"].str.contains("son", na=False)]
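A short sketch of the na=False pattern with a missing value in the column. The names are invented for illustration, and the exact value you get for the missing row without na=False may depend on your pandas version and the column's dtype.

```python
import pandas as pd

# Hypothetical sample data with one missing name
df = pd.DataFrame({"name": ["Johnson", None, "Bob Smith", "Anderson"]})

# na=False turns the missing row into a plain False
safe_mask = df["name"].str.contains("son", na=False)

# The mask is now purely boolean, so it can be used to filter
matches = df[safe_mask]

# Case-insensitive search combined with na=False
bob_mask = df["name"].str.contains("bob", case=False, na=False)

print(safe_mask.tolist())
print(matches["name"].tolist())
print(bob_mask.tolist())
```

Passing na=False every time you filter is a cheap habit that keeps the mask safe to use even when the data has gaps.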
extract
If you need to pull out specific parts of strings, extract uses regex groups.
# Extract the domain from emails
df["email"].str.extract(r"@(.+)")
# Extract area code from phone numbers
df["phone"].str.extract(r"\((\d{3})\)")
The parentheses in the regex define what gets captured, and rows that don't match the pattern come back as NaN. This part tripped me up at first but it makes sense once you see it.
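Here is a small sketch of the two extract patterns above on made-up data, including a row that doesn't match. The emails and phone numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
df = pd.DataFrame({
    "email": ["ann@gmail.com", "raj@example.org", "no-at-sign"],
    "phone": ["(262) 555-0100", "(414) 555-0199", "555-0123"],
})

# By default extract returns a DataFrame with one column per capture group
domains = df["email"].str.extract(r"@(.+)")

# With a single group, expand=False returns a Series instead
area_codes = df["phone"].str.extract(r"\((\d{3})\)", expand=False)

print(domains)
print(area_codes)
```

The unnamed group ends up in a column labeled 0. The last row in each column has nothing to capture, so it comes back as a missing value.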
splitting
# Split into a list
df["name"].str.split(" ")
# Split into separate columns
df["name"].str.split(" ", expand=True)
# Get just the first part
df["name"].str.split(" ").str[0]
That last one chains the str accessor twice: split produces a list for each row, and the second .str[0] picks the first item out of each list. A bit weird looking but it works.
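A quick sketch of the three split variants on made-up names, including one with no space in it. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
df = pd.DataFrame({"name": ["Ada Lovelace", "Grace Hopper", "Cher"]})

# Each row becomes a list of pieces
parts = df["name"].str.split(" ")

# expand=True spreads the pieces across columns 0, 1, ...
columns = df["name"].str.split(" ", expand=True)

# The second .str indexes into each list: [0] is the first piece, [-1] the last
first = df["name"].str.split(" ").str[0]
last = df["name"].str.split(" ").str[-1]

print(parts.tolist())
print(columns)
print(first.tolist(), last.tolist())
```

Rows with fewer pieces than the widest row get a missing value in the extra columns, which is what happens to "Cher" here.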
# examples [3]
Strip whitespace and standardize case; happens all the time with real data
Find all rows where a column contains certain text
Pull apart strings when you need specific pieces
# challenges [2]