[concept]Data Manipulation (WCTC)
Combining DataFrames
# theory
combining DataFrames
Real data is almost never in one table. You've got customers in one place, orders in another, products somewhere else. Combining them is like half the job.
merge
merge is the main one. It's basically SQL joins.
# Join on a common column
result = pd.merge(orders, customers, on="customer_id")
# Different column names in each table
result = pd.merge(orders, customers, left_on="cust_id", right_on="customer_id")
join types
# Inner join (default); only matching rows
pd.merge(df1, df2, on="id", how="inner")
# Left join; all rows from left, matching from right
pd.merge(df1, df2, on="id", how="left")
# Right join; all rows from right, matching from left
pd.merge(df1, df2, on="id", how="right")
# Outer join; all rows from both
pd.merge(df1, df2, on="id", how="outer")
Inner is the default and what you want most of the time. Left join is second most common, the "give me all my orders, and whatever customer info exists" pattern.
duplicate column names
If both DataFrames have a column with the same name (besides the join key), you get suffixes:
# By default you get _x and _y
pd.merge(df1, df2, on="id")
# Creates name_x and name_y
# Custom suffixes
pd.merge(df1, df2, on="id", suffixes=("_left", "_right"))
concat
Use concat when you want to stack DataFrames on top of each other (or side by side).
# Stack vertically (add more rows)
combined = pd.concat([df1, df2])
# Stack horizontally (add more columns)
combined = pd.concat([df1, df2], axis=1)
Watch the index though; by default it keeps the original indexes. Use ignore_index=True to reset:
combined = pd.concat([df1, df2], ignore_index=True)
join (index-based)
join is basically merge but uses the index by default.
# Join on index
df1.join(df2)
# Join df1's index to df2's column
df1.join(df2.set_index("key"), on="key")
merge gets used more often than join because it's more explicit about what's happening.
# examples [3]
See the difference; inner drops unmatched, left keeps all from left side
Stack data from multiple sources together
Sometimes you need to match on more than one column
# challenges [2]