The easy way to clean data in Python
Hey — Tom here.
Nothing kills an analysis faster than messy data.
You've got duplicates, missing values, inconsistent formats, and columns that make no sense.
Before you know it, you're spending 80% of your time wrestling with data instead of extracting insights from it.
Python makes data cleaning powerful, but most tutorials throw you into it without a clear, repeatable method.
Today, I'm sharing the systematic approach I use to clean any dataset efficiently.
Here's what we're covering:
- The 4-step data cleaning framework that works every time
- Essential Python tools for each step
- Common data quality issues and their fixes
- When to clean vs when to flag (this distinction matters)
Let's get into it...
The 4-Step Data Cleaning Framework
Step 1: Understand → What you're working with
Step 2: Standardise → Make formats consistent
Step 3: Validate → Check for quality issues
Step 4: Document → Track what you changed and why
This isn't about memorising 50 pandas functions.
It's about having a systematic approach that catches problems before they seep into your analysis or, worse, get called out by your stakeholders.
Step 1: Understand Your Data
Before you touch anything, you need to know what you're dealing with.
Start with the basics:
# Get the lay of the land (assumes import pandas as pd and a DataFrame df already loaded, e.g. via pd.read_csv)
df.info()
df.describe()
df.head(10)
df.tail(10)
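head() and tail() only show you the edges of the file. A random sample often surfaces surprises the edges hide (a quick sketch):
# Random rows catch issues that head/tail miss
print(df.sample(10, random_state=42))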
Look for red flags:
- Columns with unexpected data types
- Text in numeric columns
- Dates stored as strings
- Missing values in critical fields
Example: your customer_age column shows up as object instead of numeric? That's telling you there's text mixed in with the numbers.
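One quick way to surface the offending entries (a sketch, using the hypothetical customer_age column from the example above):
# Rows where the value isn't a valid number but isn't missing either
not_numeric = pd.to_numeric(df['customer_age'], errors='coerce').isna() & df['customer_age'].notna()
print(df.loc[not_numeric, 'customer_age'].unique())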
Step 2: Standardise Formats
Fix data types first:
# Convert obvious numeric columns
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
# Handle dates properly
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
Standardise text data:
# Consistent case and trimming
df['product_name'] = df['product_name'].str.strip().str.title()
# Remove special characters if needed
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
The errors='coerce' parameter is your friend. It converts invalid entries to NaN instead of crashing your code.
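To see how much a coercion throws away, compare NaN counts before and after converting (a sketch — run it before overwriting the column):
# How many entries did errors='coerce' turn into NaN?
raw = df['revenue']  # the column before conversion
converted = pd.to_numeric(raw, errors='coerce')
print(f"Coerced {converted.isna().sum() - raw.isna().sum()} invalid entries to NaN")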
Step 3: Validate and Fix Quality Issues
Handle missing values strategically:
Don't just fill everything with 0 or drop all NaNs. Think about what makes sense.
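Before deciding, see where the gaps actually are (a quick sketch):
# Missing values per column, largest first
print(df.isnull().sum().sort_values(ascending=False))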
# Fill with meaningful defaults
df['country'] = df['country'].fillna('Unknown')
# Forward fill for time series (fillna(method='ffill') is deprecated in newer pandas)
df['price'] = df['price'].ffill()
# Drop rows where critical fields are missing
df = df.dropna(subset=['customer_id', 'order_date'])
Find and handle duplicates:
# Check for duplicates
duplicates = df.duplicated(subset=['customer_id', 'order_date'])
print(f"Found {duplicates.sum()} duplicates")
# Remove them (keeps the first occurrence by default)
df = df.drop_duplicates(subset=['customer_id', 'order_date'])
Spot outliers and inconsistencies:
# Quick outlier check for numeric columns
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['revenue'] < Q1 - 1.5*IQR) | (df['revenue'] > Q3 + 1.5*IQR)]
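From there, count the outliers and eyeball the extremes before deciding whether they're errors or legitimate edge cases (sketch):
print(f"{len(outliers)} potential revenue outliers")
print(outliers['revenue'].describe())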
Step 4: Document Your Changes
This step separates professionals from amateurs.
Keep track of what you changed:
# Log transformations as you make them
cleaning_log = []
cleaning_log.append(f"Removed {duplicates.sum()} duplicate records")
# invalid_dates and cleaned_names are masks/counters you build during Steps 2 and 3
cleaning_log.append(f"Converted {invalid_dates.sum()} invalid dates to NaN")
cleaning_log.append(f"Standardised {cleaned_names.sum()} product names")
Create a validation summary:
print("Data Cleaning Summary:")
print(f"Original rows: {original_shape[0]}")
print(f"Final rows: {df.shape[0]}")
print(f"Columns processed: {df.shape[1]}")
print(f"Missing values remaining: {df.isnull().sum().sum()}")
When to Clean vs When to Flag
Clean when:
- Formatting is inconsistent (mixed case, extra spaces)
- Data types are wrong but convertible
- Missing values can be reasonably imputed
Flag when:
- Values are clearly wrong but you can't determine the correct value
- Outliers might be legitimate but unusual
- Business rules are violated
Create a separate data_quality_flags column instead of changing the original data.
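A minimal sketch of that pattern (the rules here are hypothetical — swap in your own business rules):
# Start with an empty flag column, then append labels as rules fire
df['data_quality_flags'] = ''
too_high = df['revenue'] > df['revenue'].quantile(0.99)  # hypothetical outlier rule
df.loc[too_high, 'data_quality_flags'] += 'possible_outlier;'
negative_revenue = df['revenue'] < 0  # hypothetical business-rule violation
df.loc[negative_revenue, 'data_quality_flags'] += 'negative_revenue;'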
The Tools You Actually Need
Core pandas functions:
- .info() and .describe() for exploration
- .fillna() and .dropna() for missing values
- .drop_duplicates() for duplicate handling
- .str accessor for text cleaning
- pd.to_datetime() and pd.to_numeric() for type conversion
That's it. You don't need 20 different approaches. Master these basics first.
TL;DR:
- Follow the 4-step framework: Understand → Standardise → Validate → Document
- Fix data types before anything else — wrong types cause 90% of downstream issues
- Handle missing values thoughtfully — don't just fill with zeros or drop everything
- Document your changes — future you (and other devs) will thank present you
- Flag suspicious data instead of guessing — preserve data integrity
The difference between knowing these concepts and being proficient comes down to practice with real, messy datasets.
This week, inside Premium you'll get:
- 3 messy (and cleaned) datasets — retail sales, user activity logs, and customer survey data with different quality issues so you can work through and check against the clean versions
- Python working notebook — ready-to-use code for common cleaning tasks you can drop into any project ready to run straight away
- Data cleaning checklist — my systematic approach for any new dataset from understanding to cleaning so you can refer to it anytime you come across new data that needs some attention
I've designed this week's resources around realistic, messy scenarios rather than "perfect" datasets, because perfect data isn't what you'll be faced with on the job.
And that’s just the start.
Join today and you'll get £400+ in ready-to-use resources completely FREE:
- Data Dictionary Builder (£99 value)
- Executive Data Storytelling Slide Deck (£149 value)
- Data Salary Negotiation Scripts (£199 value)
- Full access to the entire archive of past editions
Founding Member Offer
Join now and get 50% off for life — just £9.50/month.
Offer ends 26/09/2025, then the price doubles.
(Don’t forget to use code FOUNDERS2025 at checkout to lock in your lifetime deal.)
Join us below 👇🏻