The easy way to clean data in Python
Hey — Tom here.
Nothing kills an analysis faster than messy data.
You've got duplicates, missing values, inconsistent formats, and columns that make no sense.
Before you know it, you're spending 80% of your time wrestling with data instead of extracting insights from it.
Python makes data cleaning powerful, but most tutorials throw you into it without a clear, repeatable method.
Today, I'm sharing the systematic approach I use to clean any dataset efficiently.
Here's what we're covering:
- The 4-step data cleaning framework that works every time
- Essential Python tools for each step
- Common data quality issues and their fixes
- When to clean vs when to flag (this distinction matters)
Let's get into it...
The 4-Step Data Cleaning Framework
Step 1: Understand → What you're working with
Step 2: Standardise → Make formats consistent
Step 3: Validate → Check for quality issues
Step 4: Document → Track what you changed and why
This isn't about memorising 50 pandas functions.
It's about having a systematic approach that catches problems before they seep into your analysis or, worse, get called out by your stakeholders.
Step 1: Understand Your Data
Before you touch anything, you need to know what you're dealing with.
Start with the basics:
# Get the lay of the land (assumes import pandas as pd and a DataFrame df already loaded, e.g. via pd.read_csv)
df.info()
df.describe()
df.head(10)
df.tail(10)
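head() and tail() only show you the edges of the file. A random sample often surfaces surprises the edges hide (a quick sketch):
# Random rows catch issues that head/tail miss
print(df.sample(10, random_state=42))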
Look for red flags:
- Columns with unexpected data types
- Text in numeric columns
- Dates stored as strings
- Missing values in critical fields
Example: your customer_age column shows up as object instead of numeric? That's telling you there's text mixed in with the numbers.
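One quick way to surface the offending entries (a sketch, using the hypothetical customer_age column from the example above):
# Rows where the value isn't a valid number but isn't missing either
not_numeric = pd.to_numeric(df['customer_age'], errors='coerce').isna() & df['customer_age'].notna()
print(df.loc[not_numeric, 'customer_age'].unique())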
Step 2: Standardise Formats
Fix data types first:
# Convert obvious numeric columns
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
# Handle dates properly
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
Standardise text data:
# Consistent case and trimming
df['product_name'] = df['product_name'].str.strip().str.title()
# Remove special characters if needed
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
The errors='coerce' parameter is your friend. It converts invalid entries to NaN instead of crashing your code.
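To see how much a coercion throws away, compare NaN counts before and after converting (a sketch — run it before overwriting the column):
# How many entries did errors='coerce' turn into NaN?
raw = df['revenue']  # the column before conversion
converted = pd.to_numeric(raw, errors='coerce')
print(f"Coerced {converted.isna().sum() - raw.isna().sum()} invalid entries to NaN")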
Step 3: Validate and Fix Quality Issues
Handle missing values strategically:
Don't just fill everything with 0 or drop all NaNs. Think about what makes sense.
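Before deciding, see where the gaps actually are (a quick sketch):
# Missing values per column, largest first
print(df.isnull().sum().sort_values(ascending=False))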
# Fill with meaningful defaults
df['country'] = df['country'].fillna('Unknown')
# Forward fill for time series (fillna(method='ffill') is deprecated in newer pandas)
df['price'] = df['price'].ffill()
# Drop rows where critical fields are missing
df = df.dropna(subset=['customer_id', 'order_date'])
Find and handle duplicates:
# Check for duplicates
duplicates = df.duplicated(subset=['customer_id', 'order_date'])
print(f"Found {duplicates.sum()} duplicates")
# Remove them (keeps the first occurrence by default)
df = df.drop_duplicates(subset=['customer_id', 'order_date'])
Spot outliers and inconsistencies:
# Quick outlier check for numeric columns
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['revenue'] < Q1 - 1.5*IQR) | (df['revenue'] > Q3 + 1.5*IQR)]
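From there, count the outliers and eyeball the extremes before deciding whether they're errors or legitimate edge cases (sketch):
print(f"{len(outliers)} potential revenue outliers")
print(outliers['revenue'].describe())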
Step 4: Document Your Changes
This step separates professionals from amateurs.
Keep track of what you changed:
# Log transformations as you make them
cleaning_log = []
cleaning_log.append(f"Removed {duplicates.sum()} duplicate records")
# invalid_dates and cleaned_names are masks/counters you build during Steps 2 and 3
cleaning_log.append(f"Converted {invalid_dates.sum()} invalid dates to NaN")
cleaning_log.append(f"Standardised {cleaned_names.sum()} product names")
Create a validation summary:
print("Data Cleaning Summary:")
print(f"Original rows: {original_shape[0]}")
print(f"Final rows: {df.shape[0]}")
print(f"Columns processed: {df.shape[1]}")
print(f"Missing values remaining: {df.isnull().sum().sum()}")
When to Clean vs When to Flag
Clean when:
- Formatting is inconsistent (mixed case, extra spaces)
- Data types are wrong but convertible
- Missing values can be reasonably imputed
Flag when:
- Values are clearly wrong but you can't determine the correct value
- Outliers might be legitimate but unusual
- Business rules are violated
Create a separate data_quality_flags column instead of changing the original data.
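A minimal sketch of that pattern (the rules here are hypothetical — swap in your own business rules):
# Start with an empty flag column, then append labels as rules fire
df['data_quality_flags'] = ''
too_high = df['revenue'] > df['revenue'].quantile(0.99)  # hypothetical outlier rule
df.loc[too_high, 'data_quality_flags'] += 'possible_outlier;'
negative_revenue = df['revenue'] < 0  # hypothetical business-rule violation
df.loc[negative_revenue, 'data_quality_flags'] += 'negative_revenue;'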
The Tools You Actually Need
Core pandas functions:
- .info() and .describe() for exploration
- .fillna() and .dropna() for missing values
- .drop_duplicates() for duplicate handling
- .str accessor for text cleaning
- pd.to_datetime() and pd.to_numeric() for type conversion
That's it. You don't need 20 different approaches. Master these basics first.
TL;DR:
- Follow the 4-step framework: Understand → Standardise → Validate → Document
- Fix data types before anything else — wrong types cause 90% of downstream issues
- Handle missing values thoughtfully — don't just fill with zeros or drop everything
- Document your changes — future you (and other devs) will thank present you
- Flag suspicious data instead of guessing — preserve data integrity
The difference between knowing these concepts and being proficient comes down to practice with real, messy datasets.
This week, inside Premium you'll get:
- 3 messy (and cleaned) datasets — retail sales, user activity logs, and customer survey data with different quality issues so you can work through and check against the clean versions
- Python working notebook — ready-to-use code for common cleaning tasks you can drop into any project ready to run straight away
- Data cleaning checklist — my systematic approach for any new dataset from understanding to cleaning so you can refer to it anytime you come across new data that needs some attention
I've designed this week's resources around realistic, messy scenarios rather than "perfect" datasets, because perfect data isn't what you'll be faced with on the job.
And that’s just the start.
Join today and you'll get £400+ in ready-to-use resources completely FREE:
- Data Dictionary Builder (£99 value)
- Executive Data Storytelling Slide Deck (£149 value)
- Data Salary Negotiation Scripts (£199 value)
- Full access to the entire archive of past editions
Founding Member Offer
Join now and get 50% off for life — just £9.50/month.
Offer ends 26/09/2025, then the price doubles.
(Don’t forget to use code FOUNDERS2025 at checkout to lock in your lifetime deal.)
Join us below 👇🏻