Data Quality Glossary

What Is Data Cleansing?

Data cleansing is the process of identifying and correcting errors in a dataset — removing duplicates, fixing formats, filling missing values, and standardizing inconsistencies.

Data cleansing (also called data cleaning or data scrubbing) is the process of detecting and correcting inaccurate, incomplete, inconsistent, or incorrectly formatted records in a dataset — making the data accurate, complete, and consistent enough for its intended use.

Data cleansing is remediation: you take data that has quality problems and fix them. It complements data quality measurement (which identifies what's wrong) and prevention (which stops new problems from entering).

What Data Cleansing Fixes

Duplicates: Identifying and merging records that represent the same real-world entity — two customer records with the same email address, or a vendor who appears under three different name variations.

Missing values: Filling in fields that are empty — either from other sources (enrichment), from inference (a country code inferred from a phone number), or from a documented default.

Invalid formats: Correcting values that don't match the expected pattern — phone numbers with letters, dates in the wrong format, email addresses missing the "@" symbol.

Inconsistencies: Standardizing values that are expressed differently across records — "NY" vs. "New York" vs. "new york," or "Active" vs. "active" vs. "ACTIVE."

Outliers: Investigating and correcting values that are implausible or outright impossible, such as a customer age of 847, a price of -$5,000, or a transaction date in 1850.

Structural problems: Fixing encoding issues that produce garbled characters, correcting field mapping errors from imports, and addressing schema mismatches.
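As a rough illustration of three of the categories above (inconsistencies, duplicates, and outliers), here is a minimal Python sketch using only the standard library. The sample records, the field names, the state mapping, and the 0 to 120 age range are all assumptions made up for this example, not rules from any particular tool.

```python
from collections import defaultdict

# Hypothetical sample records; field names are assumptions for illustration.
records = [
    {"id": 1, "email": "Ana@Example.com ", "state": "NY", "age": 34},
    {"id": 2, "email": "ana@example.com", "state": "new york", "age": 34},
    {"id": 3, "email": "bob@example.com", "state": "New York", "age": 847},
]

# Assumed mapping from lowercased variants to one canonical value.
STATE_CANONICAL = {"ny": "NY", "new york": "NY"}


def normalize(record):
    """Return a cleaned copy; the original record is left untouched."""
    cleaned = dict(record)
    cleaned["email"] = record["email"].strip().lower()
    state_key = record["state"].strip().lower()
    # Unknown states are kept as-is rather than guessed.
    cleaned["state"] = STATE_CANONICAL.get(state_key, record["state"])
    return cleaned


def find_duplicates(rows, key="email"):
    """Group record ids by a key and keep only keys shared by several records."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


def implausible_ages(rows, low=0, high=120):
    """Return ids of records whose age falls outside an assumed plausible range."""
    return [row["id"] for row in rows if not low <= row["age"] <= high]


cleaned = [normalize(r) for r in records]
duplicates = find_duplicates(cleaned)   # candidates to review and merge
outliers = implausible_ages(cleaned)    # candidates to investigate
```

Note that the sketch only flags duplicates and outliers. Deciding which record to keep, or what the correct age is, is left to a reviewer.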

Data Cleansing vs. Data Quality

These terms are related but distinct. Data quality describes how fit data is for its intended use; assessing it produces a score, a report, or a list of issues. Data cleansing is the remediation: the actual work of fixing those issues. You measure quality first, then cleanse.

[IMAGE: Before and after showing a dataset with duplicates, wrong formats, and missing values — and the same dataset after cleansing]

Frequently Asked Questions

Q: What is data cleansing? Data cleansing is the process of identifying and correcting errors, inconsistencies, and quality problems in a dataset. It includes removing duplicates, fixing format errors, filling missing values, standardizing inconsistent representations, and correcting outliers.

Q: What is the difference between data cleansing and data quality? Data quality describes how fit data is for its purpose, and assessing it identifies what's wrong. Data cleansing is the remediation: fixing what's wrong. Data quality assessment comes first and informs what the cleansing effort should address.

Q: Is data cleansing a one-time project or ongoing work? Usually both. A one-time remediation project cleans existing bad data. Ongoing maintenance prevents new bad data from accumulating. If you skip the ongoing part, you clean the data once and watch the same problems return within months.

Q: What is the most time-consuming part of data cleansing? Discovery and decision-making — determining what rules to apply and how to handle edge cases. The actual mechanical transformation is often faster than deciding what the correct value should be for ambiguous cases.

Q: Can data cleansing be automated? Repetitive, rule-based cleansing can be automated — standardizing date formats, removing leading/trailing whitespace, converting categorical values to canonical forms. Judgment-intensive cleansing (resolving conflicting values, determining which of two duplicate records to keep) requires human review.
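The rule-based kind of cleansing described above might look like the following Python sketch. The list of accepted date formats and the status mapping are assumptions for illustration; in practice they would come from what profiling finds in your own data. Values that match no rule are returned as None so a person can review them instead of the code guessing.

```python
from datetime import datetime

# Assumed input formats. Note that "03/05/2024" is read as month/day here;
# that is a choice this sketch makes, not something the data can tell you.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

# Assumed canonical forms for a categorical field.
STATUS_CANONICAL = {"active": "Active", "inactive": "Inactive"}


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = value.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: leave for human review


def standardize_status(value):
    """Trim whitespace, ignore case, and map to the canonical form (or None)."""
    return STATUS_CANONICAL.get(value.strip().lower())
```

For example, `standardize_status(" ACTIVE ")` gives `"Active"`, while an unrecognized value such as `"pending"` gives `None` and is left for a reviewer.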

Q: What should I do before cleansing data? Profile it first. A data quality profile shows you what problems exist, how widespread they are, and where they're concentrated. Cleansing without profiling means guessing at what to fix.
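A very small profile can be computed by hand. The Python sketch below is one possible shape for it, assuming rows are dictionaries and that empty strings count as missing; the function name `profile` and the report fields are hypothetical choices for this example.

```python
from collections import Counter


def profile(rows, columns):
    """Summarize each column: missing count, completeness, distinct and top values."""
    report = {}
    total = len(rows)
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and blank strings as missing (an assumption of this sketch).
        present = [v for v in values if v is not None and str(v).strip() != ""]
        report[col] = {
            "missing": total - len(present),
            "completeness": len(present) / total if total else 0.0,
            "distinct": len(set(present)),
            "top_values": Counter(present).most_common(3),
        }
    return report


# Hypothetical sample data.
rows = [
    {"email": "a@x.com", "status": "Active"},
    {"email": "", "status": "active"},
    {"email": None, "status": "ACTIVE"},
    {"email": "b@x.com", "status": "Active"},
]
report = profile(rows, ["email", "status"])
```

Here the report would show that half the email values are missing and that the status column holds three distinct spellings of what is probably one value, which tells you where cleansing effort should go.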

Q: How do I know when data is clean enough? Define your quality thresholds before you start: "email field must be 98% complete and 99% valid." Cleanse until you meet those thresholds. Perfect data is rarely necessary and is expensive to achieve.
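A threshold like the one quoted above can be checked mechanically. The Python sketch below assumes that completeness is the share of non-empty values and that validity is measured among the non-empty values only; both definitions, and the deliberately simple email pattern, are assumptions you would adapt to your own rules.

```python
import re

# Intentionally simple pattern for illustration; real email validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def email_quality(values):
    """Return (completeness, validity) for a list of email values."""
    total = len(values)
    present = [v.strip() for v in values if v is not None and v.strip()]
    valid = [v for v in present if EMAIL_PATTERN.match(v)]
    completeness = len(present) / total if total else 0.0
    validity = len(valid) / len(present) if present else 0.0
    return completeness, validity


def meets_thresholds(values, min_complete=0.98, min_valid=0.99):
    """True when both rates reach the thresholds defined before cleansing began."""
    completeness, validity = email_quality(values)
    return completeness >= min_complete and validity >= min_valid
```

Running `meets_thresholds` before and after each cleansing pass gives a concrete stopping point instead of an open-ended pursuit of perfect data.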

Q: Does data cleansing change the original data? It can — but best practice is to work on a copy, document every transformation, and preserve the original. This allows you to verify results and reverse changes if needed.
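One way to follow that practice in code is to apply each transformation to a copy and record every value it changes. The Python sketch below is a minimal version of the idea; the helper name `apply_with_log` and the shape of the log entries are hypothetical.

```python
import copy


def apply_with_log(rows, column, transform, log):
    """Apply transform to one column on a deep copy and log every changed value."""
    working = copy.deepcopy(rows)  # the caller's rows are never modified
    for index, row in enumerate(working):
        before = row[column]
        after = transform(before)
        if after != before:
            row[column] = after
            log.append(
                {"row": index, "column": column, "before": before, "after": after}
            )
    return working


# Hypothetical usage: trim whitespace from a name column.
original = [{"name": " Ana "}, {"name": "Bob"}]
change_log = []
cleaned = apply_with_log(original, "name", str.strip, change_log)
```

Because the log stores the value before and after each change, it doubles as documentation of the transformation and as the information needed to reverse it.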

Q: What tools are used for data cleansing? OpenRefine (free, file-based), Excel/Google Sheets (small datasets), Python pandas (programmatic, large datasets), and dedicated data quality platforms. Sohovi profiles your data instantly to show what needs to be cleaned — a fast first step before any cleansing work begins.

Q: What is the difference between data cleansing and data wrangling? Data wrangling is broader — it includes cleaning but also structural transformation (reshaping, pivoting), enrichment (adding fields from external sources), and validation. Data cleansing focuses specifically on correcting errors and inconsistencies.


Data cleansing turns bad data into trustworthy data. Profile first to understand the problems, then cleanse systematically — and put prevention in place so you don't have to do it all over again next quarter.

If you're ready to stop guessing about your data quality, Sohovi is built for exactly this. Upload your first CSV free — no credit card, no IT team, no code needed.

Sohovi Team

Data quality, for people who ship

The Sohovi team writes practical guides on data quality, profiling, and governance to help teams ship better data.
