You have a CSV file you need to audit. Enterprise profiling tools exist, but they're priced for data engineering teams and can take days to set up. You need a quality check on this file today. Here's how to do it without enterprise software.
What You're Trying to Learn
Before choosing a method, clarify what you need to know about the CSV:
- Which columns are mostly empty (completeness)?
- Are there duplicate rows or duplicate values in key fields (uniqueness)?
- Do columns have consistent formats (validity and conformity)?
- What are the most common values (distribution)?
- Does the file contain personal data (PII)?
The method you choose depends on how much of this you need and how quickly.
Option 1: Browser-Based Profiling Tools (Fastest, No Setup)
Open your CSV in a browser-based data quality tool and get an instant profile of every column: completeness rates, distinct value counts, format patterns, uniqueness scores, and PII detection. Tools that process the file locally in the browser never send it off your machine, and a basic profile typically requires no account.
Sohovi lets you upload your CSV and get an instant data quality report — no setup, no code required. This is the fastest option for non-technical users and for any CSV under a few hundred thousand rows.
Option 2: Excel or Google Sheets (Manual, No Additional Software)
For a small CSV (under 50,000 rows):
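- Completeness: Use COUNTBLANK() to count empty cells per column. For data in rows 2 to 10,001, =COUNTBLANK(A2:A10001)/ROWS(A2:A10001) gives you the null rate. Avoid whole-column references like A:A here, because they also count the empty cells below your data.
- Duplicates: Use COUNTIF to flag repeated values (for example, =COUNTIF(A:A, A2)>1 in a helper column). Remove Duplicates (Data tab) deletes repeats rather than reporting them, so run it on a copy.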
- Distribution: Use a pivot table on a categorical column to see value frequencies.
- Min/Max: Use MIN() and MAX() on numeric columns to check ranges.
- Format check: Use conditional formatting to highlight cells that don't match a pattern.
This works but is time-consuming and doesn't scale to large files. For a 10,000-row file with 20 columns, expect 1–2 hours for a thorough manual profile.
Option 3: Python with pandas (Powerful, Requires Basic Coding)
The pandas library's describe() and info() methods provide an instant statistical profile:
import pandas as pd
df = pd.read_csv('data.csv')
df.info() # prints column types and non-null counts; it returns None, so don't wrap it in print()
print(df.describe()) # min, max, mean, std for numeric columns
print(df.isna().sum()) # null count per column
print(df.nunique()) # distinct value count per column
print(df.duplicated().sum()) # total duplicate rows
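The calls above report counts rather than rates. As a minimal sketch, a per-column summary of null rate and uniqueness could look like the following. The function name `summarize_columns` and the sample frame are illustrative assumptions, not part of pandas.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its null rate and distinct-value stats."""
    non_null = df.notna().sum()
    distinct = df.nunique()  # NaN is excluded by default
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_rate": df.isna().mean(),
        "distinct": distinct,
        # Share of non-null values that are distinct; 1.0 means no repeats.
        # Columns that are entirely null get NaN instead of a division error.
        "uniqueness": distinct / non_null.where(non_null > 0),
    })

# Tiny illustrative frame; in practice use pd.read_csv('data.csv').
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "status": ["open", "open", None, "closed"],
})
summary = summarize_columns(sample)
print(summary)
```

Sorting the result by `null_rate` is a quick way to bring the emptiest columns to the top.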
For non-numeric columns:
print(df['status'].value_counts()) # frequency of each value
print(df['email'].str.contains('@', na=False).sum()) # count values containing '@'; missing values count as non-matches
This takes about 10 minutes to write and produces a complete profile in seconds. It's the right tool if you're comfortable with Python basics.
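For the format and PII questions, one possible approach is to measure how many values in each text column match a known pattern. The sketch below assumes two deliberately loose, hypothetical regular expressions for emails and phone numbers. They are illustrations to tune for your data, not a reliable PII detector.

```python
import re
import pandas as pd

# Hypothetical, intentionally loose patterns; adjust them for your data.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
}

def pattern_match_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Share of non-null values in each text column that fully match each pattern."""
    rows = {}
    # Skip numeric, boolean, and datetime columns; profile the rest as text.
    text_columns = df.select_dtypes(exclude=["number", "bool", "datetime"])
    for col in text_columns:
        values = text_columns[col].dropna().astype(str).str.strip()
        if values.empty:
            continue
        rows[col] = {
            name: values.map(lambda v, p=pattern: bool(p.fullmatch(v))).mean()
            for name, pattern in PATTERNS.items()
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Illustrative data only.
sample = pd.DataFrame({
    "contact": ["a@example.com", "b@example.org", "not-an-email", None],
    "notes": ["call back", "+1 555 010 9999", "n/a", "ok"],
    "amount": [1.0, 2.0, 3.0, 4.0],
})
rates = pattern_match_rates(sample)
print(rates)
```

A high match rate in a column you expected to hold emails points to a consistent format. A nonzero rate in a free-text column such as `notes` suggests personal data may be hiding where you did not expect it.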
A Quick Profile Checklist
Regardless of method, run through these checks:
- Row count: How many rows? Does it match what was expected?
- Null rate per column: Which columns have the most missing values?
- Unique value count per column: Which columns have unexpectedly few or many distinct values?
- Top values per categorical column: Are there obvious typos, variants, or unexpected categories?
- Min/Max for numeric columns: Are the ranges plausible?
- Min/Max for date columns: Are all dates in a plausible range?
- Duplicate row count: How many rows are exact duplicates?
This checklist takes under 30 minutes with a browser-based tool or pandas (allow longer in Excel for wide files) and gives you a solid baseline understanding of the file's quality.
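If you use pandas, the checklist can be gathered into one report. This is a rough sketch under stated assumptions: `run_checklist` and its `date_columns` parameter are hypothetical names, and you pass in the columns you know hold dates.

```python
import pandas as pd

def run_checklist(df: pd.DataFrame, date_columns=()) -> dict:
    """Collect the checklist metrics into a single dictionary."""
    report = {
        "row_count": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "null_rate": df.isna().mean().sort_values(ascending=False).to_dict(),
        "distinct": df.nunique().to_dict(),
        "ranges": {},      # (min, max) for numeric and date columns
        "top_values": {},  # five most frequent values for other columns
    }
    for col in df.columns:
        if col in date_columns:
            # Unparseable dates become NaT instead of raising an error.
            parsed = pd.to_datetime(df[col], errors="coerce")
            report["ranges"][col] = (parsed.min(), parsed.max())
        elif pd.api.types.is_numeric_dtype(df[col]):
            report["ranges"][col] = (df[col].min(), df[col].max())
        else:
            report["top_values"][col] = df[col].value_counts().head(5).to_dict()
    return report

# Illustrative data with a duplicate row, a casing variant, and an odd date.
sample = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [10.0, 25.5, 25.5, -4.0],
    "status": ["open", "Open", "Open", "closed"],
    "created": ["2024-01-05", "2024-02-10", "2024-02-10", "1900-01-01"],
})
report = run_checklist(sample, date_columns=["created"])
for key, value in report.items():
    print(key, value)
```

In this sample the report surfaces one duplicate row, a negative amount, the "open"/"Open" variant, and a date in 1900, each of which maps to one checklist item.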
When to Use Each Method
| File size | Technical level | Recommendation |
|---|---|---|
| Any size | Non-technical | Browser-based tool |
| < 50,000 rows | Excel user | Excel formulas and pivot tables |
| > 50,000 rows | Basic Python | pandas profile script |
| Recurring task | Any | Automate with pandas or a profiling library |
What to Do After Profiling
The profile tells you what's wrong. The next step is to prioritize which problems to fix:
- Identify the columns with the highest business impact (the ones used for decisions, reports, and campaigns)
- For those columns, fix the issues that most affect their usability
- Document what you found and fixed — this is your data quality record for this file
Don't try to fix everything. Fix the things that affect your ability to use the data for its intended purpose.
Reading the Profile Results
Once you have a profile, here's how to interpret the most important metrics:
A column with 0% null rate and 100% uniqueness is likely a primary key: every record has a value and no values repeat. A column with a 40% null rate is missing this field in two out of every five records. A column where the top value appears in 80% of records might be a poorly used categorical field, a defaulted value, or a legitimate concentration.
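These rules of thumb can be applied mechanically. The sketch below is one way to flag candidate keys and dominant values. It assumes a hypothetical `flag_columns` helper and uses the 80% figure from above as a default threshold you may want to change.

```python
import pandas as pd

def flag_columns(df: pd.DataFrame, dominant_share: float = 0.8) -> dict:
    """Flag columns that look like keys or are dominated by a single value."""
    flags = {}
    for col in df.columns:
        series = df[col]
        notes = []
        # No nulls and no repeats: behaves like a primary key.
        if len(series) > 0 and series.notna().all() and series.is_unique:
            notes.append("candidate key")
        counts = series.value_counts()  # most frequent value first, NaN excluded
        # Share is measured against all records, including those with nulls.
        if not counts.empty and counts.iloc[0] / len(series) >= dominant_share:
            notes.append(f"dominant value {counts.index[0]!r}")
        if notes:
            flags[col] = notes
    return flags

# Illustrative data only.
sample = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "country": ["US", "US", "US", "US", "DE"],
    "email": ["a@x.com", None, "c@x.com", "d@x.com", "e@x.com"],
})
flags = flag_columns(sample)
print(flags)
```

A flag is a prompt to look closer, not a verdict: whether a dominant value is a default or a legitimate concentration still takes knowledge of the data.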
The most important insight usually comes from comparing what you expected to what the profile shows. If you expected email to be 100% complete and it's 78% complete, you have a data collection problem. If you expected status to have 4 values and it has 23, you have a standardization problem. Profile the data, then ask: does this match what I believed about it?
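Writing your expectations down before you profile makes that comparison explicit. The following is a sketch only: the `EXPECTATIONS` dictionary, its rule names, and the thresholds are hypothetical and mirror the email and status examples above.

```python
import pandas as pd

# Hypothetical expectations; column names and thresholds are illustrative.
EXPECTATIONS = {
    "email": {"min_completeness": 1.0},
    "status": {"max_distinct": 4},
}

def compare_to_expectations(df: pd.DataFrame, expectations: dict) -> list:
    """Return a readable finding for every expectation the data fails."""
    findings = []
    for col, rules in expectations.items():
        if col not in df.columns:
            findings.append(f"{col}: column missing")
            continue
        completeness = df[col].notna().mean()
        distinct = df[col].nunique()
        minimum = rules.get("min_completeness")
        if minimum is not None and completeness < minimum:
            findings.append(
                f"{col}: completeness {completeness:.0%}, expected at least {minimum:.0%}"
            )
        maximum = rules.get("max_distinct")
        if maximum is not None and distinct > maximum:
            findings.append(
                f"{col}: {distinct} distinct values, expected at most {maximum}"
            )
    return findings

# Illustrative data: one missing email and five spellings of status.
sample = pd.DataFrame({
    "email": ["a@x.com", None, "c@x.com", "d@x.com", "e@x.com"],
    "status": ["open", "Open", "OPEN", "closed", "Closed"],
})
findings = compare_to_expectations(sample, EXPECTATIONS)
for finding in findings:
    print(finding)
```

An empty list means the file matched what you believed about it. Anything else is a starting point for the prioritization step.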
If you're ready to stop guessing about your data quality, Sohovi is built for exactly this. Upload your first CSV free — no credit card, no IT team, no code needed.