
PII Detection: How to Find Personal Information Hidden in Your Datasets

Personal data often hides in unexpected columns of CSV files. PII detection scans for names, emails, phone numbers, SSNs, and more before a compliance issue arises.

You received a vendor export. It was described as "product inventory data." Three columns into the profile, you find a column full of customer email addresses. Another has what looks like partial credit card numbers. No one flagged it as sensitive — because no one looked. That's the hidden PII problem.

PII (Personally Identifiable Information) detection is the process of scanning a dataset to identify columns that contain personal information — names, emails, phone numbers, social security numbers, addresses, dates of birth, and other data that could identify an individual.

Why PII Hides in Unexpected Places

Datasets are assembled, exported, and shared without systematic review. A "sales transactions" export that was supposed to contain only order IDs and amounts might also include a customer name column that the person exporting didn't notice. Vendor-supplied files often include more personal data than necessary. Legacy datasets accumulate PII from systems that no longer exist.

GDPR and CCPA both impose requirements on how PII is handled — including requirements that you know what personal data you hold. Discovering PII in a dataset after a breach is significantly worse than discovering it during a routine profile.

What PII Detection Looks For

Obvious PII by column name: Columns named "email", "phone", "ssn", "dob", "first_name", or "last_name" are strong signals. But column names are often misleading or abbreviated, so a name check alone is not enough.

Pattern-based detection: Values matching an email format, phone number patterns, SSN patterns (XXX-XX-XXXX), credit card patterns, or IP address formats suggest PII regardless of column name.

Named entity detection: More sophisticated detection identifies first and last name patterns, address components (street numbers, directional prefixes, city names), and date of birth patterns in free-text fields.

Combination detection: Columns that are non-PII individually can become PII in combination. ZIP code + birthdate + gender is often uniquely identifying. A good PII assessment considers combinations, not just individual columns.
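One rough way to check the combination risk is to measure how many rows become unique once several columns are read together. The sketch below assumes the data is already loaded as a list of dictionaries; the function name `unique_fraction` and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter


def unique_fraction(rows, columns):
    """Return the share of rows whose values in `columns` occur exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row.get(col) for col in columns) for row in rows]
    counts = Counter(keys)
    singletons = sum(1 for key in keys if counts[key] == 1)
    return singletons / len(rows)


# Hypothetical sample data: no single column pins down every person.
rows = [
    {"zip": "02139", "birthdate": "1985-03-14", "gender": "F"},
    {"zip": "02139", "birthdate": "1985-03-14", "gender": "M"},
    {"zip": "02139", "birthdate": "1990-07-01", "gender": "F"},
    {"zip": "94105", "birthdate": "1990-07-01", "gender": "F"},
]

print(unique_fraction(rows, ["zip"]))                         # 0.25
print(unique_fraction(rows, ["zip", "birthdate", "gender"]))  # 1.0
```

A result close to 1.0 for a set of columns suggests that the set should be treated as identifying, even when each column looks harmless on its own.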

How to Run a PII Scan

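Browser-based tools: Open your file in the tool and receive a column-level PII assessment. Tools that process the file locally in the browser check column names and value patterns without transmitting your data to external servers, which is critical for files that may already contain sensitive information. Confirm that a tool works this way before loading a sensitive file.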

Python with regex: Write pattern-matching rules for each PII type and scan every column. This gives complete control but requires coding.

Manual inspection: For small files (under 500 rows), scroll through the data and look for patterns. Useful as a final check but impractical for large datasets.
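The Python-with-regex option can be sketched as follows. This is a minimal illustration, not a complete detector: the patterns are simplified and US-centric, and the names `NAME_HINTS`, `VALUE_PATTERNS`, `scan_csv`, and the 0.8 threshold are assumptions chosen for the example. A column is flagged when its name looks personal or when most of its non-empty values fully match a pattern.

```python
import csv
import io
import re

# Column-name hints (illustrative, not exhaustive).
NAME_HINTS = re.compile(
    r"(e-?mail|phone|ssn|dob|birth|first_?name|last_?name|address)", re.I
)

# Simplified value patterns; real-world formats vary more than this.
VALUE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"(\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "ipv4": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
}


def scan_csv(text, threshold=0.8):
    """Map each suspicious column name to the list of reasons it was flagged."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            value = (row.get(name) or "").strip()
            if value:  # ignore empty cells
                columns[name].append(value)

    findings = {}
    for name, values in columns.items():
        labels = []
        if NAME_HINTS.search(name):
            labels.append("name_hint")
        for label, pattern in VALUE_PATTERNS.items():
            hits = sum(1 for v in values if pattern.fullmatch(v))
            if values and hits / len(values) >= threshold:
                labels.append(label)
        if labels:
            findings[name] = labels
    return findings


# Hypothetical "inventory" export with personal data in vaguely named columns.
sample = (
    "sku,contact,ref,dob\n"
    "A-100,ana@example.com,123-45-6789,1985-03-14\n"
    "A-101,bo@example.org,987-65-4321,1990-07-01\n"
    "A-102,cy@example.net,555-12-3456,1978-11-30\n"
)

print(scan_csv(sample))
# {'contact': ['email'], 'ref': ['ssn'], 'dob': ['name_hint']}
```

Pattern matches are only candidates. A ten-digit order number will match the phone pattern, for example, so flagged columns still deserve a quick human look.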

Sohovi lets you upload your CSV and get an instant data quality report — no setup, no code required. It automatically flags columns that may contain PII as part of the standard profile, so you know what personal data you're working with before you share or process the file.

What to Do When You Find Unexpected PII

In a file you received: Notify the sender. Don't process or share the file until you've confirmed the PII is either authorized or needs to be removed.

In a file you generated: Determine whether the PII needs to be there. If not, remove the column before sharing. If it does need to be there, ensure appropriate access controls before sending.

In a legacy dataset: Document the finding. Assess whether the PII is still needed for any current purpose. If not, delete it. If it is needed, ensure it's properly secured and access-controlled.

Reducing PII Exposure in Regular Workflows

The best way to handle PII is to not have it when you don't need it:

Export only what you need: When generating reports or exports, include only the columns required for the task. Don't export the whole record "just in case."

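Anonymize before sharing: Replace identifying information with pseudonyms or hashes when the actual identity isn't needed for analysis. For example, use a customer ID instead of an email address, or an age range instead of a birthdate. Keep in mind that pseudonymized data can still count as personal data if it can be linked back to an individual.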

Set up access controls: PII-containing datasets should not be shared via email to broad distribution lists, stored in public cloud folders, or accessible to everyone in the organization. Limit access to those who need it.

Regular PII audits: Quarterly, scan your key data stores for PII that shouldn't be there. Systems accumulate data over time; regular cleanup prevents the accumulation from becoming a liability.
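As an illustration of replacing identifiers before sharing, the sketch below swaps an email for a keyed hash and a birthdate for an age range. The function names `pseudonymize` and `age_band` and the 16-character token length are assumptions for the example. A keyed hash (HMAC) is used instead of a plain hash because plain hashes of emails can be reversed by hashing a list of known addresses; the key must stay secret and out of the shared file.

```python
import hashlib
import hmac
from datetime import date


def pseudonymize(value, secret_key):
    """Return a stable token for `value` using HMAC-SHA256 with a secret key."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; keep more characters if collisions matter


def age_band(birthdate, today, width=10):
    """Return an age range such as '30-39' instead of the exact birthdate."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    age = today.year - birthdate.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical usage; in practice load the key from a secrets manager.
key = b"replace-with-a-real-secret"
token = pseudonymize("Ana@Example.com", key)
print(len(token))                                        # 16
print(age_band(date(1985, 3, 14), date(2024, 3, 13)))    # 30-39
```

The same input and key always yield the same token, so joins across files still work without exposing the underlying email address.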

PII Detection in Different File Types

PII can hide in any file type that contains data. While CSV files are the most common in business contexts, PII detection applies equally to:

Excel files: Same principles as CSV, but with the added complexity of multiple tabs. Each tab may have different data — and PII may only appear in one of them.

JSON files: API responses and database exports often contain PII nested within objects. A profile that checks only top-level column names misses nested JSON fields, so you need to inspect keys and value patterns at every level of the hierarchy.

PDF exports: Reports, invoices, and statements often contain PII, and it ends up in plain-text data once the PDF is parsed. OCR-processed PDFs are particularly risky because text that existed only as a scanned image becomes searchable and extractable.

Database dumps: Full database exports for backup or migration purposes often contain PII across dozens of tables. Always scan before sharing or uploading to any external service.

The principle is the same regardless of format: look at both the field names and the actual values before sharing or processing any dataset that came from a system that handles personal information.
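For the JSON case, inspecting values at every level can be sketched as a recursive walk over the parsed document. The function name `find_pii`, the path notation, and the two simplified patterns are assumptions for the example. The patterns use `search` and not a full match because PII in JSON often sits inside longer free-text strings.

```python
import json
import re

# Simplified patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_pii(node, path="$"):
    """Walk nested dicts and lists; return (path, label) for each suspicious string."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_pii(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_pii(item, f"{path}[{index}]"))
    elif isinstance(node, str):
        if EMAIL.search(node):
            hits.append((path, "email"))
        if SSN.search(node):
            hits.append((path, "ssn"))
    return hits  # numbers, booleans, and nulls are skipped


# Hypothetical API response with PII buried two and three levels deep.
payload = json.loads("""
{
  "order_id": "A-100",
  "items": [{"sku": "X1", "qty": 2}],
  "customer": {
    "contact": {"primary": "ana@example.com"},
    "notes": ["call after 5pm", "SSN on file: 123-45-6789"]
  }
}
""")

for hit in find_pii(payload):
    print(hit)
# ('$.customer.contact.primary', 'email')
# ('$.customer.notes[1]', 'ssn')
```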

If you're ready to stop guessing about your data quality, Sohovi is built for exactly this. Upload your first CSV free — no credit card, no IT team, no code needed.

Sohovi Team

Data quality, for people who ship

The Sohovi team writes practical guides on data quality, profiling, and governance to help teams ship better data.
