How to Set Data Quality Thresholds That Actually Make Sense

A data quality threshold is the minimum acceptable level of quality for a specific field or dataset — the point at which quality is good enough for its intended use, below which action is required.

Selva SantoshData quality, for people who ship

May 21, 20268 min read

A data quality threshold is the minimum acceptable level of quality for a specific field or dataset — the point at which quality is good enough for its intended use, below which action is required.

"Zero errors allowed" sounds rigorous but isn't practical. A customer email field at 99.5% completeness isn't the same risk profile as an optional demographic field at 70% completeness. Every field has a different acceptable quality level — and setting one "100% or failure" threshold for everything either produces constant false alarms or gets ignored entirely.

This guide shows you how to set thresholds that reflect real business risk rather than theoretical perfection.

Sohovi tracks quality trends across runs and alerts you when a metric — null rate, duplicate count, score — moves outside its normal range.

In this guide

Why thresholds matter and what happens without them
The three factors that determine the right threshold for any field
How to set thresholds for completeness, uniqueness, and validity
Industry benchmarks as starting points (not rules)
How to adjust thresholds based on experience

Why Thresholds Matter

Without defined thresholds, data quality monitoring produces one of two outcomes:

Everything fails: If you treat any data quality issue as a failure, your dashboards show constant red. Teams start ignoring alerts because "it's always red." The monitoring becomes noise.

Nothing fails: If you don't define what failure looks like, nothing is ever flagged. Quality drifts undetected until something breaks visibly.

Thresholds convert a continuous measurement into a meaningful signal: above the threshold is acceptable, below is actionable. That signal is only useful if the threshold is set correctly.

The Three Factors That Determine the Right Threshold

Factor 1: How the Field Is Used

A field used in a customer-facing communication (email, name, address) needs a much higher threshold than a field used only for internal analytics. A required field in a financial calculation needs a higher threshold than an optional enrichment field.

Ask: "What breaks if this field has bad values?" The severity of the answer directly informs the threshold.

Factor 2: The Cost of False Positives vs. False Negatives

A threshold that's too high triggers unnecessary alerts (false positives) — actions that cost time but weren't actually needed. A threshold that's too low allows real problems to pass undetected (false negatives) — problems that cause damage.

Set the threshold closer to the higher end for fields where problems are costly. Set it lower for fields where over-reaction is more disruptive than the problem itself.

Factor 3: Historical Baseline

The right threshold starts with your actual baseline. If your email completeness has consistently been 97–98% for the past year, a threshold of 95% gives you meaningful headroom. A threshold of 99% would trigger constant alerts for normal variation.

Setting Thresholds by Quality Dimension

Completeness Thresholds

| Field Use | Recommended Threshold | |---|---| | Customer-facing required field (email for campaigns) | ≥ 98% | | Sales/CRM identifier (phone for outreach) | ≥ 95% | | Internal analytics field (industry, company size) | ≥ 80% | | Optional enrichment field | ≥ 60% |

Rule of thumb: Start at 5 percentage points below your current average. If your email completeness has averaged 97.5%, set the threshold at 92.5%. This catches genuine declines without flagging normal variation.

Uniqueness Thresholds

| Entity Type | Recommended Threshold | |---|---| | Primary identifier (customer ID, order ID) | 100% unique | | Business-critical identifier (email in contact DB) | ≥ 99.5% unique | | Semi-controlled identifier (company name) | ≥ 97% unique |

For primary identifiers, the threshold must be 100%. A duplicate customer ID or order ID is not an acceptable error rate — it's a system integrity failure.

Validity Thresholds

| Field Type | Recommended Threshold | |---|---| | Customer-facing format (email, phone) | ≥ 99% valid | | Internal format (internal codes, IDs) | ≥ 98% valid | | Date fields | ≥ 97% valid | | Categorical/enum fields | ≥ 95% valid |

Using Industry Benchmarks as Starting Points

Some published benchmarks can serve as initial reference points:

Email hard bounce rate: Industry best practice is below 0.5%; above 2% is a problem
CRM data accuracy: Gartner and Forrester research suggests best-in-class CRM data quality maintains 95%+ contact accuracy
B2B contact data decay: Industry estimates put decay at ~30% per year, which informs timeliness thresholds

Important: These are starting points, not rules. Your actual threshold should reflect your specific data, usage, and risk tolerance — not an industry average.

How to Adjust Thresholds Based on Experience

Thresholds aren't set-and-forget. Adjust them when:

The threshold triggers too frequently: If you're getting an alert every week for minor variation around a healthy average, raise the threshold or add a hysteresis band (alert only if the threshold is breached for two consecutive periods).

The threshold never triggers but problems emerge: If quality is visibly degrading but no alerts are firing, your threshold is too low. Raise it.

Business requirements change: If you add a new use case for a field (you start using a field for segmentation that was previously just informational), revisit the threshold to reflect the new risk.

Frequently Asked Questions

Q: What is a data quality threshold? A data quality threshold is the minimum acceptable level of quality for a specific field or dataset — defined as a percentage for completeness and validity, or a count/percentage for uniqueness. When a metric falls below the threshold, action is required.

Q: Should all fields have the same data quality threshold? No. The right threshold depends on how the field is used, what breaks if values are wrong, and the historical baseline for that field. A customer email field used for campaign sends needs a much higher threshold than an internal notes field.

Q: What is a good data completeness threshold? For customer-facing required fields (email addresses used in campaigns), 98% or higher is appropriate. For internal analytics fields, 80% may be sufficient. The right threshold depends on the field's use case and the cost of having incomplete records.

Q: Is 100% data quality always the right goal? No. For most fields, 100% is unachievable without rejecting legitimate records that genuinely have no value for that field. Some customers don't have phone numbers. Some contacts don't have company affiliations. Setting 100% as a threshold for optional fields produces constant false alarms.

Q: What's the difference between a data quality threshold and a data quality standard? A data quality standard defines what "good" looks like (e.g., "email addresses must be valid format"). A threshold defines the minimum acceptable percentage of records meeting that standard (e.g., "at least 99% of email addresses must be valid format"). Standards define the rule; thresholds define how often it must be met.

Q: How do I handle a field where quality varies by segment? Set segment-specific thresholds if the variation is meaningful. A B2B contact database might have different phone completeness expectations for enterprise vs. SMB contacts. If managing segment-level thresholds is too complex, use the most conservative threshold that applies across all segments.

Q: What should I do when a field consistently falls just below threshold? Investigate whether the threshold is right or whether there's a real quality problem. If historical data shows the field has never reached the threshold, the threshold may be too high. If it used to meet the threshold and recently declined, there's a real quality issue to investigate.

Q: How should I communicate thresholds to my team? Make them part of your data quality policy: "Email completeness must be ≥ 98% before any campaign send." Document them where they'll be seen — in runbooks, checklists, and data import procedures. The team needs to know the standard before they can meet it.

Q: What's the difference between an alert threshold and a hard stop threshold? An alert threshold triggers a notification when a metric falls below a level that warrants attention. A hard stop threshold blocks a process (a campaign send, an import, a report publication) until the issue is resolved. Alert thresholds are higher; hard stop thresholds are lower and represent a more serious quality failure.

Q: Can I use the same thresholds across all my datasets? You can use consistent thresholds as defaults, but the right thresholds differ by dataset and field based on usage and risk. A single universal threshold is better than no thresholds at all, but field-specific thresholds are more accurate.

Start with your current quality baseline. Set your threshold 5 percentage points below where you currently are. Adjust as you learn more about what "good enough" actually means for each field and use case.

Sohovi shows you your actual completeness, uniqueness, and validity rates for every column when you upload a CSV. That baseline is the foundation for setting meaningful thresholds. Free to try — no credit card, no code required.

Selva Santosh

Data quality, for people who ship

Selva writes practical guides on data quality, profiling, and governance to help teams ship better data.

Start for free

Stop guessing. Start knowing your data quality.

Sohovi profiles your datasets in minutes — surfacing completeness gaps, type mismatches, and duplicate patterns before they reach production.

Try Sohovi free More articles

No credit card required · Free forever plan