Data Quality in Real-Time Data Pipelines: Catching Problems as They Happen

Data quality in real-time pipelines means validating each event or record as it arrives in the stream — rather than waiting for a batch job to find problems hours later.

Selva SantoshData quality, for people who ship

May 21, 20269 min read

Data quality in real-time pipelines means validating each event or record as it arrives in the stream — rather than waiting for a batch job to find problems hours later.

In batch systems, a quality failure affects yesterday's data. In real-time pipelines, a quality failure affects what's happening right now — customer-facing features, live dashboards, automated decisions. The stakes are higher, and the window to catch problems before they matter is much shorter.

In this guide

Why real-time pipelines need different quality strategies than batch
The main quality failures that occur in streaming systems
How to validate data inline in a stream
Dead letter queues and failure routing
Monitoring and alerting for real-time quality

Why Batch Validation Doesn't Work in Real-Time

Batch validation runs after data is collected. For a nightly ETL job, that's fine — the data sits in a staging table, you run checks, you either promote or reject. Nothing downstream was using that data while it waited.

Sohovi lets you set up validation rules for any column and instantly see which rows fall outside them — no code or SQL required.

In a real-time pipeline, data flows directly into features and systems that are in use right now. A bad event that passes through undetected doesn't sit in a staging table — it updates a user's recommendation feed, triggers an alert, adjusts a price, or feeds a fraud model. By the time a batch validation job runs tomorrow morning, the damage is done.

Real-time quality requires inline validation — checks that run on each event as it passes through the pipeline, before it reaches any consumer.

Common Quality Failures in Streaming Pipelines

Late-arriving events: Events that arrive out of order or hours after they were generated. A purchase event with a timestamp from yesterday arriving today breaks time-window aggregations.

Schema drift: A producer changes a field name or data type and doesn't notify consumers. The pipeline continues to process events, but a field that was string is now an object — and downstream consumers break silently.

Missing required fields: An event fires without all the required fields populated. The event is structurally valid but informationally incomplete.

Duplicate events: The same event arrives multiple times due to producer retries, network failures, or at-least-once delivery guarantees. Aggregations double-count. Idempotent processing is required.

Sohovi automatically finds every duplicate in your dataset — including near-matches — and shows you exactly which rows are affected.

Volume anomalies: Sudden spikes or drops in event rate. A spike might indicate a producer bug creating phantom events. A drop might indicate an outage in the producing system that's silently not emitting events.

Invalid values: Events with values outside expected ranges — negative quantities, future timestamps on historical events, impossible enum values — that pass structural validation but fail business logic checks.

How to Validate Data Inline in a Stream

Schema validation

Every event should conform to a defined schema. Use a schema registry (Confluent Schema Registry for Kafka, AWS Glue Schema Registry for Kinesis) to enforce schema compatibility. When a producer tries to emit an event with an incompatible schema, the registry rejects it before it enters the stream.

This is the first line of defense — it catches structural problems at the source rather than downstream.

Field-level validation

For each event type, define the fields that must be present, their expected types, and any format or range constraints. Run these checks inline using your stream processor (Flink, Spark Streaming, Kafka Streams, Lambda):

Required fields: reject events where key fields are null
Type checks: confirm integer fields aren't strings, timestamp fields are parseable
Range checks: flag events where numeric values are outside expected bounds
Enum validation: confirm categorical fields contain only allowed values

Deduplication

Use an event ID (if the producer generates one) and a short deduplication window to filter duplicate events. For pipelines that process purchases, sign-ups, or financial transactions, deduplication is not optional — one missed duplicate equals one inflated metric.

Most stream processors support stateful deduplication within a window. Choose window size based on expected producer retry behavior.

Volume monitoring

Track events-per-minute by event type. Set alert thresholds for drops greater than a configured percentage over a rolling window. A 50% drop in checkout events at 2pm on a weekday is not a seasonal pattern — it's an outage.

Tools like Sohovi are useful for validating historical snapshots and batch exports from these pipelines — helping you understand whether what you're seeing in real-time matches historical distribution patterns.

Dead Letter Queues: Where Failing Events Go

A dead letter queue (DLQ) is a separate destination for events that fail validation. Instead of dropping bad events silently or crashing the pipeline, failed events are routed to the DLQ with a reason code attached.

Why DLQs matter:

Bad events are preserved, not lost — you can reprocess them once the issue is fixed
The main pipeline continues processing valid events without interruption
Operations teams get a clear view of what's failing and why
Repeated failure patterns surface systemic upstream problems

DLQ best practices:

Include the original event, the failure reason, and the timestamp it was received
Alert when the DLQ accumulates events above a threshold (don't let DLQs grow silently)
Review DLQ contents after any upstream system change
Build reprocessing pipelines to replay DLQ events after fixes are applied

Monitoring and Alerting for Real-Time Quality

Real-time quality without monitoring is incomplete. You need visibility into what's failing, when, and at what rate.

Metrics to track:

Validation failure rate: percentage of events failing each check, by event type
DLQ growth rate: events added to DLQ per minute/hour
Schema rejection rate: events rejected at the schema registry level
Processing lag: how far behind is the consumer from the producer? Growing lag often indicates a quality problem causing processing slowdown
Late arrival rate: percentage of events arriving outside the expected time window

Alerting rules:

Alert immediately on schema rejection spikes (indicates a producer deployment broke compatibility)
Alert when validation failure rate exceeds baseline by more than X% (indicates a systematic upstream change)
Alert on DLQ growth above threshold
Alert on processing lag exceeding your SLA

Set up dashboards in your observability platform (Grafana, Datadog, CloudWatch) that show these metrics over time. Correlation between pipeline failures and upstream deployment timestamps is often the fastest way to diagnose the root cause.

Frequently Asked Questions

Q: Can I add data quality validation to an existing real-time pipeline without rewriting it? Yes, but it depends on your stack. Most stream processors (Kafka Streams, Flink, Lambda) support adding a validation step to an existing topology. The challenge is deciding what to do with failing events — you need to design a DLQ or rejection flow if one doesn't exist.

Q: What's the difference between schema validation and field-level validation? Schema validation checks the structure of an event — are the right fields present, are the types correct? Field-level validation checks the content — is the value within an acceptable range, does it match a pattern, is it a valid enum value? Both are necessary. Schema validation is enforced at the registry; field-level validation runs in your stream processor.

Q: How do I handle late-arriving events in a real-time pipeline? Define a watermark — the maximum age of an event you'll accept. Events arriving before the watermark are processed normally. Events arriving after are either dropped or routed to a late-arrival handler for separate processing. Most stream processing frameworks (Flink, Spark Structured Streaming) have built-in watermark support.

Q: Should every event type have its own validation rules? Yes. Different event types carry different data with different quality requirements. A click event and a payment event have very different criticality and field requirements. Define validation rules per event type and store them as versioned configuration.

Q: What's the performance impact of inline validation? Simple field checks (null check, type check, range check) add microseconds of latency per event — negligible at all but the most extreme throughputs. Stateful operations (deduplication, join-based referential integrity) add more latency and require memory. For very high-throughput pipelines, profile validation latency and optimize the most expensive checks.

Q: How do I test my validation logic before deploying it to production? Replay historical events through your validation logic in a test environment. Introduce intentionally malformed events (null required fields, out-of-range values, duplicate IDs) and verify they're caught and routed correctly. Load test with production-scale event rates to confirm latency stays within bounds.

Q: What happens if my validation logic itself has a bug? It will either reject valid events (false positives) or pass invalid events (false negatives). Both are costly. Test validation rules thoroughly before deployment, and monitor failure rates in real-time after any rule change. A sudden spike in validation failures after a deployment usually points to a validation rule bug, not an upstream quality problem.

Q: How do I handle duplicate events when producers use at-least-once delivery? Use a deduplication window based on a unique event ID. If the producer doesn't generate event IDs, you may need to construct a composite key from event fields (user_id + event_type + timestamp). The window size should be longer than your producer's maximum retry interval — typically 1 to 30 minutes depending on the system.

Q: Should real-time validation be strict (reject) or lenient (flag)? It depends on the consumer. If the downstream consumer is a financial system or a customer-facing feature that can't tolerate bad data, reject failing events. If the consumer is an analytics pipeline where approximate results are acceptable, flag-and-pass with quality metadata attached to the event. Define the tolerance level per consumer, not universally.

Q: What's the best way to communicate real-time data quality issues to upstream teams? Automate notifications tied to DLQ growth and validation failure spikes. When a producer team deploys a change that breaks schema compatibility, they should receive an alert within minutes — not learn about it from an analyst report the next morning. Tight feedback loops between producers and pipeline quality metrics prevent repeated incidents.

If you want to validate the historical and batch side of your pipeline data, Sohovi profiles CSV exports and data samples instantly — surfacing distribution anomalies, null rates, and format issues before they enter your system. Try it free — no code, no setup required.

Selva Santosh

Data quality, for people who ship

Selva writes practical guides on data quality, profiling, and governance to help teams ship better data.

Start for free

Stop guessing. Start knowing your data quality.

Sohovi profiles your datasets in minutes — surfacing completeness gaps, type mismatches, and duplicate patterns before they reach production.

Try Sohovi free More articles

No credit card required · Free forever plan