Real-time data is only valuable if it arrives reliably and in the right shape. Over the past few years working on GCP-based streaming architectures, I've built and operated pipelines that move millions of events per day from Kafka through Cloud Dataflow into BigQuery. Here's what I've learned.
The Architecture
The canonical pattern we use looks like this:
- Kafka — event ingestion layer. Producers publish raw events (JSON or Avro) to Kafka topics with a schema registered in Confluent Schema Registry.
- Cloud Dataflow (Apache Beam) — the transformation engine. A Beam pipeline reads from Kafka, applies business logic (deduplication, enrichment, field-level transformations), and writes to BigQuery.
- BigQuery — the serving layer. Partitioned and clustered tables support sub-second queries for dashboards and operational analytics.
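Wired together in Beam Java, the pattern above looks roughly like this. This is a minimal sketch, not our production code: the broker address, topic, table name, and the `ParseEvent` transform are illustrative placeholders, and running it requires the Beam Kafka and GCP IO dependencies.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EventsToBigQuery {
  // Placeholder transform: real pipelines do deduplication, enrichment,
  // and field-level transformations here.
  static class ParseEvent extends SimpleFunction<KV<String, String>, TableRow> {
    @Override
    public TableRow apply(KV<String, String> record) {
      return new TableRow().set("payload", record.getValue());
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")            // illustrative
            .withTopic("events.raw")                        // illustrative
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .commitOffsetsInFinalize()                      // commit offsets on checkpoint
            .withoutMetadata())
        .apply("Transform", MapElements.via(new ParseEvent()))
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:analytics.events")              // illustrative
            .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```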
Why Dataflow Over Alternatives?
We evaluated Apache Flink and Spark Structured Streaming. Dataflow won for one primary reason: fully managed autoscaling. At peak load — say, a product launch or a payment processing surge — Dataflow automatically spins up additional workers. We don't page anyone. With Flink on GKE, we had to pre-provision capacity and tune backpressure settings manually.
Dataflow's native integration with GCP services (Cloud Monitoring, Cloud Logging, BigQuery) also dramatically reduces operational overhead. Metric dashboards for pipeline lag, throughput, and element counts are available out of the box.
Exactly-Once Delivery to BigQuery
Exactly-once semantics between Kafka and BigQuery requires careful design. Dataflow's Kafka connector supports reading with committed offsets, and the BigQuery Storage Write API (in committed mode) supports idempotent writes using stream offsets. Together, these give you true exactly-once delivery without needing application-level deduplication in most cases.
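To see why offset-based writes are idempotent, here is a toy model in plain Java (not the Storage Write API itself): an append tagged with a stream offset that has already been committed is a no-op, so a retried batch cannot double-write rows.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of offset-based idempotent appends. The real Storage Write API
// also rejects out-of-order offsets; this sketch only shows the retry case.
public class OffsetIdempotence {
  private final List<String> committedRows = new ArrayList<>();
  private long nextOffset = 0; // next offset the stream will accept

  /** Appends rows at the given stream offset; already-committed offsets are no-ops. */
  public boolean append(long offset, List<String> rows) {
    if (offset < nextOffset) {
      return false; // a retried batch lands here and is ignored
    }
    committedRows.addAll(rows);
    nextOffset = offset + rows.size();
    return true;
  }

  public List<String> rows() {
    return committedRows;
  }

  public static void main(String[] args) {
    OffsetIdempotence stream = new OffsetIdempotence();
    stream.append(0, List.of("a", "b")); // first attempt commits
    stream.append(0, List.of("a", "b")); // retry of the same batch is ignored
    stream.append(2, List.of("c"));      // next batch commits at offset 2
    System.out.println(stream.rows());   // [a, b, c]
  }
}
```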
We do apply deduplication when ingesting from Kafka topics whose producers offer only at-least-once delivery guarantees. In those cases, we run a 24-hour deduplication window via BigQuery MERGE scheduled queries, keyed on a business event ID.
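A scheduled MERGE of that shape might look like the following. The dataset, table, and column names here are hypothetical, not from our actual schema:

```sql
-- Illustrative scheduled MERGE: fold the last 24 hours of staged rows into the
-- serving table, keeping only the first occurrence of each business event ID.
MERGE `analytics.events` AS target
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingest_ts) AS rn
    FROM `analytics.events_staging`
    WHERE ingest_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  )
  WHERE rn = 1
) AS source
ON target.event_id = source.event_id
WHEN NOT MATCHED THEN
  INSERT ROW
```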
Handling Schema Evolution
Schema changes are inevitable. Our approach:
- All Kafka messages use Avro with backward-compatible schema evolution enforced at the registry level.
- Dataflow pipelines use `BigQueryIO.Write.withSchemaUpdateOptions()` to allow new nullable fields to be added to the target table automatically.
- Breaking changes (field removals, type changes) go through a versioned topic migration: the old pipeline drains, a new topic and pipeline spin up in parallel, and cutover happens when the new pipeline reaches steady state.
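Configuring the sink for additive changes looks roughly like this. A sketch only, with an illustrative table name, and it assumes the Beam GCP IO dependency is on the classpath:

```java
import java.util.EnumSet;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import com.google.api.services.bigquery.model.TableRow;

BigQueryIO.Write<TableRow> write = BigQueryIO.writeTableRows()
    .to("my-project:analytics.events")  // illustrative
    .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
    // New nullable fields in incoming records are added to the table schema
    // instead of failing the write.
    .withSchemaUpdateOptions(EnumSet.of(
        BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
    // Fields not yet in the table schema are dropped rather than rejected.
    .ignoreUnknownValues();
```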
Observability
We instrument every pipeline with:
- Consumer lag per Kafka topic-partition (alerted if lag exceeds 5 minutes)
- Dataflow system lag and data watermark delay
- BigQuery streaming buffer flush latency
- Dead-letter queue (GCS bucket) message count — any unparseable message goes here
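The first alert rule above reduces to a simple threshold check per topic-partition. A minimal sketch in plain Java, where the partition names and lag values are illustrative (in production the lag figures come from Cloud Monitoring, not from this code):

```java
import java.util.List;
import java.util.Map;

// Flags any Kafka topic-partition whose consumer lag exceeds the alert threshold.
public class LagAlert {
  static final long THRESHOLD_MS = 5 * 60 * 1000; // alert past 5 minutes of lag

  /** Returns the sorted names of partitions whose lag exceeds the threshold. */
  public static List<String> breaching(Map<String, Long> lagMsByPartition) {
    return lagMsByPartition.entrySet().stream()
        .filter(e -> e.getValue() > THRESHOLD_MS)
        .map(Map.Entry::getKey)
        .sorted()
        .toList();
  }

  public static void main(String[] args) {
    Map<String, Long> lag = Map.of("events-0", 12_000L, "events-1", 420_000L);
    System.out.println(breaching(lag)); // only events-1 exceeds 5 minutes
  }
}
```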
Cloud Monitoring custom dashboards surface all of this in one place. PagerDuty alerts fire on lag spikes before they become user-visible incidents.
Key Takeaways
Streaming to BigQuery via Dataflow is mature, operationally simple, and scales transparently. The investment is in upfront design: schema contracts, exactly-once guarantees, and observability. Get those right and the pipeline largely runs itself.