Over three years at Ford Motor Company and more recently at Tredence, I've been involved in building and operating GCP data platforms that process terabytes of data daily. The infrastructure choices we made — or didn't make — early on had compounding effects. Here's what I'd do from the start.
BigQuery: Design for Query Cost, Not Just Storage
The most impactful BigQuery decisions are made at table design time, not query time.
Partitioning
Every large table should be partitioned, almost without exception. Partition by ingestion date (_PARTITIONTIME) for event data, or by a date column in the data itself for transactional records. A query that can filter by partition scans a fraction of the data — and BigQuery charges by data scanned.
We had a table at Ford with 8 years of vehicle telemetry: 2 trillion rows, ~40TB. Unpartitioned, a "last 7 days" query cost $200 and took 4 minutes. Partitioned by day, the same query cost $3 and ran in 8 seconds.
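The arithmetic behind those numbers is simple and worth internalising. A minimal sketch, assuming the on-demand rate of roughly $5 per TB scanned that the figures above imply (actual pricing varies by region and edition):

```python
# BigQuery on-demand billing is proportional to bytes scanned.
# The $5/TB rate is an assumption, consistent with the numbers above.
ON_DEMAND_USD_PER_TB = 5.0

def query_cost_usd(tb_scanned: float) -> float:
    """Estimated on-demand cost for a query scanning tb_scanned TB."""
    return tb_scanned * ON_DEMAND_USD_PER_TB

# Unpartitioned: a "last 7 days" filter still scans all 40 TB.
print(query_cost_usd(40.0))   # 200.0
# Partitioned by day: only the ~0.6 TB in the 7 matching partitions is read.
print(query_cost_usd(0.6))    # 3.0
```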
Clustering
After partitioning, add clustering on the columns most frequently used in WHERE and JOIN clauses. Clustering physically co-locates rows with the same cluster-key values, letting BigQuery prune blocks that don't match your filter — an additional free optimisation on top of partitioning.
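Both decisions are made once, in the table DDL. A sketch with illustrative table and column names (not the actual Ford schema):

```sql
-- Partition on the date column queries filter by; cluster on the
-- columns most common in WHERE and JOIN. Names are illustrative.
CREATE TABLE telemetry.vehicle_events (
  vehicle_id   STRING,
  signal_name  STRING,
  signal_value FLOAT64,
  event_date   DATE
)
PARTITION BY event_date
CLUSTER BY vehicle_id, signal_name
OPTIONS (require_partition_filter = TRUE);
```

require_partition_filter rejects queries that omit a partition filter, a cheap guardrail against accidental full-table scans.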
Slot Reservations vs On-Demand
For predictable, high-volume workloads, committed slot reservations are significantly cheaper than on-demand pricing. We moved our production analytics workload to a 500-slot reservation at Ford and reduced monthly BigQuery spend by 60% despite query volume doubling. The break-even point is roughly when you're spending $2,000+/month on on-demand queries.
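To see where that break-even sits, a back-of-the-envelope comparison (the prices are illustrative assumptions; real slot pricing depends on edition, region, and commitment term):

```python
# Break-even sketch between on-demand and a slot commitment.
# Both rates are assumptions for illustration: ~$5/TB on-demand and
# ~$2,000/month for a baseline commitment.
ON_DEMAND_USD_PER_TB = 5.0
COMMITMENT_USD_PER_MONTH = 2000.0

def cheaper_option(tb_scanned_per_month: float) -> str:
    """Which pricing model is cheaper for a given monthly scan volume."""
    on_demand = tb_scanned_per_month * ON_DEMAND_USD_PER_TB
    return "commitment" if on_demand > COMMITMENT_USD_PER_MONTH else "on-demand"

print(cheaper_option(100))    # $500/month on-demand  -> "on-demand"
print(cheaper_option(1000))   # $5,000/month on-demand -> "commitment"
```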
Dataflow Pipeline Patterns
Three patterns that have worked well across every project:
1. Separate Ingestion from Transformation
Land raw data in GCS first. Transform in a second Dataflow step. This gives you a reprocessing point: if your transformation logic is wrong, you can fix it and replay from GCS without re-extracting from the source. It sounds obvious, but in practice teams often skip the staging layer to save time and regret it when a bug requires a full re-extract.
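The shape of the pattern can be sketched in plain Python, with in-memory dicts standing in for the GCS buckets and the Beam pipelines omitted (record shape and names are illustrative):

```python
import json

# Two-stage pattern sketch. GCS is simulated with dicts; in production
# these are gs:// buckets written by separate Dataflow jobs.
raw_zone: dict[str, str] = {}        # stands in for the raw staging bucket
curated_zone: dict[str, dict] = {}   # stands in for the curated bucket

def ingest(source_records: list[str]) -> None:
    """Stage 1: land source data verbatim. No parsing, no business logic."""
    for i, rec in enumerate(source_records):
        raw_zone[f"events/{i}.json"] = rec

def transform() -> None:
    """Stage 2: parse and reshape. Re-runnable after a logic fix."""
    for path, rec in raw_zone.items():
        parsed = json.loads(rec)
        curated_zone[path] = {"vehicle_id": parsed["vehicle_id"].upper()}

ingest(['{"vehicle_id": "vin-1"}', '{"vehicle_id": "vin-2"}'])
transform()   # found a bug? fix transform() and simply call it again:
transform()   # the replay reads from raw_zone, never the source system
```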
2. Use Deadletter Queues
Every Dataflow pipeline should route unparseable or invalid records to a GCS deadletter bucket, not fail the entire job. An anomalous record is a signal to investigate, not a reason to halt 10 million good records. Alert on deadletter volume and review regularly.
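In Beam this is usually implemented with tagged side outputs; the routing logic itself is simple enough to sketch in plain Python (the record shape and validation rule are illustrative):

```python
import json

def route(records: list[str]) -> tuple[list[dict], list[str]]:
    """Split a batch into parseable records and deadletter candidates."""
    good, dead = [], []
    for rec in records:
        try:
            parsed = json.loads(rec)
            if "vehicle_id" not in parsed:
                raise ValueError("missing vehicle_id")
            good.append(parsed)
        except ValueError:    # json.JSONDecodeError subclasses ValueError
            dead.append(rec)  # in production: write to the deadletter bucket
    return good, dead

good, dead = route(['{"vehicle_id": "vin-1"}', 'not json at all', '{}'])
# one good record, two deadlettered; alert when deadletter volume spikes
```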
3. Idempotent Writes
Design your BigQuery write logic so that re-running a pipeline produces the same result as running it once. Use partition-level truncate-and-reload for batch jobs, or the BigQuery Storage Write API in committed mode for streaming. Idempotency makes debugging, recovery, and re-runs stress-free.
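A sketch of the batch variant, with the table modelled as a dict keyed by partition date; in BigQuery this corresponds to loading with a partition decorator (table$20240101) and the WRITE_TRUNCATE disposition:

```python
# Truncate-and-reload at partition granularity: assignment replaces the
# partition's contents, so a re-run converges to the same state.
table: dict[str, list[dict]] = {}

def load_partition(partition_date: str, rows: list[dict]) -> None:
    table[partition_date] = rows   # replace, never append

rows = [{"vehicle_id": "vin-1"}, {"vehicle_id": "vin-2"}]
load_partition("2024-01-01", rows)
load_partition("2024-01-01", rows)   # re-run after a failure: same result
```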
Airflow Orchestration at Scale
At Tredence, we run 200+ production DAGs in Airflow. A few practices that keep it manageable:
- One DAG per logical domain. Don't create monolithic DAGs with 50 tasks. Smaller, focused DAGs are easier to debug and retry at the right granularity.
- Use sensors sparingly. BigQueryTablePartitionExistenceSensor and similar sensors hold a worker slot while polling (unless run in reschedule mode). Use them for cross-DAG dependencies, but prefer Airflow's Dataset-based scheduling (v2.4+) for data-driven triggers.
- SLA monitoring. Set SLAs on critical DAGs. Airflow sends alerts when a DAG run doesn't complete within the expected window — essential for pipelines that feed daily reporting.
- Parameterise everything. Use Airflow Variables or Connections for environment-specific values. The same DAG code should run in dev, staging, and prod by changing a config, not the code.
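A sketch of the idea, with a plain dict standing in for Airflow Variables (in a real DAG the lookup would be Variable.get() or a Connection; all names here are illustrative):

```python
import os

# Environment-specific values live in config; DAG code stays identical.
CONFIG = {
    "dev":  {"project": "analytics-dev",  "dataset": "staging_events"},
    "prod": {"project": "analytics-prod", "dataset": "events"},
}

def destination_table(table: str) -> str:
    """Fully qualified table ID for the current deployment environment."""
    env = os.environ.get("DEPLOY_ENV", "dev")
    cfg = CONFIG[env]
    return f"{cfg['project']}.{cfg['dataset']}.{table}"

# Same code in every environment; only DEPLOY_ENV changes at deploy time.
print(destination_table("vehicle_events"))
```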
Data Governance from Day One
Retroactively applying data governance is painful. Start with:
- Column-level access policies in BigQuery for PII fields
- Data Catalog tags on all datasets describing owner, SLA, and sensitivity level
- Audit logs routed to BigQuery for compliance querying
At Ford, implementing GDPR-compliant data access controls after the fact required a three-month effort across 200+ tables. Starting with governance built into the pipeline design would have taken a fraction of that time.
Summary
A well-designed GCP data platform is partitioned and clustered in BigQuery, builds on staged Dataflow pipelines, orchestrates cleanly with Airflow, and embeds governance from the start. These aren't optional polish — they're the foundation that determines whether your platform grows gracefully or becomes a maintenance burden.