
December 10, 2025 · 10 min read

Migrating Enterprise Data to BigQuery: Oracle and Snowflake Patterns

Practical patterns for building reliable batch pipelines from Oracle and Snowflake into BigQuery using Cloud Dataflow and GCS as the staging layer.

Enterprise data doesn't live in one place. In most organisations I've worked with, critical operational data lives in Oracle databases, while analytics teams have already moved some workloads to Snowflake. The goal is to land all of it in BigQuery for a single analytics surface. Here's how we've done it reliably at scale.

The Common Pattern: Dataflow + GCS as the Bridge

Whether the source is Oracle or Snowflake, the architecture is consistent:

  • Extract — pull data from the source system using JDBC (Oracle) or Snowflake's connector.
  • Stage — land raw extracts in Cloud Storage (GCS) as Parquet or Avro files.
  • Transform & Load — a Dataflow job reads from GCS, applies transformations, and writes to BigQuery using batch loads (not streaming inserts, which cost more for large volumes).

GCS as a staging layer is non-negotiable. It decouples extraction from loading, gives you a reprocessing point if something goes wrong, and allows BigQuery to load from GCS at very high throughput using its native bulk load API.
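
One way to make that reprocessing point concrete is to keep staging paths deterministic, so re-running an extract for a given day overwrites the same prefix rather than creating duplicates. A minimal sketch (bucket and layout names are illustrative, not from the actual pipeline):

```python
from datetime import date

def staging_prefix(bucket: str, table: str, extract_date: date) -> str:
    """Deterministic GCS prefix for one table's daily extract.

    Re-running the extract for the same date overwrites the same prefix,
    which is what makes GCS a safe reprocessing point.
    """
    return f"gs://{bucket}/staging/{table}/dt={extract_date:%Y-%m-%d}/"

# A BigQuery batch load job can then read every file under the prefix,
# e.g. staging_prefix("my-bucket", "orders", date(2025, 12, 1)) + "*.parquet"
```

The date-partitioned layout also means a failed load can be retried from GCS without touching the source system at all.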

Oracle to BigQuery

Oracle extractions require careful attention to data types. Oracle's NUMBER type is notoriously flexible — the same column can contain integers and decimals. We map Oracle types to BigQuery types explicitly in our schema config file, reviewed for every table before first load:

  • NUMBER(p,0) → INT64
  • NUMBER(p,s) → NUMERIC
  • VARCHAR2 → STRING
  • DATE → DATETIME (Oracle DATE includes a time component)
  • CLOB → STRING (with size validation)
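
The mapping above can be encoded as a small lookup, with the NUMBER scale check made explicit. This is a hypothetical sketch of such a check, not the actual schema config format:

```python
def oracle_to_bigquery(type_name: str, precision: int = 0, scale: int = 0) -> str:
    """Map an Oracle column descriptor to a BigQuery type name."""
    if type_name == "NUMBER":
        # Scale 0 means whole numbers only; any fractional scale needs NUMERIC.
        return "INT64" if scale == 0 else "NUMERIC"
    mapping = {
        "VARCHAR2": "STRING",
        "DATE": "DATETIME",  # Oracle DATE carries a time component
        "CLOB": "STRING",    # size-validate before load
    }
    try:
        return mapping[type_name]
    except KeyError:
        # Fail loudly rather than guess: every type is reviewed before first load.
        raise ValueError(f"unmapped Oracle type: {type_name}")
```

Raising on unknown types mirrors the review-before-first-load policy: an unmapped column blocks the pipeline instead of silently landing as the wrong type.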

For large tables (100M+ rows), we partition extractions by a date or ID range column. Each partition becomes an independent Dataflow job step, allowing parallel extraction without overloading the source Oracle instance.
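
Splitting an ID range into contiguous partitions is simple enough to sketch; each range below would become a `WHERE id BETWEEN lo AND hi` predicate on its own extraction step (the function is illustrative, assuming a dense numeric key):

```python
def id_ranges(min_id: int, max_id: int, num_partitions: int) -> list[tuple[int, int]]:
    """Split [min_id, max_id] into contiguous ranges, one per Dataflow step."""
    span = max_id - min_id + 1
    step = -(-span // num_partitions)  # ceiling division
    ranges = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step - 1, max_id)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges
```

Capping `num_partitions` is what protects the source instance: parallelism is bounded by the number of ranges, not by the table size.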

Incremental loads use Oracle's high-watermark approach: we store the last-loaded UPDATED_AT timestamp in a metadata table and extract only rows modified since that point. For tables without a reliable audit timestamp, we fall back to full-load with BigQuery MERGE for deduplication.
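
A minimal sketch of the high-watermark extract query, assuming the watermark has already been read from the metadata table (the column and format names here are illustrative):

```python
def incremental_query(table: str, watermark: str) -> str:
    """Render the high-watermark extract query for one Oracle table.

    `watermark` is the last-loaded UPDATED_AT value, read from the
    metadata table before each run.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE UPDATED_AT > TO_TIMESTAMP('{watermark}', 'YYYY-MM-DD HH24:MI:SS')"
    )
```

The strict `>` comparison assumes UPDATED_AT values are not reused at the boundary; a `>=` with downstream deduplication is the safer variant when that assumption does not hold.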

Snowflake to BigQuery

Snowflake extractions are more straightforward. Snowflake's COPY INTO command exports data to GCS directly, bypassing the need for a JDBC connector entirely (this assumes a storage integration for the bucket, named gcs_int here for illustration, has already been configured):

COPY INTO 'gcs://my-bucket/exports/table_name/'
FROM my_schema.my_table
STORAGE_INTEGRATION = gcs_int
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE;

A Dataflow pipeline then picks up the exported Parquet files from GCS and loads them into BigQuery. This approach is fast (Snowflake's parallel export is highly optimised), cost-effective (GCS egress is cheaper than Snowflake's compute for large exports), and simple to orchestrate with Airflow.

For incremental loads, Snowflake Streams provide a reliable CDC mechanism. We consume the stream in scheduled Airflow runs, export the delta to GCS, and apply it to BigQuery using a MERGE statement.
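
The MERGE that applies the delta follows a fixed shape, so it can be rendered per table from the key and column list. A sketch with hypothetical table and column names, handling inserts and updates only (stream deletes would need an additional WHEN MATCHED ... DELETE branch keyed on METADATA$ACTION):

```python
def merge_statement(target: str, staging: str, key: str, cols: list[str]) -> str:
    """Render the BigQuery MERGE that applies a staged delta to the target."""
    sets = ", ".join(f"T.{c} = S.{c}" for c in cols)
    insert_cols = ", ".join([key] + cols)
    insert_vals = ", ".join(f"S.{c}" for c in [key] + cols)
    return (
        f"MERGE {target} T USING {staging} S ON T.{key} = S.{key} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
```

Generating the statement from config keeps the per-table Airflow tasks identical; only the key and column list vary.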

Airflow Orchestration

Both pipelines are orchestrated with Apache Airflow. The DAG structure for each table is:

  1. Extract to GCS (via Dataflow or Snowflake COPY)
  2. Validate file count and row count against source
  3. Trigger BigQuery load job
  4. Run post-load data quality checks (row count, null rate, value distribution)
  5. Update metadata watermark table
  6. Send success/failure notification to Slack

Steps 2 and 4 are critical. We've caught several cases where an upstream Oracle maintenance window caused a partial extract — the validation step caught it before bad data reached the serving layer.
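
The row-count comparison at the heart of that validation step can be sketched as a pure check; the tolerance parameter is an assumption on my part, useful when the source is not quiesced during extraction:

```python
def validate_counts(source_rows: int, staged_rows: int, tolerance: float = 0.0) -> None:
    """Fail the run if the staged row count drifts from the source count.

    tolerance=0.0 enforces an exact match; a small tolerance can absorb
    rows written during a non-quiesced extract window.
    """
    if source_rows == 0 and staged_rows == 0:
        return
    drift = abs(source_rows - staged_rows) / max(source_rows, 1)
    if drift > tolerance:
        raise ValueError(
            f"row count mismatch: source={source_rows} staged={staged_rows} "
            f"(drift {drift:.2%} > tolerance {tolerance:.2%})"
        )
```

Raising inside the Airflow task fails the DAG run before the load step fires, which is exactly the behaviour that stopped the partial Oracle extract from reaching the serving layer.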

Lessons Learned

The biggest gotchas in enterprise-to-BigQuery migrations are always the data, not the infrastructure. Source systems have quirks: Oracle NULL semantics differ from BigQuery's, Snowflake VARIANT columns need explicit schema extraction, date formats vary by region. Budget time to understand the data before writing the pipeline. The code is the easy part.