
ETL Pipelines

Extract, transform, and load data from sources to destinations.

If you are new here: ETL stands for Extract, Transform, Load — and it describes the fundamental workflow for moving data from where it's produced to where it's analyzed. Your application stores orders in PostgreSQL. Your CRM stores customers in Salesforce. Your payment processor stores transactions in Stripe. None of these systems talk to each other, and none of them are optimized for analytics. An ETL pipeline extracts data from each source, transforms it (cleans, joins, standardizes, enriches), and loads it into a destination built for analytics — typically a data warehouse like Snowflake, BigQuery, or Redshift. From there, your data team runs SQL queries, builds dashboards, trains ML models, and answers business questions that span multiple source systems. Modern practice often reverses the order: ELT (Extract, Load, Transform) loads raw data first and transforms inside the warehouse using tools like dbt — taking advantage of cloud warehouses' massive parallel SQL processing power.

Term — Plain meaning
  • ETL — Extract → Transform → Load: transform data before loading to the destination
  • ELT — Extract → Load → Transform: load raw first, transform inside the warehouse
  • Data warehouse — a database optimized for analytics: columnar storage, fast aggregations, SQL
  • Staging area — temporary storage for extracted raw data before transformation
  • dbt — Data Build Tool: write transforms as SQL SELECT statements, version-controlled
  • Snowflake / BigQuery — cloud data warehouses: serverless, massively parallel SQL, managed
  • Fivetran / Airbyte — managed ELT connectors that handle extraction from 100+ sources
  • Orchestration — scheduling and monitoring pipelines; Apache Airflow is the standard
  • Data lineage — tracking which sources produced which outputs; essential for debugging

The Problem

Every production application is optimized for writing data, not reading it for analysis. PostgreSQL is great for OLTP (Online Transaction Processing) — fast inserts, point lookups, ACID transactions. It's not designed for running GROUP BY and aggregations over 500 million rows. Worse, your business data is split across a dozen systems:

  • Customer profiles: Salesforce CRM
  • Orders and inventory: your application's PostgreSQL
  • Payments: Stripe
  • Marketing events: Mixpanel or Amplitude
  • Support tickets: Zendesk
  • Infrastructure costs: AWS billing

Answering "what's the lifetime value of customers who churned last quarter?" requires joining data from at least 3 of these systems. Doing this manually — exporting CSVs, loading into Excel, writing Python scripts — is slow, error-prone, and can't scale to run automatically every day.

ETL pipelines move all this data into one place, in a common schema, on a defined schedule, so analysts and data scientists can work with a unified, clean, up-to-date view of the business.

In plain terms: ETL is the plumbing of your data infrastructure. Just as your house's plumbing brings water from the municipal supply to every faucet, ETL brings data from every source system to your analytics layer — cleaned and formatted for immediate use.

Analogy: a supermarket's supply chain. Fruits come from farms (Extract), are inspected, sorted, and washed at a distribution center (Transform), then put on shelves where customers can find them (Load). You don't go to the farm to buy a banana — you go to the store where everything is organized for your convenience.

Step 1: Extract

Extraction is pulling data from source systems. Each source has its own interface and its own challenges:

Full snapshot: read the entire table (e.g., SELECT * FROM orders). Simple but expensive — for a 100-million-row table, this takes significant time and puts load on the production database. Acceptable for small tables or infrequent runs.

Incremental extraction: only fetch rows that have changed since the last run, using a WHERE updated_at > last_run_timestamp filter. Much faster, but requires a reliable updated_at column and careful tracking of the last-run watermark.
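A minimal sketch of incremental extraction, using an in-memory SQLite database as a stand-in for the production source. The `orders` table and its columns are illustrative; the key idea is the `WHERE updated_at > ?` filter plus careful watermark tracking:

```python
import sqlite3

def extract_incremental(conn, last_watermark):
    """Fetch only rows changed since the last run, ordered so the
    newest row's updated_at becomes the next watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only if we actually saw new rows.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo with an in-memory source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02"), (3, 30.0, "2024-01-03")],
)

rows, wm = extract_incremental(conn, "2024-01-01")  # last run saw up to Jan 1
# Only orders 2 and 3 are fetched; the watermark advances to "2024-01-03".
```

In production the watermark would be persisted (e.g. in a state table) between runs, not held in a local variable.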

Change Data Capture (CDC): read the database's write-ahead log to capture every INSERT, UPDATE, and DELETE as it happens (Debezium is the standard tool). Real-time, low production impact, but more complex to set up.

API extraction: many SaaS sources (Stripe, Salesforce, HubSpot) only expose data via REST APIs. Extractors must handle pagination, rate limits, and auth token refresh. Managed connectors (Fivetran, Airbyte, Stitch) handle this for 100+ common sources.
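The pagination loop at the heart of API extraction can be sketched as follows. `fetch_page` is a hypothetical stand-in for a real HTTP call (a production extractor would add auth headers and retry on HTTP 429 rate-limit responses), but the cursor-walking structure is what Stripe-style APIs require:

```python
import time

def fetch_page(cursor):
    """Stand-in for a SaaS REST API call, e.g. GET /v1/charges?starting_after=...
    Returns a page of records plus a cursor for the next page (None when done)."""
    fake_api = {None: ([1, 2], "a"), "a": ([3, 4], "b"), "b": ([5], None)}
    records, next_cursor = fake_api[cursor]
    return {"data": records, "next_cursor": next_cursor}

def extract_all(delay=0.0):
    """Walk cursor-based pagination until the API reports no more pages."""
    records, cursor = [], None
    while True:
        page = fetch_page(cursor)
        records.extend(page["data"])
        cursor = page["next_cursor"]
        if cursor is None:
            return records
        time.sleep(delay)  # crude rate limiting between requests

all_records = extract_all()
# Collects every record across the three pages: [1, 2, 3, 4, 5]
```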

In plain terms: extraction is about reading from sources that weren't designed to be read at scale, in a way that doesn't break them or miss data.

Staging area: extracted data goes first to a staging area — raw, unmodified. This is your source of truth for the raw data. If a transform has a bug, you can reprocess from staging without re-extracting from production. S3 buckets or schema-isolated "raw" tables in the warehouse are common staging areas.
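A sketch of the staging step, with the local file system standing in for an S3 bucket. The path layout (`source/run_date/`) and file name are illustrative conventions, not a standard:

```python
import json
import pathlib
import tempfile
from datetime import date

def stage_raw(records, source, staging_root):
    """Land the raw extract untouched under staging_root/<source>/<run_date>/.
    No cleaning happens here: staging is the source of truth for raw data."""
    run_date = date(2024, 1, 15)  # fixed run date for the demo
    path = pathlib.Path(staging_root) / source / run_date.isoformat()
    path.mkdir(parents=True, exist_ok=True)
    out = path / "records.json"
    out.write_text(json.dumps(records))  # written as-is, no transformation
    return out

root = tempfile.mkdtemp()
f = stage_raw([{"id": 1, "amount_cents": 1000}], "stripe", root)
# Raw data now sits at <root>/stripe/2024-01-15/records.json; transform jobs
# read from here, so a buggy transform never forces a re-extract from production.
```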

Step 2: Transform

Transformation is where the real work happens. Raw data from source systems is messy:

Data quality issues: null values where they shouldn't be, inconsistent formats ("Jan 1, 2024" vs "2024-01-01" vs 1704067200), duplicate records (same event logged twice), invalid values (age = -5, price = -$100).

Schema inconsistencies: Salesforce stores customer IDs as "ACCT-001234"; your orders database uses integer IDs like 42; they refer to the same customer but share no common key. Resolving this requires matching logic.

Unit and currency differences: prices in multiple currencies, weights in different units, timestamps in different timezones.

Business logic: "revenue" might mean gross revenue in the orders table, but your analysts want net revenue (after refunds). Computing this requires joining orders with refunds and applying the business definition.

Transformation tools:

SQL inside the warehouse (ELT pattern): fast, maintainable, version-controlled with dbt. For example:

CREATE TABLE analytics.daily_revenue AS
SELECT date_trunc('day', created_at), sum(amount_usd)
FROM raw.orders
GROUP BY 1

Python/PySpark pipelines: for complex transformations (ML feature engineering, text processing, fuzzy matching), Python scripts running on Spark or pandas.

In plain terms: transformation is like mise en place in cooking — getting all ingredients prepped, uniformly cut, and organized before cooking starts. The cooking (analysis) is fast because prep was done properly.

Concrete sketch: a fintech startup extracts transaction data from their Stripe account and card processor. Raw data has: amounts in USD cents (Stripe) and USD dollars (card processor), timestamps in UTC and US/Eastern, merchant names in raw form ("SBUX #12345" = Starbucks). Transformation: unify to USD dollars, convert to UTC, apply merchant normalization rules (regex + lookup table), deduplicate on transaction fingerprint. The result: a clean transactions table analysts can query without thinking about data quality.
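The fintech transform above can be sketched in Python. The merchant rules, field names, and the simplified fixed-offset Eastern timezone are illustrative (real code would use `zoneinfo` and a fuller rule table), but the four steps — unify amounts, convert to UTC, normalize merchants, deduplicate — are exactly as described:

```python
import re
from datetime import datetime, timezone, timedelta

# Illustrative normalization rules: regex pattern -> canonical merchant name.
MERCHANT_RULES = [(re.compile(r"^SBUX #\d+$"), "Starbucks")]
EASTERN = timezone(timedelta(hours=-5))  # simplified; real code uses zoneinfo

def normalize(txn):
    """Unify one raw transaction: cents -> dollars, local time -> UTC,
    raw merchant string -> canonical name."""
    amount = txn["amount_cents"] / 100 if "amount_cents" in txn else txn["amount_usd"]
    ts = txn["ts"].astimezone(timezone.utc)
    merchant = txn["merchant"]
    for pattern, canonical in MERCHANT_RULES:
        if pattern.match(merchant):
            merchant = canonical
    return {"fingerprint": txn["fingerprint"], "amount_usd": amount,
            "ts": ts, "merchant": merchant}

def transform(raw_txns):
    """Normalize every record, then deduplicate on transaction fingerprint."""
    seen, clean = set(), []
    for txn in map(normalize, raw_txns):
        if txn["fingerprint"] not in seen:
            seen.add(txn["fingerprint"])
            clean.append(txn)
    return clean

raw = [
    {"fingerprint": "t1", "amount_cents": 575, "merchant": "SBUX #12345",
     "ts": datetime(2024, 1, 1, 9, 0, tzinfo=EASTERN)},   # Stripe: cents, Eastern
    {"fingerprint": "t1", "amount_cents": 575, "merchant": "SBUX #12345",
     "ts": datetime(2024, 1, 1, 9, 0, tzinfo=EASTERN)},   # same event logged twice
    {"fingerprint": "t2", "amount_usd": 12.0, "merchant": "ACME CO",
     "ts": datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)},  # processor: dollars, UTC
]
clean = transform(raw)
# Two unique transactions remain: Starbucks at $5.75 (14:00 UTC) and ACME CO at $12.00.
```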

Step 3: Load

The transformed data is loaded into the destination — almost always a data warehouse for analytics workloads:

Data warehouses are purpose-built for analytics:

  • Columnar storage: data is stored column-by-column, not row-by-row. A query that sums the amount column reads only the amount column from disk — not all other columns. 10–100× faster for analytics aggregations.
  • Massively parallel processing (MPP): queries run distributed across many compute nodes simultaneously. A billion-row aggregation that takes minutes on a single PostgreSQL instance can finish in seconds on Snowflake.
  • Separation of storage and compute: Snowflake, BigQuery, and Redshift all store data in S3/GCS/Azure Blob and spin up compute only when needed. You pay for storage constantly but compute only when running queries.
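A toy illustration of the columnar idea, using plain Python lists as stand-ins for on-disk layout. In the row layout, summing `amount` still walks every record (and on disk would read every field); in the column layout, the sum touches only the `amount` array:

```python
# Row store: each record is one tuple of (id, name, amount).
rows = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]
total_row_store = sum(r[2] for r in rows)  # must iterate whole records

# Column store: each column lives in its own contiguous array.
columns = {
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [10.0, 20.0, 30.0],
}
total_col_store = sum(columns["amount"])  # reads only the amount column

assert total_row_store == total_col_store == 60.0
```

On real hardware the column layout additionally compresses far better (similar values stored together), which is part of where the 10–100× speedup comes from.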

Load patterns:

  • Full refresh: truncate the table, reload everything. Simple but slow for large tables.
  • Append: add only new rows. Works for fact tables (orders, events) where history doesn't change.
  • Upsert (merge): update existing rows if key exists, insert if not. Needed when source records can change (customer email update).
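The upsert pattern can be sketched with SQLite, whose `INSERT ... ON CONFLICT` clause stands in for a warehouse `MERGE` statement (the `customers` table is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

def upsert(conn, records):
    """Insert new rows; on key conflict, update the existing row instead."""
    conn.executemany(
        "INSERT INTO customers (id, email) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
        records,
    )

upsert(conn, [(1, "a@old.com"), (2, "b@x.com")])
upsert(conn, [(1, "a@new.com"), (3, "c@x.com")])  # customer 1 changed email

result = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
# → [(1, 'a@new.com'), (2, 'b@x.com'), (3, 'c@x.com')]
```

A full refresh would instead `DELETE FROM customers` and reload; append would use a plain `INSERT`. Upsert is the one that preserves history-free dimension tables correctly when source records mutate.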

Tiny example: Airbnb's data warehouse. Every booking, review, message, and host action is extracted from production services, transformed through hundreds of dbt models (cleaning, joining, denormalization), and loaded into their Hive/Presto warehouse. Data scientists query the warehouse to understand host supply vs guest demand, pricing optimization, and fraud signals — without touching production databases.

ELT: The Modern Pattern

Traditional ETL transforms data before loading it, using a separate compute layer (Spark, custom Python). Modern cloud warehouses are so powerful that it's often better to load raw data first and transform inside the warehouse:

Why ELT is preferred now:

  • Cloud warehouses are fast: Snowflake can run complex SQL transforms over 100GB in seconds — faster than most ETL frameworks
  • Easier reprocessing: if your transform logic changes, re-run the SQL from the raw data — no need to re-extract
  • SQL is universal: every data analyst knows SQL; fewer engineers know Spark
  • dbt makes it maintainable: dbt (Data Build Tool) lets you write transforms as SQL SELECT statements with version control, testing, and documentation

dbt workflow:

-- models/marts/daily_revenue.sql
SELECT
  date_trunc('day', created_at) AS date,
  sum(amount_usd)               AS gross_revenue,
  sum(amount_usd) - sum(refund_usd) AS net_revenue,
  count(*)                      AS order_count
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1

dbt compiles this to CREATE TABLE marts.daily_revenue AS SELECT ..., runs it in Snowflake, and runs any tests you've defined against the output (e.g. revenue should be > 0, date should not be null).

Modern ELT stack: Fivetran or Airbyte (extract and load) + Snowflake/BigQuery (warehouse) + dbt (transform) + Tableau/Metabase (BI). Each component is best-in-class, composable, and well-supported.

The Trade-offs

Property           | Traditional ETL                 | Modern ELT
Transform location | Separate compute (Spark)        | Inside the warehouse (SQL)
Storage cost       | Lower — only clean data stored  | Higher — raw data also stored
Reprocessing       | Re-run pipeline from source     | Re-run SQL from raw staging
Skill required     | Python/Spark                    | SQL + dbt
Flexibility        | High — Python is unlimited      | SQL has limits (ML, text processing)
When to use        | Complex non-SQL transforms      | Standard data engineering

Why this matters for you

If you're building any product with more than a handful of data sources, you need a data pipeline. Start with ELT: use Fivetran or Airbyte to sync your production databases and SaaS tools to BigQuery or Snowflake; use dbt to write your transformation logic in SQL. This setup can be operational in a week and will handle petabytes when you need it. The second-most important decision: use Airflow (or Prefect, or Dagster) for orchestration — it handles scheduling, retries, alerting on failures, and data lineage tracking across your entire pipeline DAG.

Next: MapReduce — the programming model that showed how to process petabytes of data by distributing computation across thousands of machines.

Diagram (frame 1 of 6): Data silos — every system stores data in its own format, schema, and location. Answering cross-system questions is manual, error-prone, and slow.