Extract, transform, and load data from sources to destinations.
If you are new here: ETL stands for Extract, Transform, Load — and it describes the fundamental workflow for moving data from where it's produced to where it's analyzed. Your application stores orders in PostgreSQL. Your CRM stores customers in Salesforce. Your payment processor stores transactions in Stripe. None of these systems talk to each other, and none of them are optimized for analytics. An ETL pipeline extracts data from each source, transforms it (cleans, joins, standardizes, enriches), and loads it into a destination built for analytics — typically a data warehouse like Snowflake, BigQuery, or Redshift. From there, your data team runs SQL queries, builds dashboards, trains ML models, and answers business questions that span multiple source systems. Modern practice often swaps the last two steps: ELT (Extract, Load, Transform) loads raw data first and transforms it inside the warehouse using tools like dbt — taking advantage of cloud warehouses' massively parallel SQL processing power.
| Term | Plain meaning |
|---|---|
| ETL | Extract → Transform → Load: transform data before loading to destination |
| ELT | Extract → Load → Transform: load raw first, transform inside the warehouse |
| Data warehouse | A database optimized for analytics — columnar storage, fast aggregations, SQL |
| Staging area | Temporary storage for extracted raw data before transformation |
| dbt | Data Build Tool — write transforms as SQL SELECT statements, version-controlled |
| Snowflake / BigQuery | Cloud data warehouses — serverless, massively parallel SQL, managed |
| Fivetran / Airbyte | Managed ELT connectors — handle extraction from 100+ sources |
| Orchestration | Scheduling and monitoring pipelines — Apache Airflow is the standard |
| Data lineage | Tracking which sources produced which outputs — essential for debugging |
Every production application is optimized for writing data, not reading it for analysis. PostgreSQL is great for OLTP (Online Transaction Processing) — fast inserts, point lookups, ACID transactions. It's not designed for running GROUP BY aggregations over 500 million rows. Worse, your business data is split across a dozen systems — orders in PostgreSQL, customers in Salesforce, payments in Stripe, and more — each storing data in its own format, schema, and location: data silos.
Answering "what's the lifetime value of customers who churned last quarter?" requires joining data from at least 3 of these systems. Doing this manually — exporting CSVs, loading into Excel, writing Python scripts — is slow, error-prone, and can't scale to run automatically every day.
ETL pipelines move all this data into one place, in a common schema, on a defined schedule, so analysts and data scientists can work with a unified, clean, up-to-date view of the business.
In plain terms: ETL is the plumbing of your data infrastructure. Just as your house's plumbing brings water from the municipal supply to every faucet, ETL brings data from every source system to your analytics layer — cleaned and formatted for immediate use.
Analogy: a supermarket's supply chain. Fruits come from farms (Extract), are inspected, sorted, and washed at a distribution center (Transform), then put on shelves where customers can find them (Load). You don't go to the farm to buy a banana — you go to the store where everything is organized for your convenience.
Extraction is pulling data from source systems. Each source has its own interface and its own challenges:
Full snapshot: read the entire table (e.g., `SELECT * FROM orders`). Simple but expensive — for a 100-million-row table, this takes significant time and puts load on the production database. Acceptable for small tables or infrequent runs.
Incremental extraction: only fetch rows that have changed since the last run, using a `WHERE updated_at > last_run_timestamp` filter. Much faster, but it requires a reliable `updated_at` column and careful tracking of the last-run watermark (see the sketch after this list).
Change Data Capture (CDC): read the database's write-ahead log to capture every INSERT, UPDATE, and DELETE as it happens (Debezium is the standard tool). Real-time, low production impact, but more complex to set up.
API extraction: many SaaS sources (Stripe, Salesforce, HubSpot) only expose data via REST APIs. Extractors must handle pagination, rate limits, and auth token refresh. Managed connectors (Fivetran, Airbyte, Stitch) handle this for 100+ common sources.
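To make the incremental and API patterns above concrete, here is a minimal Python sketch of an extractor. The endpoint URL, the `updated_since` and `starting_after` parameters, the response shape, and the watermark file are illustrative assumptions rather than any particular vendor's API — managed connectors handle far more edge cases than this.

```python
import json
import time
from pathlib import Path

import requests

STATE_FILE = Path("state/orders_watermark.json")  # where the last-run watermark lives
API_URL = "https://api.example.com/v1/orders"     # hypothetical REST source

def load_watermark() -> str:
    """Return the updated_at value we stopped at last run (epoch start on first run)."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["watermark"]
    return "1970-01-01T00:00:00Z"

def save_watermark(value: str) -> None:
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps({"watermark": value}))

def extract_orders(api_key: str) -> list[dict]:
    """Incremental extract: only rows updated since the last run, with pagination."""
    watermark = load_watermark()
    rows, cursor = [], None
    while True:
        params = {"updated_since": watermark, "limit": 100}
        if cursor:
            params["starting_after"] = cursor          # cursor-based pagination (assumed)
        resp = requests.get(
            API_URL,
            params=params,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        if resp.status_code == 429:                    # rate limited: back off, then retry
            time.sleep(int(resp.headers.get("Retry-After", "5")))
            continue
        resp.raise_for_status()
        page = resp.json()["data"]                     # assumed response shape
        if not page:
            break
        rows.extend(page)
        cursor = page[-1]["id"]
    if rows:
        save_watermark(max(r["updated_at"] for r in rows))  # advance the watermark
    return rows
```

The same watermark idea works for database sources: swap the HTTP call for a `SELECT * FROM orders WHERE updated_at > :watermark` query.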
In plain terms: extraction is about reading from sources that weren't designed to be read at scale, in a way that doesn't break them or miss data.
Staging area: extracted data goes first to a staging area — raw, unmodified. This is your source of truth for the raw data. If a transform has a bug, you can reprocess from staging without re-extracting from production. S3 buckets or schema-isolated "raw" tables in the warehouse are common staging areas.
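A small sketch of the staging idea, assuming a local directory stands in for an S3 bucket or a raw schema: each run lands the untouched extract at a date-partitioned path, so a buggy transform can later be reprocessed from staging without hitting production again.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def land_raw(records: list[dict], source: str, staging_root: str = "staging") -> Path:
    """Write extracted records verbatim to a date-partitioned staging path."""
    run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    path = Path(staging_root) / source / f"dt={run_date}" / "part-0000.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # One JSON object per line, exactly as extracted -- no cleaning at this stage, on purpose.
    path.write_text("\n".join(json.dumps(r) for r in records))
    return path

# e.g. land_raw(extract_orders(api_key), source="orders")
```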
Transformation is where the real work happens. Raw data from source systems is messy:
Data quality issues: null values where they shouldn't be, inconsistent formats ("Jan 1, 2024" vs "2024-01-01" vs 1704067200), duplicate records (same event logged twice), invalid values (age = -5, price = -$100).
Schema inconsistencies: Salesforce stores customer IDs as "ACCT-001234"; your orders database uses plain integer IDs like 42; both refer to the same customer but share no common key. Resolving this requires matching logic.
Unit and currency differences: prices in multiple currencies, weights in different units, timestamps in different timezones.
Business logic: "revenue" might mean gross revenue in the orders table, but your analysts want net revenue (after refunds). Computing this requires joining orders with refunds and applying the business definition.
Transformation tools:
SQL inside the warehouse (ELT pattern): `CREATE TABLE analytics.daily_revenue AS SELECT date_trunc('day', created_at), sum(amount_usd) FROM raw.orders GROUP BY 1`. Fast, maintainable, and version-controlled with dbt.
Python/PySpark pipelines: for complex transformations (ML feature engineering, text processing, fuzzy matching), Python scripts running on Spark or pandas.
In plain terms: transformation is like mise en place in cooking — getting all ingredients prepped, uniformly cut, and organized before cooking starts. The cooking (analysis) is fast because prep was done properly.
Concrete sketch: a fintech startup extracts transaction data from their Stripe account and card processor. Raw data has: amounts in USD cents (Stripe) and USD dollars (card processor), timestamps in UTC and US/Eastern, merchant names in raw form ("SBUX #12345" = Starbucks). Transformation: unify to USD dollars, convert to UTC, apply merchant normalization rules (regex + lookup table), deduplicate on transaction fingerprint. The result: a clean transactions table analysts can query without thinking about data quality.
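A hedged Python version of that sketch — the merchant rules, column names, and fingerprint definition are invented for illustration; a real pipeline would maintain a much larger lookup table and handle many more edge cases.

```python
import hashlib
import re

import pandas as pd

# Illustrative normalization rules: regex pattern -> canonical merchant name.
MERCHANT_RULES = [
    (re.compile(r"^SBUX\b", re.IGNORECASE), "Starbucks"),
    (re.compile(r"^AMZN\b", re.IGNORECASE), "Amazon"),
]

def normalize_merchant(raw: str) -> str:
    for pattern, canonical in MERCHANT_RULES:
        if pattern.search(raw):
            return canonical
    return raw.title()

def transform(stripe: pd.DataFrame, processor: pd.DataFrame) -> pd.DataFrame:
    stripe, processor = stripe.copy(), processor.copy()

    # Unify units: Stripe reports cents, the card processor reports dollars.
    stripe["amount_usd"] = stripe["amount_cents"] / 100
    processor["amount_usd"] = processor["amount_dollars"]

    # Unify timezones: Stripe is already UTC, the processor reports US/Eastern.
    stripe["created_at"] = pd.to_datetime(stripe["created_at"], utc=True)
    processor["created_at"] = (
        pd.to_datetime(processor["created_at"])
        .dt.tz_localize("US/Eastern")
        .dt.tz_convert("UTC")
    )

    txns = pd.concat([stripe, processor], ignore_index=True)

    # Normalize merchant names ("SBUX #12345" -> "Starbucks") via regex + lookup rules.
    txns["merchant"] = txns["merchant_raw"].map(normalize_merchant)

    # Deduplicate on a transaction fingerprint (merchant + amount + timestamp).
    key = (
        txns["merchant"]
        + "|" + txns["amount_usd"].round(2).astype(str)
        + "|" + txns["created_at"].astype(str)
    )
    txns["fingerprint"] = key.map(lambda s: hashlib.sha256(s.encode()).hexdigest())
    return txns.drop_duplicates(subset="fingerprint")
```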
The transformed data is loaded into the destination — almost always a data warehouse for analytics workloads:
Data warehouses are purpose-built for analytics:
Columnar storage: a query that aggregates the amount column reads only the amount column from disk — not every other column — which makes analytical aggregations 10–100× faster.
Massively parallel processing: queries fan out across many compute nodes, so aggregations over billions of rows come back in seconds rather than hours.
Load patterns: either a full refresh (rebuild the table from scratch every run — simple, fine for small tables) or an incremental load (append or merge only new and changed rows — preferred for large tables); an incremental merge is sketched below.
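A minimal sketch of an incremental (upsert-style) load, expressed as a MERGE statement run from Python. The table and column names are illustrative, `conn` stands for whatever connection your warehouse's Python client provides, and the exact MERGE syntax differs slightly between warehouses.

```python
# Idempotent incremental load: upsert freshly transformed rows into the analytics
# table, keyed on the transaction fingerprint, so re-running the job is safe.
MERGE_SQL = """
MERGE INTO analytics.transactions AS t
USING staging.transactions_new AS s
  ON t.fingerprint = s.fingerprint
WHEN MATCHED THEN UPDATE SET
  merchant   = s.merchant,
  amount_usd = s.amount_usd
WHEN NOT MATCHED THEN INSERT (fingerprint, merchant, amount_usd, created_at)
  VALUES (s.fingerprint, s.merchant, s.amount_usd, s.created_at)
"""

def load_incremental(conn) -> None:
    # `conn` is a DB-API style connection from your warehouse client (assumed).
    cur = conn.cursor()
    cur.execute(MERGE_SQL)
    conn.commit()
```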
Tiny example: Airbnb's data warehouse. Every booking, review, message, and host action is extracted from production services, transformed through hundreds of dbt models (cleaning, joining, denormalization), and loaded into their Hive/Presto warehouse. Data scientists query the warehouse to understand host supply vs guest demand, pricing optimization, and fraud signals — without touching production databases.
Traditional ETL transforms data before loading it, using a separate compute layer (Spark, custom Python). Modern cloud warehouses are so powerful that it's often better to load raw data first and transform inside the warehouse:
Why ELT is preferred now: cloud warehouses are powerful and cheap enough to do the heavy lifting in plain SQL; the raw data sits in staging, so a buggy transform can be re-run without touching production sources again; and SQL plus dbt is accessible to far more of the team than Spark or custom Python pipelines.
dbt workflow:
```sql
-- models/marts/daily_revenue.sql
SELECT
    date_trunc('day', created_at)     AS date,
    sum(amount_usd)                   AS gross_revenue,
    sum(amount_usd) - sum(refund_usd) AS net_revenue,
    count(*)                          AS order_count
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1
```

dbt compiles this to `CREATE TABLE marts.daily_revenue AS SELECT ...`, runs it in Snowflake, and tests the output (revenue should be > 0, date should not be null).
Modern ELT stack: Fivetran or Airbyte (extract and load) + Snowflake/BigQuery (warehouse) + dbt (transform) + Tableau/Metabase (BI). Each component is best-in-class, composable, and well-supported.
| Property | Traditional ETL | Modern ELT |
|---|---|---|
| Transform location | Separate compute (Spark) | Inside the warehouse (SQL) |
| Storage cost | Lower — only clean data stored | Higher — raw data also stored |
| Reprocessing | Re-run pipeline from source | Re-run SQL from raw staging |
| Skill required | Python/Spark | SQL + dbt |
| Flexibility | High — Python is unlimited | SQL has limits (ML, text processing) |
| When to use | Complex non-SQL transforms | Standard data engineering |
If you're building any product with more than a handful of data sources, you need a data pipeline. Start with ELT: use Fivetran or Airbyte to sync your production databases and SaaS tools to BigQuery or Snowflake; use dbt to write your transformation logic in SQL. This setup can be operational in a week and will handle petabytes when you need it. The second most important decision: use Airflow (or Prefect, or Dagster) for orchestration — it handles scheduling, retries, and alerting on failures, and gives you one place to see the dependencies across your entire pipeline DAG; a minimal DAG is sketched below.
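A minimal sketch of what that orchestration looks like in Airflow (2.x-style API; the `schedule` argument was `schedule_interval` before 2.4). The `sync_sources` callable and the dbt project path are hypothetical placeholders for your own extract-and-load step and dbt project.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from my_pipeline.extract import sync_sources   # hypothetical: your extract-and-load step

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                          # run once per day
    catchup=False,
    default_args={
        "retries": 2,                           # automatic retries on failure
        "retry_delay": timedelta(minutes=5),
        "email_on_failure": True,               # basic alerting
    },
) as dag:
    extract_load = PythonOperator(task_id="extract_load", python_callable=sync_sources)
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test --project-dir /opt/dbt")

    extract_load >> dbt_run >> dbt_test         # the pipeline DAG: EL, then T, then tests
```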
Next: MapReduce — the programming model that showed how to process petabytes of data by distributing computation across thousands of machines.