Thursday, March 26, 2026

STEP-BY-STEP: How dbt Builds the DAG

 

1. Parse files
2. Identify refs/sources
3. Build nodes
4. Create dependency graph
5. Validate DAG
6. Topological sort
7. Compile SQL
8. Execute DAG
9. Render UI graph
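Steps 4–6 are the heart of the process: dbt builds a dependency graph from each model's refs and topologically sorts it. A minimal sketch in Python with hypothetical model names (the stdlib `graphlib` implements exactly this ordering):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: node -> the nodes it ref()s (its parents).
deps = {
    "raw.customers": set(),
    "raw.orders": set(),
    "stg_customers": {"raw.customers"},
    "stg_orders": {"raw.orders"},
    "dim_customers": {"stg_customers"},
    "fct_orders": {"stg_orders", "dim_customers"},
}

# static_order() yields a node only after all of its parents --
# the execution order dbt derives in step 6 above.
order = list(TopologicalSorter(deps).static_order())
print(order)
```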


dbt manifest.json internals

A deep dive into dbt's manifest.json: this is one of the most important internal artifacts in dbt, and the brain of your project's DAG.

What is manifest.json?

manifest.json is a compiled metadata file, written to the target/ directory, generated when you run:

</> Bash

dbt run
dbt compile
dbt docs generate

📌 It contains:

  • All models, sources, tests, macros
  • Dependency graph (DAG)
  • Compiled SQL
  • Column metadata
  • Lineage relationships

👉 Think of it as:

dbt project → parsed → compiled → manifest.json (single source of truth)

📦 High-Level Structure

{
  "metadata": {},
  "nodes": {},
  "sources": {},
  "macros": {},
  "parent_map": {},
  "child_map": {},
  "docs": {},
  "exposures": {},
  "metrics": {}
}
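A quick, practical way to get familiar with this structure is to load the file (it lives at `target/manifest.json` after a run) and inspect the top-level keys. A sketch using an inline stand-in for the real file:

```python
import json

# Stand-in for json.load(open("target/manifest.json")) -- same top-level shape.
manifest = json.loads("""{
  "metadata": {"dbt_version": "1.7.0", "project_name": "retail_dbt"},
  "nodes": {"model.retail_dbt.stg_customers": {"resource_type": "model"}},
  "sources": {}, "macros": {}, "parent_map": {}, "child_map": {},
  "docs": {}, "exposures": {}, "metrics": {}
}""")

# Count models -- a common first step when debugging a project.
models = [k for k, v in manifest["nodes"].items()
          if v["resource_type"] == "model"]
print(manifest["metadata"]["project_name"], len(models))
```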

๐Ÿ” 1. metadata Section

๐Ÿ”น What it contains

"metadata": {
"dbt_version": "1.x.x",
"project_name": "retail_dbt",
"generated_at": "timestamp",
"adapter_type": "snowflake"
}

🔹 Why it matters

  • Tracks dbt version compatibility
  • Helps debugging pipeline issues

🧩 2. nodes (CORE of dbt)

This is the most important section.

🔹 What are nodes?

Everything dbt builds:

  • models
  • tests
  • seeds
  • snapshots

🔹 Example: Model Node

"model.retail_dbt.stg_customers": {
  "resource_type": "model",
  "name": "stg_customers",
  "database": "RETAIL_DB",
  "schema": "STAGING",
  "alias": "stg_customers",
  "raw_code": "SELECT * FROM {{ source('raw','customers_raw') }}",
  "compiled_code": "SELECT * FROM RETAIL_DB.RAW.CUSTOMERS_RAW",
  "depends_on": { "nodes": ["source.retail_dbt.raw.customers_raw"] },
  "config": { "materialized": "view" }
}

🔹 Key Fields Explained

raw_code

  • Your original SQL with Jinja

compiled_code

  • Final SQL sent to Snowflake

depends_on

  • Defines DAG edges

config

  • Materialization (view/table/incremental)

๐ŸŒ 3. sources

๐Ÿ”น Example

"source.retail_dbt.raw.customers_raw": { "database": "RETAIL_DB", "schema": "RAW", "identifier": "CUSTOMERS_RAW" }

🔹 Purpose

  • Maps dbt → physical tables
  • Enables lineage tracking

🧬 4. parent_map (UPSTREAM)

🔹 Example

"model.retail_dbt.stg_customers": [
  "source.retail_dbt.raw.customers_raw"
]

🔹 Meaning

  • Who feeds into this model

🔗 5. child_map (DOWNSTREAM)

🔹 Example

"model.retail_dbt.stg_customers": [
  "model.retail_dbt.dim_customers"
]

🔹 Meaning

  • Who depends on this model

🧠 DAG Insight

Together:

parent_map + child_map = full DAG graph

This powers:

  • dbt lineage UI
  • model execution order
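For example, a few lines of Python can answer "what breaks if this model changes?" by walking `child_map` transitively (hypothetical node names mirroring the examples above):

```python
# Hypothetical child_map slice, as found in manifest.json.
child_map = {
    "source.retail_dbt.raw.customers_raw": ["model.retail_dbt.stg_customers"],
    "model.retail_dbt.stg_customers": ["model.retail_dbt.dim_customers"],
    "model.retail_dbt.dim_customers": [],
}

def downstream(node, graph):
    """All transitive descendants -- what a change to `node` can affect."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

print(sorted(downstream("model.retail_dbt.stg_customers", child_map)))
```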

🧰 6. macros

🔹 Example

"macro.dbt_utils.generate_surrogate_key": {
  "name": "generate_surrogate_key",
  "macro_sql": "md5(concat(...))"
}

🔹 Purpose

  • Stores reusable logic
  • Used during compilation

🧪 7. tests

Tests are stored inside the nodes section.

🔹 Example

"test.retail_dbt.unique_customer_id": {
  "resource_type": "test",
  "depends_on": { "nodes": ["model.retail_dbt.stg_customers"] }
}

🔹 Purpose

  • Defines validation logic
  • Linked to models

📚 8. docs & exposures

🔹 Docs

  • Column descriptions
  • Model descriptions

🔹 Exposures

"exposure.dashboard.sales_dashboard": {
  "type": "dashboard",
  "depends_on": { "nodes": ["model.retail_dbt.fact_orders"] }
}

👉 Connects dbt → BI tools like:

  • Tableau
  • Power BI

HOW dbt USES manifest.json INTERNALLY

Step-by-step:

1. Parse Phase

  • Reads SQL + YAML → builds nodes

2. Compile Phase

  • Resolves:
    • ref()
    • source()
    • macros

3. DAG Build

  • Uses:
    • depends_on
    • parent_map

4. Execution Planning

  • Orders models correctly

5. Run Phase

  • Executes compiled_code

6. Docs Generation

  • Uses manifest for lineage graph

🎯 Key Takeaways

  • manifest.json = central brain of dbt
  • Stores:
    • models
    • dependencies
    • compiled SQL
  • Powers:
    • execution
    • lineage
    • CI/CD
  • Essential for:
    • debugging
    • optimization
    • orchestration

production-level walkthrough

Below is a deep, production-level walkthrough of each step in a dbt pipeline, from raw → staging → incremental → SCD → marts → testing → deployment, including:

  • dbt internal components involved
  • SQL / Python role
  • Why the step exists
  • How Git supports each step

STEP 1: RAW DATA EXTRACT (Sources)

🔹 What happens

  • Data lands in Snowflake raw schema via:
    • COPY INTO / Snowpipe
  • dbt does NOT ingest, but registers sources

🔹 dbt Internal Components

  • Source Config (sources.yml)
  • Parser
  • Adapter (Snowflake adapter)

🔹 SQL / Python Role

SELECT * FROM {{ source('raw','customers_raw') }}

  • SQL references raw tables
  • Python (optional): external ingestion pipelines

🔹 Why required

  • Establish trusted entry point
  • Enables lineage tracking
  • Prevents direct table coupling

🔹 Git Role

  • Version control for:
    • sources.yml
  • Tracks schema evolution
  • Enables rollback if source changes break pipeline.

STEP 2: STAGING LAYER

🔹 What happens

  • Clean, standardize, rename columns
  • Apply light transformations

🔹 dbt Internal Components

  • Model Parser
  • Jinja Engine (macros)
  • Compiler
  • Execution Engine

🔹 SQL Example

SELECT
  customer_id,
  LOWER(email) AS email,
  TRIM(name) AS name
FROM {{ source('raw','customers_raw') }}

🔹 Why required

  • Removes inconsistencies
  • Creates reusable canonical layer

🔹 Git Role

  • Tracks column renaming decisions
  • Enables code review for transformations
  • Collaboration across teams

STEP 3: INCREMENTAL MODELS

🔹 What happens

  • Load only new or changed data
  • Avoid full table rebuilds

🔹 dbt Internal Components

  • Materialization Engine (incremental)
  • Macro Engine (is_incremental())
  • Adapter-specific SQL generation

🔹 SQL Example

SELECT *
FROM {{ ref('stg_customers') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
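Conceptually, the `{% if is_incremental() %}` block is resolved at compile time: the WHERE filter is kept on incremental runs and dropped on the first build or a `--full-refresh`. A simplified Python sketch of that behaviour (the real dbt macro also checks that the target table already exists):

```python
def render_incremental(sql_body: str, incremental: bool, filter_sql: str) -> str:
    # Mimics what dbt's Jinja pass does with {% if is_incremental() %}:
    # the filter survives only on incremental runs.
    return sql_body + (" " + filter_sql if incremental else "")

base = "SELECT * FROM stg_customers"
flt = "WHERE updated_at > (SELECT MAX(updated_at) FROM dim_customers)"

print(render_incremental(base, False, flt))  # first run: full select
print(render_incremental(base, True, flt))   # later runs: filtered select
```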

🔹 Why required

  • Improves performance
  • Reduces compute cost in Snowflake

🔹 Git Role

  • Tracks logic changes for incremental filters
  • Prevents accidental full refresh bugs
  • Enables safe experimentation via branches

STEP 4: SCD TYPE 2 (HISTORICAL TRACKING)

🔹 What happens

  • Tracks historical changes in dimensions
  • Maintains:
    • surrogate keys
    • effective dates
    • current flag

🔹 dbt Internal Components

  • Custom Macros (SCD logic)
  • Incremental + MERGE strategy
  • Dependency Graph (ref)

🔹 SQL Example (MERGE)

MERGE INTO dim_customers t
USING stg_customers s
  ON t.customer_id = s.customer_id AND t.is_current = TRUE

WHEN MATCHED AND t.email <> s.email THEN
  UPDATE SET is_current = FALSE, effective_end_date = CURRENT_TIMESTAMP

WHEN NOT MATCHED THEN
  INSERT (...)

🔹 Why required

  • Enables:
    • historical reporting
    • audit tracking
    • slowly changing dimensions

🔹 Git Role

  • Critical for:
    • tracking SCD logic changes
    • auditing business rule evolution
  • Enables peer review for complex logic

STEP 5: DATA MART (GOLD LAYER)

🔹 What happens

  • Build fact and dimension tables
  • Optimize for BI queries

🔹 dbt Internal Components

  • Ref Graph (DAG builder)
  • Materializations (table/view)
  • Query Planner

🔹 SQL Example

SELECT
  o.order_id,
  c.customer_sk,
  o.order_amount
FROM {{ ref('fact_orders') }} o
JOIN {{ ref('dim_customers') }} c
  ON o.customer_sk = c.customer_sk

🔹 Why required

  • Serves:
    • dashboards
    • analytics
  • Improves performance via denormalization

🔹 Git Role

  • Tracks business logic
  • Maintains consistency across metrics
  • Enables versioned analytics definitions

STEP 6: TESTING & VALIDATION

🔹 What happens

  • Validate data quality:
    • uniqueness
    • null checks
    • referential integrity

🔹 dbt Internal Components

  • Test Runner
  • Schema YAML parser
  • Assertion engine

🔹 YAML Example

columns:
  - name: customer_id
    tests:
      - not_null
      - unique
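Under the hood, each generic test compiles to a SELECT that returns failing rows; the test passes when zero rows come back. A simplified sketch of how the `unique` test's SQL is templated (illustrative, not dbt's exact macro):

```python
# Simplified template for the built-in `unique` test: any returned
# row is a duplicate key, i.e. a test failure.
UNIQUE_TEST = (
    "SELECT {col} FROM {table} "
    "WHERE {col} IS NOT NULL "
    "GROUP BY {col} HAVING COUNT(*) > 1"
)

sql = UNIQUE_TEST.format(col="customer_id", table="stg_customers")
print(sql)
```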

🔹 Why required

  • Prevents bad data propagation
  • Ensures trust in analytics

🔹 Git Role

  • Tracks test coverage
  • Enforces quality via CI/CD
  • Prevents bad merges

STEP 7: DOCUMENTATION & DEPLOYMENT

🔹 What happens

  • Generate lineage docs
  • Deploy models via jobs

🔹 dbt Internal Components

  • Docs Generator
  • Manifest.json
  • Run Artifacts
  • DAG Renderer

🔹 Commands

dbt docs generate
dbt docs serve

🔹 Why required

  • Provides:
    • lineage graph
    • column-level documentation
  • Helps onboarding & governance

🔹 Git Role

  • Stores:
    • documentation YAML
  • Enables:
    • versioned lineage
    • auditability

Internal Flow inside dbt:

  1. Parser
    • Reads SQL + YAML
  2. Jinja/Macro Engine
    • Expands dynamic logic
  3. Compiler
    • Converts to raw SQL
  4. Adapter (Snowflake)
    • Optimizes SQL
  5. Execution Engine
    • Runs queries
  6. Test Engine
    • Validates output
  7. Docs Generator
    • Builds lineage + metadata

RAW → SOURCE → STAGING → INCREMENTAL → SCD → MART → TEST → DOCS

🚀 Key Takeaways

  • dbt is a compiler + orchestrator, not an ingestion tool
  • SQL is the primary transformation layer
  • Python is optional (advanced models)
  • Git enables:
    • collaboration
    • versioning
    • CI/CD
  • Each step builds trust, performance, and scalability

Inside dbt: How data moves from Raw to Refined

 



dbt internal components: dbt models using SQL or Python

This diagram presents a high-level architectural blueprint of how dbt Core acts as the orchestration and transformation layer between raw source data and a formalized data warehouse. While the diagram makes it look like data "flows through" dbt, the fundamental takeaway for an architect is that dbt only handles the logic and orchestration (the control plane); the data processing (the data plane) is entirely pushed down to the target warehouse.

1. Source Data (Raw)

  • The Ingestion Layer: This represents the raw, "bronze-layer" data that has been landed in your cloud data warehouse (e.g., Snowflake, BigQuery) by your ELT tools (like Fivetran, Airbyte, or custom Python scripts).

  • SME Insight: From a dbt perspective, these are entirely external tables that dbt does not manage. We only interact with them through {{ source() }} macros, which allow us to define their location and properties (like freshness) in YAML.

  • Architectural Role: It is the foundational substrate for all downstream transformation. The data must be accessible by the dbt adapter’s credentials.

2. dbt Project & Model Definitions (.sql & .py)

  • The Blueprint: This is the developer's workspace—the codebase managed in Git. It contains all the instructions for how the data should be modeled.

  • SQL Model: The primary declarative method. It uses Select Statements combined with Jinja templating to create modular, reusable logic. config() macros at the top define how the database object will be materialized (as a table, view, etc.).

  • Python Model (SME Note): Introduced more recently, these models allow data teams to leverage Python libraries (like pandas or scikit-learn) for operations that are impossible or highly complex in SQL (like machine learning inference or advanced statistical transformations). They are executed in the target warehouse's native Python runtime (e.g., Snowflake Snowpark, or Dataproc for BigQuery).

  • Principal Engineer View: The dbt project is where the principles of software engineering (version control, modularity, DRY code) are applied to data engineering.

3. Manifest/Parser

  • The Project Cataloger: The Manifest/Parser is the "brain" that reads the entire project. It scans every .sql, .py, and .yml file in your repository to build an in-memory inventory of your models, tests, seeds, sources, and snapshots.

  • Dependency Resolution (DAG): Its critical output is the creation of the Directed Acyclic Graph (DAG). By reading the {{ ref() }} and {{ source() }} macros, it determines the order in which models must be built. If model_B refs model_A, it knows to build model_A first.

  • Validation: The parser validates the syntax, project structure, and config settings to ensure the project is "runnable." It’s the gatekeeper. If the parser finds an error, dbt stops.

4. Jinja Renderer

  • The Dynamic Engine: The parser identified where {{ ref() }} was used, but the Jinja Renderer is what actually compiles it. This is where the magic of "declarative SQL" comes from.

  • Environment Context: The renderer replaces abstract references like {{ ref('customers') }} with the fully qualified, environment-specific database paths (e.g., prod_db.core.customers in production, or dev_db.alice_schema.customers in development). This single feature makes code-switching between environments seamless.

  • Logic Execution: It processes {% if %}, {% for %}, and custom macros to dynamically generate different SQL or configurations based on input variables or environment settings.

5. Compilation Engine

  • The Transpiler: Now that Jinja has replaced the abstract terms, the Compilation Engine takes over. Its job is to generate the final, pure-dialect SQL that is ready to be executed.

  • Wrapping the Logic: This engine takes your SELECT query and wraps it in the necessary DDL/DML, depending on the materialization setting. For a materialized='table', it will wrap your SELECT in a CREATE OR REPLACE TABLE AS.... For an incremental model, it will generate a complex MERGE or INSERT statement based on the logic you defined.

  • Adapter Awareness: The compilation engine uses the Target DB Adapters (see below) to know exactly what SQL dialect (Snowflake, BigQuery, Postgres) to generate.

6. Artifacts Generator

  • The Persistent Memory: As dbt runs, it saves its state and knowledge into two main files:

    • manifest.json: A full, JSON-readable dump of everything the Manifest/Parser found. This is dbt’s "project-at-rest" state. It is used downstream by other tools and for State-Based Execution, allowing dbt to run only the models that have changed since the last run.

    • catalog.json: A JSON file generated after a dbt docs generate command. It contains the schema, column definitions, and data types from the target warehouse for all managed models.

  • Metadata Foundation: These artifacts are essential for dbt’s automated documentation website and for integration with modern data catalogs.
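State-based execution works by comparing two manifests. A simplified sketch: hash each model's `raw_code` in the production manifest and the current one, and rebuild only what differs (hypothetical models; real dbt also compares configs and other checksums):

```python
import hashlib

def checksum(sql: str) -> str:
    # Fingerprint of a model's raw_code, as stored per node in the manifest.
    return hashlib.md5(sql.encode()).hexdigest()

# raw_code from a previous production manifest vs. the current branch.
prod = {"stg_orders": checksum("SELECT * FROM raw.orders"),
        "stg_customers": checksum("SELECT * FROM raw.customers")}
curr = {"stg_orders": checksum("SELECT id, amount FROM raw.orders"),
        "stg_customers": checksum("SELECT * FROM raw.customers")}

# Only models whose fingerprint changed need a rebuild (state:modified).
modified = [m for m in curr if curr[m] != prod.get(m)]
print(modified)
```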

7. Execution Engine (dbt-core)

  • The Conductor: The Execution Engine is the final orchestrator. It doesn't perform data work itself, but it knows when and where that work should happen.

  • Job Management: It takes the generated SQL/Python from the Compilation Engine and dispatches it to the cloud data warehouse via the established database connection. It manages the threads (concurrency), sending parallel queries to the warehouse when the DAG allows.

  • SME Insight: The Execution Engine handles the lifecycle of the job. It establishes the connection, starts the query, waits for the result, records performance metrics (start time, end time, status), and handles the commit/rollback logic on completion or error. It is the component that makes dbt a full-fledged deployment coordinator.
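The thread management described here can be sketched as dispatching models in waves: anything whose parents have all finished may run in parallel, up to the configured thread count (hypothetical four-model DAG):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical DAG: model -> its parents.
deps = {"stg_a": [], "stg_b": [], "dim_x": ["stg_a"], "fct_y": ["stg_a", "stg_b"]}
done, order = set(), []

def run_model(name):
    time.sleep(0.01)  # stand-in for dispatching SQL to the warehouse
    return name

# Dispatch in waves: any model whose parents are all done may run in parallel,
# bounded by max_workers (dbt's `threads` setting).
with ThreadPoolExecutor(max_workers=4) as pool:
    while len(done) < len(deps):
        ready = [m for m in deps
                 if m not in done and all(p in done for p in deps[m])]
        for name in pool.map(run_model, ready):
            done.add(name)
            order.append(name)

print(order)
```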

8. Target DB Adapters (Snowflake, BigQuery, etc.)

  • The Translation Layer: This is the abstraction layer that allows dbt Core to remain dialect-agnostic. The adapters (e.g., dbt-snowflake, dbt-bigquery) translate abstract dbt commands into the specific API calls and SQL dialects required by that database.

  • Database-Specific Operations: The adapter knows that a table creation statement in Snowflake is different from BigQuery. It handles the nuances of connection authentication, data types, and transactional control.

9. Transformed Data Models & Transformation Execution

  • The Destination: This is the finalized, modeled data that is now ready for analytics (the "gold-layer").

  • Pushed-Down Execution: This is the single most critical concept. All data movement and processing occur entirely inside the Target Cloud Data Warehouse.

    • SQL models are executed as CREATE TABLE AS... or INSERT/MERGE SQL commands directly against the warehouse engine.

    • Python models are packaged and sent to the warehouse’s internal runtime (like Snowpark UDFs) where they execute in specialized, serverless compute containers close to the data.

  • Output: The result is high-quality, trusted tables like dim_customers and fct_orders which are query-ready for your BI tools.

Wednesday, March 25, 2026

dbt_codegen package

 What is dbt_codegen (Quick Context)

dbt_codegen is a macro package that auto-generates repetitive dbt code (SQL + YAML) by introspecting your warehouse metadata.

👉 It reduces:

  • Manual boilerplate
  • Human error in YAML/docs/tests
  • Time-to-model for new datasets

Core Use Cases (with Architecture Thinking)

1) 🧱 Generate Source YAML from Raw Tables

Why it matters

  • Standardizes ingestion layer
  • Enables lineage + freshness + testing

Example:

{{ codegen.generate_source(
    schema_name='raw',
    database_name='analytics_db'
) }}

Output (auto-generated)

</> YAML
version: 2

sources:
  - name: raw
    tables:
      - name: customers
      - name: orders

Architect Insight

  • Use this when onboarding new schemas from ingestion tools (Fivetran, Airbyte)
  • Combine with:
    • freshness tests
    • source-level SLAs

2) Generate Base Models (Staging Layer)

Why it matters

  • Enforces consistent staging patterns
  • Accelerates Medallion architecture (Bronze → Silver)

Example

</> SQL
{{ codegen.generate_base_model(
    source_name='raw',
    table_name='orders'
) }}

Output

</> SQL
select
    id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}

Architect Insight

  • Pair with naming convention: stg_<source>__<table>
  • Add:
    • column casting
    • null handling
    • deduplication logic

3) Generate Model YAML (Docs + Tests)

Why it matters

  • Ensures every model is documented and testable
  • Critical for governance + data contracts

Example

</> SQL
{{ codegen.generate_model_yaml(
    model_names=['stg_orders']
) }}

Output

</> YAML
version: 2

models:
  - name: stg_orders
    columns:
      - name: id
        tests:
          - not_null
          - unique

Architect Insight

  • Use this to enforce:
    • Data quality SLAs
    • Contract-first modeling
  • Extend with:
    • accepted_values
    • relationships

4) Generate Column Lists Dynamically

Why it matters

  • Avoids SELECT *
  • Prevents breakage when schema evolves

Example

</> SQL
select
    {{ codegen.get_columns_in_relation(
        ref('stg_orders')
    ) }}
from {{ ref('stg_orders') }}

Output

</> SQL
select id, customer_id, order_date, amount

Architect Insight

  • Combine with:
    • incremental models
    • snapshot logic
  • Helps with schema drift handling

5) 🧪 Generate Generic Tests Quickly

Why it matters

  • Accelerates data quality adoption
Example

</> SQL
{{ codegen.generate_model_yaml(
    model_names=['fct_orders'],
    include_tests=True
) }}

Architect Insight

  • Use for:
    • onboarding new teams
    • enforcing platform-wide testing standards

6) 🔄 Automate Documentation at Scale

Why it matters

  • Documentation is often neglected → dbt_codegen fixes that

Example Workflow

  1. Run macro
  2. Paste into .yml
  3. Add descriptions later

Architect Insight

  • Integrate into CI/CD:
    • auto-generate → PR → review

7) 🧩 Metadata-Driven Development

Why it matters

  • Moves toward declarative analytics engineering

Example Pattern

  • Introspect schema → generate models → enrich logic

Architect Insight

  • Combine with:
    • information_schema queries
    • data catalogs
    • AI-assisted model generation

🔷 How dbt_codegen Works in dbt Cloud (Internals)

Step-by-step Flow

1) Metadata Introspection

  • Queries warehouse system tables
    (information_schema.columns)

2) Macro Execution

  • Jinja macros build:
    • SQL
    • YAML
    • Tests

3) Output Rendering

  • Output appears in:
    • dbt Cloud IDE
    • CLI logs

4) Manual/Automated Integration

  • Developer copies OR pipelines inject into repo
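Steps 1–3 above can be sketched end-to-end: take column metadata as it would come back from `information_schema.columns`, group it by table, and render source YAML (hypothetical rows; codegen's real macros do this in Jinja):

```python
# Hypothetical rows from querying information_schema.columns (step 1);
# codegen's Jinja macros fold these into source YAML (step 2).
columns = [
    ("raw", "customers", "id"),
    ("raw", "customers", "email"),
    ("raw", "orders", "id"),
]

tables = {}
for schema, table, column in columns:
    tables.setdefault(table, []).append(column)

# Render a minimal sources block (step 3's output).
yaml_lines = ["version: 2", "", "sources:", "  - name: raw", "    tables:"]
for table in sorted(tables):
    yaml_lines.append(f"      - name: {table}")

print("\n".join(yaml_lines))
```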



End-to-End dbt ecosystem

 





This diagram visualizes the complete end-to-end dbt ecosystem incorporating all the technologies we discussed (Snowflake, Databricks, Fabric, Python models, Airflow, and advanced CI/CD).

The flow moves from left (Sources) to right (Marts and Delivery). Here is a detailed breakdown of each architectural stage and the data flow.

Data Sources & The Staging Layer (The DLH/W)

This is the intake and initial storage area. Raw data arrives from operational systems via Airflow-orchestrated ingestion tasks (using Fivetran/Airbyte) and lands in the unified Data Lakehouse/Warehouse (marked as a large cube supporting Snowflake Horizon, Databricks Unity, and MS Fabric OneLake).

The diagram illustrates two critical concepts we’ve covered:

  • Layered Storage: The warehouse is partitioned into a Bronze/Raw zone (raw data) and a Silver/Staging zone (initial dbt stg_ clean-up models).

  • The Codegen Loop: A specialized codegen operation is shown running outside the main DAG execution. It queries the data warehouse INFORMATION_SCHEMA (the process we visualized in image_0.png) and automatically generates the YAML source definitions and SQL base models. This workflow accelerates development time and reduces manual errors when onboarding new datasets.

The Transformation & Orchestration Layer (The Core DAG)

This is the heart of the modern dbt solution. The execution is segmented into the internal steps crucial for production resilience.

The Internal dbt Lifecycle:

  1. Model Parsing: dbt-core (or dbt Cloud) builds the dependency graph (DAG) by interpreting all SQL ref() macros.

  2. Compile & Manifest: The manifest.json is generated, which is the artifact required by Airflow Cosmos for dynamic task generation.

  3. Model Execution (dbt Build): The DAG is executed against the warehouse. The diagram explicitly displays polyglot engineering (SQL mixed with Python):

    • SQL Path: The execution of standard relational models (int_campaign_performance flowing into fct_ad_performance) which are pushed down to the warehouse engine.

    • Python Path: The execution of a complex model (py_ml_customer_segmentation). This triggers a distinct Snowpark Python Call (in Snowflake) or a PySpark Call (in Databricks/Fabric), processing the statistics and saving the result.

  4. Test & Governance: Once built, data quality assertions are run (dbt-expectations) and YAML constraints (Schema Contracts) are enforced.

Orchestration & Governance Sync:

  • Airflow DAG: The Airflow orchestration (via Astronomer Cosmos provider) maps directly to the dbt steps: Parse -> Compile -> Execute (Dynamic Task).

  • Catalog Sync: Successful dbt builds are synchronized with the respective warehouse governance tools, such as Databricks Unity Catalog, ensuring fine-grained access control is applied to the newly created data marts.

Production & Delivery

This column represents the finalized, production-grade state. The built, tested, and governed models are available in the Gold/Marts presentation schema. The data is now available to consumption engines for different business use cases: BI Dashboards, ML models (for inference), and Operational Sync (Reverse ETL to push data back to Salesforce, etc.).

Advanced Platform Engineering: CI/CD Pipeline (Bottom Right)

This flow details how a Principal Engineer ensures data platform stability:

  1. Commit Trigger: A developer pushes code (see image_1.png).

  2. Slim CI Build: The pipeline runs only the modified models and their downstream dependents. It uses dbt build with the --defer and --state flags to "defer" to the production schema for unmodified upstream tables, saving execution cost.

  3. Blue/Green Deployment: For the final release, a Zero-Copy Clone (Snowflake specific) is created to instantly swap production and staging environments, allowing zero-downtime, safe rollbacks.