Thursday, March 26, 2026
STEP-BY-STEP: How dbt Builds the DAG
dbt manifest.json internals
A deep dive into dbt's manifest.json: one of the most important internal artifacts in dbt, and the brain of your project DAG.
What is manifest.json?
manifest.json is a compiled metadata file generated when you run:
dbt run
dbt compile
dbt docs generate
It contains:
- All models, sources, tests, macros
- Dependency graph (DAG)
- Compiled SQL
- Column metadata
- Lineage relationships
Think of it as:
dbt project → parsed → compiled → manifest.json (single source of truth)
High-Level Structure
{
"metadata": {},
"nodes": {},
"sources": {},
"macros": {},
"parent_map": {},
"child_map": {},
"docs": {},
"exposures": {},
"metrics": {}
}
1. metadata Section
What it contains
"metadata": {
"dbt_version": "1.x.x",
"project_name": "retail_dbt",
"generated_at": "timestamp",
"adapter_type": "snowflake"
}
Why it matters
- Tracks dbt version compatibility
- Helps debugging pipeline issues
2. nodes (CORE of dbt)
This is the most important section.
What are nodes?
Everything dbt builds:
- models
- tests
- seeds
- snapshots
Example: Model Node
"model.retail_dbt.stg_customers": {
  "resource_type": "model",
  "name": "stg_customers",
  "database": "RETAIL_DB",
  "schema": "STAGING",
  "alias": "stg_customers",
  "raw_code": "SELECT * FROM {{ source('raw','customers_raw') }}",
  "compiled_code": "SELECT * FROM RETAIL_DB.RAW.CUSTOMERS_RAW",
  "depends_on": { "nodes": ["source.retail_dbt.raw.customers_raw"] },
  "config": { "materialized": "view" }
}
Key Fields Explained
✅ raw_code: your original SQL with Jinja
✅ compiled_code: the final SQL sent to Snowflake
✅ depends_on: defines DAG edges
✅ config: materialization (view/table/incremental)
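These fields can also be inspected programmatically. A minimal sketch (the values below are illustrative, mirroring the example node above, not read from a real project):

```python
# A pared-down manifest.json fragment; values are illustrative,
# not from a real project.
manifest = {
    "nodes": {
        "model.retail_dbt.stg_customers": {
            "resource_type": "model",
            "raw_code": "SELECT * FROM {{ source('raw','customers_raw') }}",
            "compiled_code": "SELECT * FROM RETAIL_DB.RAW.CUSTOMERS_RAW",
            "depends_on": {"nodes": ["source.retail_dbt.raw.customers_raw"]},
            "config": {"materialized": "view"},
        }
    }
}

# Pull out the fields dbt uses to plan and run this model.
node = manifest["nodes"]["model.retail_dbt.stg_customers"]
print(node["config"]["materialized"])   # view
print(node["depends_on"]["nodes"])      # the DAG edge(s) feeding this model
```

In a real project you would `json.load()` the file from `target/manifest.json` instead of building the dict inline.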
3. sources
Example
"source.retail_dbt.raw.customers_raw": {
  "database": "RETAIL_DB",
  "schema": "RAW",
  "identifier": "CUSTOMERS_RAW"
}
Purpose
- Maps dbt → physical tables
- Enables lineage tracking
4. parent_map (UPSTREAM)
Example
"model.retail_dbt.stg_customers": [
  "source.retail_dbt.raw.customers_raw"
]
Meaning
- Who feeds into this model
5. child_map (DOWNSTREAM)
Example
"model.retail_dbt.stg_customers": [
  "model.retail_dbt.dim_customers"
]
Meaning
- Who depends on this model
DAG Insight
Together:
parent_map + child_map = full DAG graph
This powers:
- dbt lineage UI
- model execution order
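Because parent_map lists each node's upstream dependencies, execution order falls out of a standard topological sort. A sketch in Python (node names are illustrative, matching the examples above):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# parent_map exactly as it appears in manifest.json:
# node -> list of upstream (parent) nodes.
parent_map = {
    "source.retail_dbt.raw.customers_raw": [],
    "model.retail_dbt.stg_customers": ["source.retail_dbt.raw.customers_raw"],
    "model.retail_dbt.dim_customers": ["model.retail_dbt.stg_customers"],
}

# TopologicalSorter takes a predecessor mapping, which is exactly
# what parent_map stores, so execution order falls out directly.
order = list(TopologicalSorter(parent_map).static_order())
print(order)  # sources first, then staging, then marts
```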
6. macros
Example
"macro.dbt_utils.generate_surrogate_key": {
  "name": "generate_surrogate_key",
  "macro_sql": "md5(concat(...))"
}
Purpose
- Stores reusable logic
- Used during compilation
7. tests
Stored inside nodes.
Example
"test.retail_dbt.unique_customer_id": {
  "resource_type": "test",
  "depends_on": { "nodes": ["model.retail_dbt.stg_customers"] }
}
Purpose
- Defines validation logic
- Linked to models
8. docs & exposures
Docs
- Column descriptions
- Model descriptions
Exposures
"exposure.dashboard.sales_dashboard": {
  "type": "dashboard",
  "depends_on": { "nodes": ["model.retail_dbt.fact_orders"] }
}
Connects dbt → BI tools like:
- Tableau
- Power BI
HOW dbt USES manifest.json INTERNALLY
Step-by-step:
1. Parse Phase
- Reads SQL + YAML → builds nodes
2. Compile Phase
- Resolves:
  - ref()
  - source()
  - macros
3. DAG Build
- Uses:
  - depends_on
  - parent_map
4. Execution Planning
- Orders models correctly
5. Run Phase
- Executes compiled_code
6. Docs Generation
- Uses manifest for lineage graph
Key Takeaways
- manifest.json = central brain of dbt
- Stores:
- models
- dependencies
- compiled SQL
- Powers:
- execution
- lineage
- CI/CD
- Essential for:
- debugging
- optimization
- orchestration
production-level walkthrough
Below is a deep, production-level walkthrough of each step in your dbt pipeline—from raw → staging → incremental → SCD → marts → testing → deployment—including:
- dbt internal components involved
- SQL / Python role
- Why the step exists
- How Git supports each step
STEP 1: RAW DATA EXTRACT (Sources)
What happens
- Data lands in the Snowflake raw schema via:
  - COPY INTO / Snowpipe
- dbt does NOT ingest data, but registers sources
dbt Internal Components
- Source config (sources.yml)
- Parser
- Adapter (Snowflake adapter)
SQL / Python Role
SELECT * FROM {{ source('raw','customers_raw') }}
- SQL references raw tables
- Python (optional): external ingestion pipelines
Why required
- Establishes a trusted entry point
- Enables lineage tracking
- Prevents direct table coupling
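For reference, the sources.yml that registers this raw table might look like the following sketch (names taken from the examples above; the freshness block is an optional assumption):

```yaml
version: 2

sources:
  - name: raw
    database: RETAIL_DB
    schema: RAW
    tables:
      - name: customers_raw
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 24, period: hour}
```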
Git Role
- Version control for sources.yml
- Tracks schema evolution
- Enables rollback if source changes break the pipeline
STEP 2: STAGING LAYER
What happens
- Clean, standardize, rename columns
- Apply light transformations
dbt Internal Components
- Model parser
- Jinja engine (macros)
- Compiler
- Execution engine
SQL Example
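The SQL example appears to have been lost from this section; a minimal staging-model sketch (column names assumed):

```sql
-- models/staging/stg_customers.sql
SELECT
    customer_id,
    TRIM(LOWER(email))   AS email,
    INITCAP(first_name)  AS first_name,
    updated_at
FROM {{ source('raw', 'customers_raw') }}
```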
Why required
- Removes inconsistencies
- Creates a reusable canonical layer
Git Role
- Tracks column renaming decisions
- Enables code review for transformations
- Collaboration across teams
STEP 3: INCREMENTAL MODELS
What happens
- Load only new or changed data
- Avoid full table rebuilds
dbt Internal Components
- Materialization engine (incremental)
- Macro engine (is_incremental())
- Adapter-specific SQL generation
SQL Example
SELECT *
FROM {{ ref('stg_customers') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
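The filter above is typically paired with a config block at the top of the model naming the materialization and a unique key (sketch; the key column is assumed):

```sql
{{ config(
    materialized='incremental',
    unique_key='customer_id'
) }}
```

With unique_key set, dbt generates a MERGE (on Snowflake) instead of a plain append, so late updates to existing rows are handled.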
Why required
- Improves performance
- Reduces compute cost in Snowflake
Git Role
- Tracks logic changes for incremental filters
- Prevents accidental full-refresh bugs
- Enables safe experimentation via branches
STEP 4: SCD TYPE 2 (HISTORICAL TRACKING)
What happens
- Tracks historical changes in dimensions
- Maintains:
  - surrogate keys
  - effective dates
  - current flag
dbt Internal Components
- Custom macros (SCD logic)
- Incremental + MERGE strategy
- Dependency graph (ref)
SQL Example (MERGE)
MERGE INTO dim_customers t
USING stg_customers s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND t.email <> s.email THEN
UPDATE SET is_current = FALSE, effective_end_date = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (...)
Why required
- Enables:
  - historical reporting
  - audit tracking
  - slowly changing dimensions
Git Role
- Critical for:
  - tracking SCD logic changes
  - auditing business rule evolution
- Enables peer review for complex logic
STEP 5: DATA MART (GOLD LAYER)
What happens
- Build fact and dimension tables
- Optimize for BI queries
dbt Internal Components
- Ref graph (DAG builder)
- Materializations (table/view)
- Query planner
SQL Example
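The SQL example appears to be missing here; an illustrative fact model joining staging to a dimension (table and column names assumed):

```sql
-- models/marts/fct_orders.sql
SELECT
    o.order_id,
    o.customer_id,
    c.email,
    o.order_total,
    o.order_date
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
  ON o.customer_id = c.customer_id
```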
Why required
- Serves:
  - dashboards
  - analytics
- Improves performance via denormalization
Git Role
- Tracks business logic
- Maintains consistency across metrics
- Enables versioned analytics definitions
STEP 6: TESTING & VALIDATION
What happens
- Validate data quality:
  - uniqueness
  - null checks
  - referential integrity
dbt Internal Components
- Test runner
- Schema YAML parser
- Assertion engine
YAML Example
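The YAML example appears to be missing; a schema.yml sketch covering the three checks above (model and column names assumed):

```yaml
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique        # uniqueness
          - not_null      # null check
  - name: fct_orders
    columns:
      - name: customer_id
        tests:
          - relationships:            # referential integrity
              to: ref('stg_customers')
              field: customer_id
```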
Why required
- Prevents bad data propagation
- Ensures trust in analytics
Git Role
- Tracks test coverage
- Enforces quality via CI/CD
- Prevents bad merges
STEP 7: DOCUMENTATION & DEPLOYMENT
What happens
- Generate lineage docs
- Deploy models via jobs
dbt Internal Components
- Docs generator
- manifest.json
- Run artifacts
- DAG renderer
Commands
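The commands for this step were not included in the original; the usual ones are:

```bash
dbt docs generate   # builds manifest.json + catalog.json
dbt docs serve      # serves the lineage/docs site locally
dbt run             # deploys models (usually via a scheduled job)
```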
Why required
- Provides:
  - lineage graph
  - column-level documentation
- Helps onboarding & governance
Git Role
- Stores:
  - documentation YAML
- Enables:
  - versioned lineage
  - auditability
Internal Flow inside dbt:
1. Parser: reads SQL + YAML
2. Jinja/Macro engine: expands dynamic logic
3. Compiler: converts to raw SQL
4. Adapter (Snowflake): optimizes SQL
5. Execution engine: runs queries
6. Test engine: validates output
7. Docs generator: builds lineage + metadata
Key Takeaways
- dbt is a compiler + orchestrator, not an ingestion tool
- SQL is the primary transformation layer
- Python is optional (advanced models)
- Git enables:
- collaboration
- versioning
- CI/CD
- Each step builds trust, performance, and scalability
dbt Internal Components: dbt Models Using SQL or Python
This diagram presents a high-level architectural blueprint of how dbt Core acts as the orchestration and transformation layer between raw source data and a formalized data warehouse. While the diagrams make it look like data "flows through" dbt, the fundamental takeaway for an architect is that dbt only handles the logic and orchestration (the control plane); the data processing (the data plane) is entirely pushed down to the target warehouse.
1. Source Data (Raw)
The Ingestion Layer: This represents the raw, "bronze-layer" data that has been landed in your cloud data warehouse (e.g., Snowflake, BigQuery) by your ELT tools (like Fivetran, Airbyte, or custom Python scripts).
- SME Insight: From a dbt perspective, these are entirely external tables that dbt does not manage. We only interact with them through {{ source() }} macros, which allow us to define their location and properties (like freshness) in YAML.
- Architectural Role: It is the foundational substrate for all downstream transformation. The data must be accessible by the dbt adapter's credentials.
2. dbt Project & Model Definitions (.sql & .py)
The Blueprint: This is the developer's workspace—the codebase managed in Git. It contains all the instructions for how the data should be modeled.
SQL Model: The primary declarative method. It uses Select Statements combined with Jinja templating to create modular, reusable logic.
config() macros at the top define how the database object will be materialized (as a table, view, etc.).
Python Model (SME Note): Introduced more recently, these models allow data teams to leverage Python libraries (like pandas or scikit-learn) for operations that are impossible or highly complex in SQL (such as machine learning inference or advanced statistical transformations). They are typically executed in the target warehouse's native Python environment (e.g., Snowflake Snowpark or BigQuery's pandas API).
Principal Engineer View: The dbt project is where the principles of software engineering (version control, modularity, DRY code) are applied to data engineering.
3. Manifest/Parser
The Project Cataloger: The Manifest/Parser is the "brain" that reads the entire project. It scans every .sql, .py, and .yml file in your repository to build an in-memory inventory of your models, tests, seeds, sources, and snapshots.
Dependency Resolution (DAG): Its critical output is the creation of the Directed Acyclic Graph (DAG). By reading the {{ ref() }} and {{ source() }} macros, it determines the order in which models must be built. If model_B refs model_A, it knows to build model_A first.
Validation: The parser validates the syntax, project structure, and config settings to ensure the project is "runnable." It's the gatekeeper: if the parser finds an error, dbt stops.
4. Jinja Renderer
The Dynamic Engine: The parser identified where {{ ref() }} was used, but the Jinja Renderer is what actually compiles it. This is where the magic of "declarative SQL" comes from.
Environment Context: The renderer replaces abstract references like {{ ref('customers') }} with the fully qualified, environment-specific database paths (e.g., prod_db.core.customers in production, or dev_db.alice_schema.customers in development). This single feature makes code-switching between environments seamless.
Logic Execution: It processes {% if %}, {% for %}, and custom macros to dynamically generate different SQL or configurations based on input variables or environment settings.
5. Compilation Engine
The Transpiler: Now that Jinja has replaced the abstract terms, the Compilation Engine takes over. Its job is to generate the final, pure-dialect SQL that is ready to be executed.
Wrapping the Logic: This engine takes your SELECT query and wraps it in the necessary DDL/DML, depending on the materialization setting. For materialized='table', it will wrap your SELECT in a CREATE OR REPLACE TABLE AS .... For an incremental model, it will generate a complex MERGE or INSERT statement based on the logic you defined.
Adapter Awareness: The compilation engine uses the Target DB Adapters (see below) to know exactly what SQL dialect (Snowflake, BigQuery, Postgres) to generate.
6. Artifacts Generator
The Persistent Memory: As dbt runs, it saves its state and knowledge into two main files:
manifest.json: A full, JSON-readable dump of everything the Manifest/Parser found. This is dbt's "project-at-rest" state. It is used downstream by other tools and for state-based execution, allowing dbt to run only the models that have changed since the last run.
catalog.json: A JSON file generated after a dbt docs generate command. It contains the schema, column definitions, and data types from the target warehouse for all managed models.
Metadata Foundation: These artifacts are essential for dbt’s automated documentation website and for integration with modern data catalogs.
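State-based execution is driven from the CLI by pointing dbt at a saved copy of the production manifest (the directory path below is illustrative):

```bash
# Rebuild only models modified since the saved production manifest,
# plus everything downstream of them
dbt run --select state:modified+ --state ./prod_artifacts
```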
7. Execution Engine (dbt-core)
The Conductor: The Execution Engine is the final orchestrator. It doesn't perform data work itself, but it knows when and where that work should happen.
Job Management: It takes the generated SQL/Python from the Compilation Engine and dispatches it to the cloud data warehouse via the established database connection. It manages the threads (concurrency), sending parallel queries to the warehouse when the DAG allows.
SME Insight: The Execution Engine handles the lifecycle of the job. It establishes the connection, starts the query, waits for the result, records performance metrics (start time, end time, status), and handles the commit/rollback logic on completion or error. It is the component that makes dbt a full-fledged deployment coordinator.
8. Target DB Adapters (Snowflake, BigQuery, etc.)
The Translation Layer: This is the abstraction layer that allows dbt Core to remain dialect-agnostic. The adapters (e.g., dbt-snowflake, dbt-bigquery) translate abstract dbt commands into the specific API calls and SQL dialects required by that database.
Database-Specific Operations: The adapter knows that a table creation statement in Snowflake is different from BigQuery. It handles the nuances of connection authentication, data types, and transactional control.
9. Transformed Data Models & Transformation Execution
The Destination: This is the finalized, modeled data that is now ready for analytics (the "gold-layer").
Pushed-Down Execution: This is the single most critical concept. All data movement and processing occur entirely inside the Target Cloud Data Warehouse.
SQL models are executed as CREATE TABLE AS ... or INSERT/MERGE SQL commands directly against the warehouse engine.
Python models are packaged and sent to the warehouse's internal runtime (like Snowpark UDFs), where they execute in specialized, serverless compute containers close to the data.
Output: The result is high-quality, trusted tables like dim_customers and fct_orders, which are query-ready for your BI tools.
Wednesday, March 25, 2026
dbt_codegen package
What is dbt_codegen (Quick Context)
dbt_codegen is a macro package that auto-generates repetitive dbt code (SQL + YAML) by introspecting your warehouse metadata.
It reduces:
- Manual boilerplate
- Human error in YAML/docs/tests
- Time-to-model for new datasets
Core Use Cases (with Architecture Thinking)
1) Generate Source YAML from Raw Tables
Why it matters
- Standardizes ingestion layer
- Enables lineage + freshness + testing
Architect Insight
- Use this when onboarding new schemas from ingestion tools (Fivetran, Airbyte)
- Combine with:
- freshness tests
- source-level SLAs
2) Generate Base Models (Staging Layer)
Why it matters
- Enforces consistent staging patterns
- Accelerates Medallion architecture (Bronze → Silver)
- Pair with the naming convention: stg_<source>__<table>
- Add:
  - column casting
  - null handling
  - deduplication logic
3) Generate Model YAML (Docs + Tests)
Why it matters
- Ensures every model is documented and testable
- Critical for governance + data contracts
Architect Insight
- Use this to enforce:
  - Data quality SLAs
  - Contract-first modeling
- Extend with:
  - accepted_values
  - relationships
4) Generate Column Lists Dynamically
Why it matters
- Avoids SELECT *
- Prevents breakage when the schema evolves
Architect Insight
- Combine with:
  - incremental models
  - snapshot logic
- Helps with schema-drift handling
5) Scaffold Data Quality Tests
Why it matters
- Accelerates data quality adoption
Architect Insight
- Use for:
- onboarding new teams
- enforcing platform-wide testing standards
6) Automate Documentation at Scale
Why it matters
- Documentation is often neglected → dbt_codegen fixes that
Example Workflow
- Run the macro
- Paste the output into a .yml file
- Add descriptions later
Architect Insight
- Integrate into CI/CD:
- auto-generate → PR → review
7) Metadata-Driven Development
Why it matters
- Moves toward declarative analytics engineering
Example Pattern
- Introspect schema → generate models → enrich logic
Architect Insight
- Combine with:
- information_schema queries
- data catalogs
- AI-assisted model generation
How dbt_codegen Works in dbt Cloud (Internals)
Step-by-step Flow
1) Metadata Introspection
- Queries warehouse system tables (information_schema.columns)
2) Macro Execution
- Jinja macros build:
- SQL
- YAML
- Tests
3) Output Rendering
- Output appears in:
- dbt Cloud IDE
- CLI logs
4) Manual/Automated Integration
- Developer copies OR pipelines inject into repo
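In practice the flow above is driven by dbt run-operation calls against the codegen macros. A sketch (schema, table, and model names are placeholders):

```bash
# 1) Source YAML from warehouse metadata
dbt run-operation generate_source --args '{"schema_name": "raw", "database_name": "RETAIL_DB"}'

# 2) Base/staging model for one source table
dbt run-operation generate_base_model --args '{"source_name": "raw", "table_name": "customers_raw"}'

# 3) Model YAML scaffold (columns ready for docs/tests)
dbt run-operation generate_model_yaml --args '{"model_names": ["stg_customers"]}'
```

Each command prints generated SQL/YAML to the terminal (or dbt Cloud IDE), which you then paste into the repo or inject via a pipeline.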
End-to-End dbt ecosystem
This diagram visualizes the complete end-to-end dbt ecosystem incorporating all the technologies we discussed (Snowflake, Databricks, Fabric, Python models, Airflow, and advanced CI/CD).
The flow moves from left (Sources) to right (Marts and Delivery). Here is a detailed breakdown of each architectural stage and the data flow.
Data Sources & The Staging Layer (The DLH/W)
This is the intake and initial storage area. Raw data arrives from operational systems via Airflow-orchestrated ingestion tasks (using Fivetran/Airbyte) and lands in the unified Data Lakehouse/Warehouse (marked as a large cube supporting Snowflake Horizon, Databricks Unity, and MS Fabric OneLake).
The diagram illustrates two critical concepts we’ve covered:
Layered Storage: The warehouse is partitioned into a Bronze/Raw zone (raw data) and a Silver/Staging zone (initial dbt stg_ clean-up models).
The Codegen Loop: A specialized codegen operation is shown running outside the main DAG execution. It queries the data warehouse INFORMATION_SCHEMA and automatically generates the YAML source definitions and SQL base models. This workflow accelerates development time and reduces manual errors when onboarding new datasets.
The Transformation & Orchestration Layer (The Core DAG)
This is the heart of the modern dbt solution. The execution is segmented into the internal steps crucial for production resilience.
The Internal dbt Lifecycle:
Model Parsing: dbt-core (or Cloud) builds the dependency graph (DAG) by interpreting all SQL ref() macros.
Compile & Manifest: The manifest.json is generated, which is the artifact required by Airflow Cosmos for dynamic task generation.
Model Execution (dbt build): The DAG is executed against the warehouse. The diagram explicitly displays polyglot engineering (SQL mixed with Python):
SQL Path: The execution of standard relational models (int_campaign_performance flowing into fct_ad_performance), which are pushed down to the warehouse engine.
Python Path: The execution of a complex model (py_ml_customer_segmentation). This triggers a distinct Snowpark Python call (in Snowflake) or a PySpark call (in Databricks/Fabric), processing the statistics and saving the result.
Test & Governance: Once built, data quality assertions are run (dbt-expectations) and YAML constraints (Schema Contracts) are enforced.
Orchestration & Governance Sync:
Airflow DAG: The Airflow orchestration (via Astronomer Cosmos provider) maps directly to the dbt steps: Parse -> Compile -> Execute (Dynamic Task).
Catalog Sync: Successful dbt builds are synchronized with the respective warehouse governance tools, such as Databricks Unity Catalog, ensuring fine-grained access control is applied to the newly created data marts.
Production & Delivery
This column represents the finalized, production-grade state. The built, tested, and governed models are available in the Gold/Marts presentation schema. The data is now available to consumption engines for different business use cases: BI Dashboards, ML models (for inference), and Operational Sync (Reverse ETL to push data back to Salesforce, etc.).
Advanced Platform Engineering: CI/CD Pipeline (Bottom Right)
This flow details how a Principal Engineer ensures data platform stability:
Commit Trigger: A developer pushes code.
Slim CI Build: The pipeline runs only the modified models and their downstream dependents. It uses dbt build with the --defer and --state flags to "defer" to the production schema for unmodified upstream tables, saving execution cost.
Blue/Green Deployment: For the final release, a Zero-Copy Clone (Snowflake specific) is created to instantly swap production and staging environments, allowing zero-downtime, safe rollbacks.
