Introduction
SQLMesh is a next-generation data transformation framework that offers virtual data environments, column-level lineage, automatic data contracts, and built-in CI/CD. This tutorial will walk you through using SQLMesh with the VS Code extension and the sushi example project.
What You’ll Learn
How to set up SQLMesh with VS Code
Understanding the sushi example project
Using the VS Code extension features
Creating and running SQLMesh plans
Deploying to production
Prerequisites
Before starting, ensure you have:
Python 3.8 or higher installed
VS Code installed
Git installed
Basic understanding of SQL and data modeling
System Requirements
macOS, Linux, or Windows
At least 4GB of RAM
2GB of free disk space
Setting Up the Environment
Step 1: Create a Project Directory
# Create a directory for the tutorial
mkdir sqlmesh-tutorial
cd sqlmesh-tutorial
Step 2: Set Up Python Virtual Environment
# Create virtual environment
python3 -m venv .venv
# Activate virtual environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate
Step 3: Install SQLMesh with LSP Support
# Install SQLMesh with VS Code Language Server Protocol support
pip install 'sqlmesh[lsp]'
# Verify installation
sqlmesh --version
Installing the VS Code Extension
Step 1: Install from VS Code Marketplace
Open VS Code
Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)
Search for “SQLMesh”
Click “Install” on the official SQLMesh extension by TobikoData
Step 2: Configure Python Interpreter
Open Command Palette (Ctrl+Shift+P / Cmd+Shift+P)
Type “Python: Select Interpreter”
Choose the Python interpreter from your virtual environment (.venv)
Step 3: Verify Extension Installation
The SQLMesh icon should appear in the Activity Bar on the left side of VS Code.
Setting Up the Sushi Example Project
Step 1: Clone the SQLMesh Examples Repository
# Clone the examples repository
git clone https://github.com/TobikoData/sqlmesh-examples.git
# Navigate to the sushi moderate example
cd sqlmesh-examples/001_sushi/2_moderate
Step 2: Install Additional Dependencies
# Install Jupyter for notebooks (optional)
pip install jupyter
Step 3: Open the Project in VS Code
Open the 2_moderate folder in VS Code (for example, by running code . from the project directory).
Understanding the Sushi Project Structure
Project Overview
The sushi example simulates a fictional sushi restaurant’s data pipeline with:
Entities: Waiters, Customers, Orders
Transactions: Order details and items
Analytics: Revenue calculations and top performers
Directory Structure
Key Models
Source Models (from raw schema):
Transformation Models:
Working with the VS Code Extension
Feature 1: Interactive Lineage Visualization
Open any model file (e.g., models/top_waiters.sql)
Click on the “Lineage” tab in the bottom panel
Explore the interactive lineage graph:
Click nodes to see model details
Trace dependencies upstream and downstream
View column-level lineage
Feature 2: Intelligent Code Assistance
When editing models, the extension provides:
Auto-completion: Press Ctrl+Space for model names and SQLMesh keywords
Tooltips: Hover over model names to see descriptions, linter diagnostics, and the rendered query
Go-to-definition: Ctrl+Click on model names to navigate
Feature 3: Real-time Error Detection
The extension validates your models in real-time:
Syntax errors appear with red squiggles
Missing dependencies are highlighted
Invalid configurations are flagged
Feature 4: Format on Save
Enable automatic formatting:
Go to VS Code Settings
Search for “Format on Save”
Enable the option
SQLMesh models will auto-format when saved
Creating and Modifying Models
Understanding Model Structure
Let’s examine a model from the sushi project: top_waiters.sql
/*
  View of 10 waiters with highest revenue on most recent day of data.
*/
MODEL (
  name sushimoderate.top_waiters,
  kind VIEW,
  cron '@daily',
  grain waiter_id,
  audits (
    unique_values(columns=[waiter_id])
  )
);

SELECT
  waiter_id::INT AS waiter_id, /* Waiter ID */
  name::TEXT AS waiter_name, /* Waiter name */
  revenue::DOUBLE AS revenue /* Waiter revenue on most recent day of data */
FROM sushimoderate.waiter_revenue_by_day AS r
LEFT JOIN sushimoderate.waiter_names AS n
  ON r.waiter_id = n.id
WHERE
  ds = (
    SELECT MAX(ds)
    FROM sushimoderate.waiter_revenue_by_day
  )
ORDER BY revenue DESC
LIMIT 10;
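The filter-and-rank logic of this query can be sketched in plain pandas. The DataFrame below is made-up illustrative data, not part of the project:

```python
import pandas as pd

# Hypothetical stand-in for sushimoderate.waiter_revenue_by_day.
rev = pd.DataFrame([
    {"waiter_id": 1, "revenue": 100.0, "ds": "2023-01-01"},
    {"waiter_id": 2, "revenue": 80.0, "ds": "2023-01-02"},
    {"waiter_id": 1, "revenue": 50.0, "ds": "2023-01-02"},
])

# Keep only the most recent day (the WHERE ds = (SELECT MAX(ds) ...) clause),
# then sort by revenue and take the top 10 (ORDER BY revenue DESC LIMIT 10).
latest = rev[rev["ds"] == rev["ds"].max()]
top = latest.sort_values("revenue", ascending=False).head(10)
print(top["waiter_id"].tolist())  # [2, 1]
```

Note that only the rows from 2023-01-02 survive the filter, so waiter 1's higher revenue from the previous day is ignored, just as in the SQL.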
Model Components
MODEL block: Defines metadata
SELECT statement: The transformation logic
Creating a New Model
Let’s create a new model for analyzing customer ordering patterns: customer_order_patterns.sql
-- models/customer_order_patterns.sql
MODEL (
  name sushimoderate.customer_order_patterns,
  kind VIEW,
  cron '@daily',
  grain 'customer_id',
  description 'Customer ordering patterns and preferences'
);

SELECT
  o.customer_id::INT AS customer_id,
  COUNT(DISTINCT o.id)::INT AS total_orders,
  COUNT(DISTINCT oi.item_id)::INT AS unique_items_ordered,
  SUM(oi.quantity)::INT AS total_items_quantity,
  AVG(oi.quantity)::DOUBLE AS avg_items_per_order,
  MAX(o.ds)::DATE AS last_order_date
FROM sushimoderate.orders AS o
JOIN sushimoderate.order_items AS oi
  ON o.id = oi.order_id
LEFT JOIN sushimoderate.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY o.customer_id;
Adding Audits
Add data quality checks to your model. Create an audits folder inside the 2_moderate project and add a customer_model_audit.sql file with the following:
AUDIT (
  name does_not_exceed_threshold
);

SELECT * FROM @this_model
WHERE @column >= @threshold;
Running Plans and Deployments
Understanding Environments
SQLMesh uses virtual data environments:
dev: Development environment for testing changes
prod: Production environment with validated data
Step 1: Create a Development Plan
sqlmesh plan dev
Step 2: Apply the Plan
Type y to apply the changes. SQLMesh will:
Create the new model
Backfill historical data
Update the lineage
Step 3: Validate the Results
# Query the new model
sqlmesh fetchdf "SELECT * FROM sushimoderate__dev.customer_order_patterns LIMIT 5"
# Run tests
sqlmesh test
Step 4: Deploy to Production
# Create a production plan
sqlmesh plan
# Review changes
# Apply when ready
Advanced Features
1. Python Models
Create models using Python for complex logic:
# models/customer_segments.py
import typing as t

import pandas as pd
from sqlmesh import model


@model(
    "sushimoderate.customer_segments",
    kind="FULL",  # FULL models are rebuilt in their entirety on every run
    cron="@daily",
    columns={
        "customer_id": "INT",
        "segment": "TEXT",
        "revenue_percentile": "DOUBLE",
    },
)
def execute(context, start, end, execution_time, **kwargs: t.Any) -> t.Iterator[pd.DataFrame]:
    # Read customer revenue data. Because this is a FULL model, we read all
    # data from the upstream model rather than filtering by start/end. If
    # sushimoderate.customer_revenue_by_day is incremental, this query still
    # fetches ALL of its data, which is typical for a FULL downstream model.
    df = context.fetchdf(
        """
        SELECT
          customer_id,
          SUM(revenue) AS total_revenue
        FROM sushimoderate.customer_revenue_by_day
        GROUP BY customer_id
        """
    )

    if df.empty:
        # No upstream revenue data: yield from an empty generator,
        # as recommended by SQLMesh for empty results.
        yield from ()
        return

    # Calculate percentiles, guarding against all-NaN revenue.
    if df["total_revenue"].notna().any():
        df["revenue_percentile"] = df["total_revenue"].rank(pct=True)
    else:
        df["revenue_percentile"] = pd.NA

    # Assign segments based on percentile bins.
    if df["revenue_percentile"].notna().any():
        df["segment"] = pd.cut(
            df["revenue_percentile"],
            bins=[0, 0.25, 0.75, 1.0],
            labels=["Low", "Medium", "High"],
            include_lowest=True,  # important for including the 0th percentile
        )
    else:
        df["segment"] = pd.NA

    # Select and order columns explicitly to match the 'columns'
    # definition in the @model decorator.
    yield df[["customer_id", "segment", "revenue_percentile"]]
2. Incremental Models
Handle large datasets efficiently. Create a new SQL model file ‘daily_order_summary.sql’:
MODEL (
  name sushimoderate.daily_order_summary,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column ds -- 'ds' is the time column used for incremental processing
  ),
  start '2023-01-01',
  cron '@daily',
  grain (ds, waiter_id), -- The grain of the output table
  description 'Daily order summary by waiter, including order count and total revenue.'
);

SELECT
  o.ds::DATE AS ds, -- The date of the summary
  o.waiter_id::INT AS waiter_id, -- The waiter ID
  COUNT(DISTINCT o.id)::INT AS order_count, -- Unique orders handled by the waiter that day
  SUM(oi.quantity * i.price)::DOUBLE AS total_revenue -- Total revenue from those orders
FROM raw.orders AS o
JOIN raw.order_items AS oi
  ON o.id = oi.order_id AND o.ds = oi.ds -- Join orders with order_items on order ID and date
JOIN raw.items AS i
  ON oi.item_id = i.id AND oi.ds = i.ds -- Join on item ID and date to get that day's price
WHERE
  o.ds BETWEEN @start_ds AND @end_ds -- Filter for the incremental time range
GROUP BY
  o.ds,
  o.waiter_id;
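The key idea is that each run only processes rows whose ds falls inside the interval SQLMesh passes in via @start_ds and @end_ds, so several small runs cover the same rows as one big backfill. A toy pandas sketch with hypothetical data illustrates this:

```python
import pandas as pd

# Hypothetical stand-in for the raw orders data.
raw_orders = pd.DataFrame([
    {"ds": "2023-01-01", "waiter_id": 1, "revenue": 10.0},
    {"ds": "2023-01-02", "waiter_id": 1, "revenue": 20.0},
    {"ds": "2023-01-02", "waiter_id": 2, "revenue": 5.0},
])

def run_interval(start_ds: str, end_ds: str) -> pd.DataFrame:
    # Mirrors WHERE o.ds BETWEEN @start_ds AND @end_ds followed by GROUP BY.
    mask = raw_orders["ds"].between(start_ds, end_ds)
    return raw_orders[mask].groupby(["ds", "waiter_id"], as_index=False)["revenue"].sum()

# Two single-day runs together produce the same rows as one two-day backfill.
day1 = run_interval("2023-01-01", "2023-01-01")
day2 = run_interval("2023-01-02", "2023-01-02")
print(len(day1), len(day2))  # 1 2
```

Because the grouping key includes ds, rows produced by different runs never overlap, which is what makes the model safe to process one interval at a time.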
3. Macros
Create reusable SQL functions:
# macros/date_utils.py
from sqlmesh import macro
@macro()
def fiscal_quarter(evaluator, date_col):
"""Convert date to fiscal quarter (Q1 starts in April)"""
return f"CASE WHEN MONTH({date_col}) >= 4 THEN 'Q' || ((MONTH({date_col}) - 4) / 3 + 1) ELSE 'Q' || ((MONTH({date_col}) + 8) / 3 + 1) END"
Use in models:
SELECT
@fiscal_quarter(ds) AS fiscal_quarter,
SUM(revenue) AS quarterly_revenue
FROM raw.orders
GROUP BY fiscal_quarter;
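As a sanity check on the arithmetic, the same month-to-quarter mapping can be reproduced with Python integer division (this helper is illustrative, not part of SQLMesh):

```python
def fiscal_quarter(month: int) -> str:
    # Fiscal Q1 starts in April: months 4-6 -> Q1, 7-9 -> Q2,
    # 10-12 -> Q3, and 1-3 -> Q4.
    if month >= 4:
        return f"Q{(month - 4) // 3 + 1}"
    return f"Q{(month + 8) // 3 + 1}"

print([fiscal_quarter(m) for m in range(1, 13)])
# ['Q4', 'Q4', 'Q4', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3']
```

One caveat: the Python sketch uses // for integer division, while the macro's SQL uses /. In dialects where / performs floating-point division (DuckDB, for example), you would need integer division or FLOOR in the generated SQL for the quarter numbers to come out whole.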
4. Unit Tests
Create comprehensive tests:
# tests/test_customer_order_patterns.yaml
test_customer_order_patterns:
  model: sushimoderate.customer_order_patterns
  inputs:
    sushimoderate.orders:
      - id: 1
        customer_id: 101
        waiter_id: 1
        ds: '2023-01-01'
      - id: 2
        customer_id: 101
        waiter_id: 2
        ds: '2023-01-02'
    sushimoderate.order_items:
      - order_id: 1
        item_id: 1
        quantity: 2
      - order_id: 2
        item_id: 2
        quantity: 1
    sushimoderate.customers:
      - customer_id: 101
  outputs:
    query:
      - customer_id: 101
        total_orders: 2
        unique_items_ordered: 2
        total_items_quantity: 3
        avg_items_per_order: 1.5
        last_order_date: '2023-01-02'
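To double-check the expected values, the same aggregates can be reproduced with a few lines of pandas, using the same two orders and two order items as the test inputs:

```python
import pandas as pd

orders = pd.DataFrame([
    {"id": 1, "customer_id": 101, "ds": "2023-01-01"},
    {"id": 2, "customer_id": 101, "ds": "2023-01-02"},
])
order_items = pd.DataFrame([
    {"order_id": 1, "item_id": 1, "quantity": 2},
    {"order_id": 2, "item_id": 2, "quantity": 1},
])

# Join orders to their items, then aggregate per customer,
# mirroring the model's JOIN and GROUP BY.
joined = orders.merge(order_items, left_on="id", right_on="order_id")
summary = joined.groupby("customer_id").agg(
    total_orders=("id", "nunique"),
    unique_items_ordered=("item_id", "nunique"),
    total_items_quantity=("quantity", "sum"),
    avg_items_per_order=("quantity", "mean"),
    last_order_date=("ds", "max"),
)
print(summary.loc[101].to_dict())
```

The computed row matches the expected output block: 2 orders, 2 unique items, 3 items in total, 1.5 items per order on average, and 2023-01-02 as the last order date.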
Best Practices
1. Model Organization
Use clear, descriptive model names
Group related models in subdirectories
Follow a consistent naming convention
2. Documentation
Add descriptions to all models
Document column purposes using column_descriptions
Include business logic explanations in comments
3. Testing Strategy
Write unit tests for critical models
Use audits for data quality checks
Test edge cases and null handling
4. Performance Optimization
Use appropriate model kinds (VIEW vs FULL vs INCREMENTAL)
Add indexes for frequently queried columns
Monitor query performance in production
5. Version Control
Commit config.yaml and all model files
Use meaningful commit messages
Review changes before deploying to production
6. Development Workflow
Create a feature branch
Develop models in VS Code
Run sqlmesh plan dev to test
Validate results with sqlmesh fetchdf
Run tests with sqlmesh test
Create a pull request
Deploy to production after review
Troubleshooting
Common Issues
Extension not detecting models: Make sure VS Code is opened at the project root (the directory containing config.yaml) and that the selected Python interpreter is the virtual environment where sqlmesh[lsp] is installed.
Lineage not displaying: Confirm the project loads without errors; fix anything reported in the Problems panel, then reopen the Lineage tab.
Plan failures: Read the error output from sqlmesh plan; common causes are syntax errors in models, references to missing upstream models, or an invalid connection configuration in config.yaml.
Getting Help
SQLMesh Documentation:
https://sqlmesh.readthedocs.io/
Summary
You’ve now learned how to:
Set up SQLMesh with the VS Code extension
Work with the sushi example project
Create and modify models
Use advanced features like Python models and macros
Deploy changes safely using virtual environments
SQLMesh’s powerful features, like virtual data environments and column-level lineage, make it an excellent choice for modern data transformation workflows. Continue exploring the documentation and examples to master more advanced features!
Head of Education & Evangelism at Tobiko
Founder, Insights x Design