A Snowflake lakehouse implementation on AWS, provisioned with Infrastructure as Code (Terraform) and deployed automatically via GitHub Actions.
This repository tracks the build of an end-to-end Snowflake data engineering solution—from source data analysis and ingestion design to layered stage/raw/curated modeling, automation with DAG + GitHub Actions, dynamic tables, and Streamlit dashboards—using Snowpark Python and marketplace datasets.
The project demonstrates a complete data lakehouse implementation with:
- Infrastructure as Code: Terraform configurations for AWS (S3, IAM) and Snowflake resources
- Layered Data Architecture: Stage → Raw → Curated data modeling pattern
- Automated Ingestion: Snowpipe for real-time data loading from S3
- Data Transformation: Snowpark Python for ETL/ELT processing
- Orchestration: DAG-based workflows with GitHub Actions CI/CD
- Dynamic Tables: Incremental data processing with automatic refresh
- Visualization: Streamlit dashboards for data exploration
- Marketplace Integration: Leveraging Snowflake marketplace datasets
```
.
├── infra/                          # Infrastructure as Code (Terraform)
│   ├── platform/tf/                # Root orchestration module (entry point)
│   │   ├── main.tf                 # Orchestrates AWS + Snowflake modules
│   │   ├── locals.tf               # Configuration parsing from JSON
│   │   ├── variables.tf            # Input variables
│   │   ├── outputs.tf              # Module outputs
│   │   ├── versions.tf             # Provider version constraints
│   │   ├── backend.tf              # Terraform Cloud backend
│   │   ├── providers-aws.tf        # AWS provider configuration
│   │   └── providers-snowflake.tf  # Snowflake provider configuration
│   ├── aws/tf/                     # AWS child module
│   │   ├── main.tf                 # S3 bucket + IAM role orchestration
│   │   ├── modules/                # Nested modules (s3, iam, iam_role_final, s3_event_notification)
│   │   └── templates/              # Bucket policy templates
│   └── snowflake/tf/               # Snowflake child module
│       ├── main.tf                 # Warehouses, databases, stages, pipes
│       └── modules/                # Nested modules (warehouse, database, stage, etc.)
├── input-jsons/                    # Configuration files
│   ├── aws/config.json             # AWS resource configuration
│   └── snowflake/config.json       # Snowflake resource configuration
├── snowflake-ddl/                  # Snowflake DDL scripts
│   ├── 00_account/                 # Account-level objects (resource monitors, network policies)
│   ├── 01_security/                # Roles, users, grants
│   ├── 02_warehouses/              # Virtual warehouses
│   ├── 03_databases/               # Database definitions
│   ├── 04_storage/                 # Storage integrations & external stages
│   ├── 05_schemas/                 # Schema-level objects (tables, views)
│   ├── 06_pipes/                   # Snowpipe definitions
│   ├── 07_tasks/                   # Task definitions
│   ├── 08_functions/               # UDFs and UDTFs
│   ├── 09_procedures/              # Stored procedures
│   ├── environments/               # Environment configs (dev/staging/prod)
│   └── scripts/                    # Utility scripts (deploy, rollback, validate)
├── .github/
│   └── workflows/                  # GitHub Actions workflows (ci.yaml, etc.)
├── .devcontainer/                  # Dev container configuration
├── cliff.toml                      # git-cliff changelog configuration
└── utils/                          # Utility scripts
```
This project uses a 4-phase architecture:

1. AWS foundation (Terraform): S3 bucket for data storage; IAM role with a placeholder trust policy
2. Snowflake core (Terraform): warehouses, databases, schemas; file formats and the storage integration
3. Ingestion objects: external stages, tables, and Snowpipes
4. Post-deployment wiring: update the IAM role trust policy with Snowflake's IAM User ARN and External ID; configure S3 bucket notifications to trigger Snowpipe auto-ingest
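The Phase 4 notification wiring can be sketched with boto3's S3 API. The helper below only builds the notification configuration dict; the queue ARN, bucket name, and prefix are placeholders, not values from this repository (in practice the ARN comes from the pipe's `notification_channel` column in `SHOW PIPES`):

```python
def build_snowpipe_notification(queue_arn: str, prefix: str = "") -> dict:
    """Build the S3 event notification config that triggers Snowpipe auto-ingest.

    Snowpipe auto-ingest listens on an SQS queue; S3 must publish
    ObjectCreated events to that queue for new files to be loaded.
    """
    config = {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }
    if prefix:
        # Optionally restrict notifications to a key prefix (e.g. "raw/")
        config["QueueConfigurations"][0]["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
        }
    return config


config = build_snowpipe_notification(
    "arn:aws:sqs:us-east-1:123456789012:sf-snowpipe-example",  # placeholder ARN
    prefix="raw/",
)

# The real call (needs AWS credentials, so it is left disabled here):
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="my-lakehouse-bucket",  # placeholder bucket name
#     NotificationConfiguration=config,
# )
```

In this repository the same wiring is handled by the `s3_event_notification` Terraform module; the sketch just shows what that module configures.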
- Terraform >= 1.0
- Snowflake Account with appropriate permissions
- AWS Account with IAM permissions
- GitHub Repository with Actions enabled
Before using this action, run the following SQL script in Snowflake to create the utility infrastructure (only needs to be run once):
Step 1: Create Utility Infrastructure

```sql
-- =========================================================
-- Snowflake Utility Setup for DDL Migrations
-- =========================================================
-- This script creates:
--   1. A dedicated warehouse for CI/CD metadata operations
--   2. Utility database (UTIL_DB)
--   3. Utility schema (UTIL_SCHEMA)
--   4. DDL migration history table
--
-- Safe to re-run (idempotent)
-- =========================================================

-- -----------------------------------------------------------
-- 1. Create and use a dedicated warehouse
-- -----------------------------------------------------------
CREATE WAREHOUSE IF NOT EXISTS UTIL_WH
  WAREHOUSE_SIZE = 'XSMALL'
  WAREHOUSE_TYPE = 'STANDARD'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE
  COMMENT = 'Warehouse for CI/CD utility operations and DDL migration tracking';

USE WAREHOUSE UTIL_WH;

-- -----------------------------------------------------------
-- 2. Create utility database and schema
-- -----------------------------------------------------------
CREATE DATABASE IF NOT EXISTS UTIL_DB
  COMMENT = 'Utility database for CI/CD metadata and migration tracking';

CREATE SCHEMA IF NOT EXISTS UTIL_DB.UTIL_SCHEMA
  COMMENT = 'Utility schema for migration and operational tables';

-- -----------------------------------------------------------
-- 3. Create DDL migration history table
-- -----------------------------------------------------------
CREATE TABLE IF NOT EXISTS UTIL_DB.UTIL_SCHEMA.DDL_MIGRATION_HISTORY (
  script_name   STRING NOT NULL,
  script_path   STRING NOT NULL,
  checksum      STRING NOT NULL,
  applied_at    TIMESTAMP_LTZ NOT NULL DEFAULT CURRENT_TIMESTAMP(),
  status        STRING NOT NULL,
  error_message STRING,
  run_id        STRING,
  actor         STRING
) COMMENT = 'Tracks executed Snowflake DDL migration scripts for CI/CD pipelines';

-- -----------------------------------------------------------
-- 4. (Optional) Verify creation
-- -----------------------------------------------------------
SELECT
  'UTIL_DB.UTIL_SCHEMA.DDL_MIGRATION_HISTORY created successfully' AS status,
  CURRENT_TIMESTAMP() AS verified_at;
```

Step 2: Grant MANAGE GRANTS Privilege to SYSADMIN
SYSADMIN needs the MANAGE GRANTS privilege to grant permissions to other roles like PUBLIC. Run this as ACCOUNTADMIN:
```sql
USE ROLE ACCOUNTADMIN;

-- Grant MANAGE GRANTS privilege to SYSADMIN
-- This allows SYSADMIN to grant privileges on objects it owns
GRANT MANAGE GRANTS ON ACCOUNT TO ROLE SYSADMIN;

-- Verify the grant
SHOW GRANTS TO ROLE SYSADMIN;
```

Note: With this setup, SYSADMIN can both create objects and manage their permissions, simplifying the deployment process.
Note: If you want to use a different database/schema/table name, you can customize it using the migrations_table input parameter in the GitHub Actions workflow.
As a security best practice, create a dedicated service account for GitHub Actions instead of using your personal account.
On your local machine, generate an RSA key pair:
Option A: Without Passphrase (Recommended for CI/CD)
```bash
# Generate unencrypted PKCS8 private key (no passphrase)
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out snowflake_key.p8 -nocrypt

# Generate public key
openssl rsa -in snowflake_key.p8 -pubout -out snowflake_key.pub
```

Option B: With Passphrase (For enhanced security)

```bash
# Generate encrypted PKCS8 private key (with passphrase)
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out snowflake_key.p8 -v2 aes-256-cbc

# Generate public key
openssl rsa -in snowflake_key.p8 -pubout -out snowflake_key.pub
```

Extract the public key value (for both options):

```bash
# Remove header/footer and newlines for Snowflake
grep -v "BEGIN PUBLIC" snowflake_key.pub | grep -v "END PUBLIC" | tr -d '\n'
```

Save the output; you'll need it for the next step.
Note: If using a passphrase, you'll need to provide SNOWFLAKE_PRIVATE_KEY_PASSPHRASE as an additional secret.
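If you prefer to avoid the shell pipeline, the same extraction can be done in a few lines of Python (a sketch; the sample key below is truncated and fake):

```python
def pem_to_snowflake_value(pem_text: str) -> str:
    """Strip the BEGIN/END lines and newlines from a PEM public key,
    producing the bare base64 value Snowflake expects for RSA_PUBLIC_KEY."""
    lines = [
        line.strip()
        for line in pem_text.splitlines()
        if line.strip() and "PUBLIC KEY" not in line
    ]
    return "".join(lines)


# Truncated, fake key purely for illustration
example_pem = """-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A
MIIBCgKCAQEA7S3v
-----END PUBLIC KEY-----"""
print(pem_to_snowflake_value(example_pem))
```

Run it against the real `snowflake_key.pub` file contents to get the value for the `RSA_PUBLIC_KEY` property in the next step.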
Run this SQL in Snowflake (replace YOUR_PUBLIC_KEY_HERE with the output from Step 1):
```sql
-- =========================================================
-- Create Service Account for GitHub Actions
-- =========================================================

-- Create dedicated service account
CREATE USER IF NOT EXISTS GH_ACTIONS_USER
  RSA_PUBLIC_KEY = 'YOUR_PUBLIC_KEY_HERE'
  DEFAULT_ROLE = SYSADMIN
  DEFAULT_WAREHOUSE = COMPUTE_WH
  MUST_CHANGE_PASSWORD = FALSE
  COMMENT = 'Service account for GitHub Actions CI/CD deployments';

-- Grant SYSADMIN role (for DDL and grant operations)
GRANT ROLE SYSADMIN TO USER GH_ACTIONS_USER;

-- Grant usage on warehouses
GRANT USAGE ON WAREHOUSE UTIL_WH TO ROLE SYSADMIN;
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE SYSADMIN;

-- Grant usage on the utility database
GRANT USAGE ON DATABASE UTIL_DB TO ROLE SYSADMIN;
GRANT USAGE ON SCHEMA UTIL_DB.UTIL_SCHEMA TO ROLE SYSADMIN;

-- Grant create privileges for the migration table
GRANT CREATE TABLE ON SCHEMA UTIL_DB.UTIL_SCHEMA TO ROLE SYSADMIN;

-- Grant all privileges on the migration table (if it already exists)
GRANT ALL PRIVILEGES ON TABLE UTIL_DB.UTIL_SCHEMA.DDL_MIGRATION_HISTORY TO ROLE SYSADMIN;

-- If the user needs to create the database/schema (first run)
GRANT CREATE DATABASE ON ACCOUNT TO ROLE SYSADMIN;

-- Verify the user's settings
DESC USER GH_ACTIONS_USER;

-- See what roles the user has
SHOW GRANTS TO USER GH_ACTIONS_USER;

-- See what the SYSADMIN role can do
SHOW GRANTS TO ROLE SYSADMIN;
```
Security Notes:
- ✅ Use the `SYSADMIN` role for all DDL and grant operations
- ✅ Grant the `MANAGE GRANTS` privilege to SYSADMIN for permission management
- ✅ Key-pair authentication is more secure than passwords
- ✅ Service accounts provide better audit trails
- ✅ Never commit private keys to the repository
Set up GitHub Actions authentication. Navigate to Settings → Secrets and variables → Actions.
| Variable Name | Description | Example |
|---|---|---|
| `SNOWFLAKE_ORGANIZATION_NAME` | Snowflake organization name | `AGXUOKJ` |
| `SNOWFLAKE_ACCOUNT_NAME` | Snowflake account name | `JKC15404` |
| `SNOWFLAKE_USER` | Service account username | `GH_ACTIONS_USER` |
| `SNOWFLAKE_ROLE` | Snowflake role for deployments | `SYSADMIN` |
| `AWS_REGION` | AWS region for resources | `us-east-1` |
| `TF_LINT_VER` | TFLint version (optional) | `v0.50.0` |

| Secret Name | Description |
|---|---|
| `SNOWFLAKE_PRIVATE_KEY` | Content of the `snowflake_key.p8` file (including the `-----BEGIN/END PRIVATE KEY-----` headers) |
| `TF_TOKEN_APP_TERRAFORM_IO` | Terraform Cloud API token for the remote backend |
| `AWS_OIDC_ROLE_ARN` | AWS IAM role ARN for OIDC authentication (e.g., `arn:aws:iam::123456789012:role/github-actions-role`) |
Snowflake Variables:
- Log into Snowflake
- Organization name: found in your account URL (`https://<org>-<account>.snowflakecomputing.com`)
- Account name: found in the same URL
- User/Role: created in the service account setup (Step 1)

Snowflake Private Key:
- Generated in Step 1 (`snowflake_key.p8`)
- Copy the entire file content, including headers

Terraform Cloud Token:
- Go to Terraform Cloud
- Navigate to User Settings → Tokens
- Create a new API token

AWS OIDC Role ARN:
- Set up OIDC in AWS (see AWS OIDC Setup)
- Copy the IAM role ARN
If you're running Terraform from GitHub Codespaces, you need to configure Codespaces secrets for authentication.
Quick setup:
Navigate to: Settings → Secrets and variables → Codespaces
Add these secrets:
Snowflake Authentication:
| Secret Name | Description |
|---|---|
| `TF_VAR_snowflake_organization_name` | Snowflake organization name |
| `TF_VAR_snowflake_account_name` | Snowflake account name |
| `TF_VAR_snowflake_user` | Snowflake username |
| `TF_VAR_snowflake_private_key` | Content of `snowflake_key.p8` |
| `TF_VAR_snowflake_role` | Set to `SYSADMIN` |

AWS Authentication:

| Secret Name | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | From AWS IAM |
| `AWS_SECRET_ACCESS_KEY` | From AWS IAM |
| `AWS_DEFAULT_REGION` | e.g., `us-east-1` |
Note: GitHub Actions secrets and Codespaces secrets are stored separately. You need to configure both, but you can use the same values.
For secure GitHub Actions authentication with AWS without long-lived credentials, set up OIDC (OpenID Connect). This eliminates the need to store AWS access keys in GitHub Secrets.
See detailed setup instructions: infra/aws/README.md
Benefits:
- ✅ No AWS access keys stored in GitHub Secrets
- ✅ Short-lived tokens that expire automatically
- ✅ Improved security posture
- ✅ Recommended by AWS and GitHub
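The linked README covers the full setup. As a rough illustration of what the resulting IAM trust policy looks like, the helper below builds the policy document for GitHub's OIDC provider (`token.actions.githubusercontent.com`); the account ID and repository are placeholders, and the exact conditions in this project's setup may differ:

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build an IAM trust policy allowing GitHub Actions runs from `repo`
    (on the given branch) to assume the role via OIDC, with no stored keys."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    # The OIDC identity provider registered in the AWS account
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/token.actions.githubusercontent.com"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Token must be issued for AWS STS...
                    "StringEquals": {
                        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
                    },
                    # ...and originate from this repository/branch
                    "StringLike": {
                        "token.actions.githubusercontent.com:sub": f"repo:{repo}:ref:refs/heads/{branch}"
                    },
                },
            }
        ],
    }


# Placeholder account and repository, for illustration only
print(json.dumps(github_oidc_trust_policy("123456789012", "my-org/my-repo"), indent=2))
```

Scoping the `sub` condition to a specific repository (and branch) is what prevents arbitrary GitHub workflows from assuming the role.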
Scripts are organized by execution order:
- 00_account: Resource monitors, network policies
- 01_security: Roles, users, grants
- 02_warehouses: Virtual warehouses
- 03_databases: Database creation
- 04_storage: Storage integrations and external stages
- 05_schemas: Tables, views, streams
- 06_pipes: Snowpipe for automated ingestion
- 07_tasks: Scheduled tasks
- 08_functions: User-defined functions
- 09_procedures: Stored procedures
The repository includes sample implementations:
- Warehouse: `COMPUTE_WH` (small, auto-suspend)
- Database: `RAW_DB` with sales, marketing, finance schemas
- Tables:
  - `customer_orders` - Order transactions
  - `customer_master` - Customer data
  - `product_catalog` - Product information
The deployment workflow (snowflake-deploy.yaml) automatically:
- Discovers all SQL files in the repository
- Deploys them in dependency order
- Runs files in parallel within each stage
- Uses the reusable action `subhamay-bhattacharyya-gha/snowflake-run-ddl-action`
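The dependency-ordered, stage-parallel scheme can be sketched in a few lines of Python: group files by the numeric directory prefix, run stages in ascending order, and parallelize within a stage. This is an illustration of the idea, not the action's actual implementation:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_into_stages(sql_paths):
    """Group SQL file paths by their numeric directory prefix.

    Stages run in ascending prefix order (00_, 01_, 02_, ...);
    files within a stage are independent and can run in parallel.
    """
    stages = defaultdict(list)
    for p in sql_paths:
        top = PurePosixPath(p).parts[0]   # e.g. "01_security"
        prefix = top.split("_", 1)[0]     # e.g. "01"
        stages[int(prefix)].append(p)
    return [sorted(stages[k]) for k in sorted(stages)]


# Hypothetical file list, relative to the DDL root
files = [
    "02_warehouses/compute_wh.sql",
    "01_security/roles.sql",
    "01_security/users.sql",
    "03_databases/raw_db.sql",
]
for stage in group_into_stages(files):
    print(stage)
```

With this input the two `01_security` files land in the same stage (parallelizable), followed by warehouses, then databases.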
Triggers:
- Push to `main` or `develop` branches (when `snowflake/**` files change)
- Pull requests to `main` or `develop`
- Manual workflow dispatch
By default, the action tracks which scripts have been applied using a migrations table. This enables:
- Idempotent execution: Scripts are only run once (based on path + checksum)
- Change detection: If a script's content changes, it will be re-run
- Audit trail: Complete history of what was applied, when, and by whom
The default table `UTIL_DB.UTIL_SCHEMA.DDL_MIGRATION_HISTORY` contains:
- `script_name` - Filename of the script
- `script_path` - Full path to the script
- `checksum` - SHA-256 hash of the script content
- `applied_at` - Timestamp when applied
- `status` - SUCCESS or FAILED
- `error_message` - Error details if failed
- `run_id` - GitHub Actions run ID
- `actor` - GitHub user who triggered the run
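The path + checksum check can be sketched as follows (hypothetical helper names; `applied` stands in for rows read from the history table, and the real action's logic may differ in detail):

```python
import hashlib


def script_checksum(content: str) -> str:
    """SHA-256 hash of a script's content, as stored in the checksum column."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_run(path: str, content: str, applied: dict) -> bool:
    """Return True if the script is new or its content changed since last success.

    `applied` maps script_path -> checksum of the last successful run.
    """
    return applied.get(path) != script_checksum(content)


history = {"03_databases/raw_db.sql": script_checksum("CREATE DATABASE IF NOT EXISTS RAW_DB;")}
print(needs_run("03_databases/raw_db.sql", "CREATE DATABASE IF NOT EXISTS RAW_DB;", history))           # unchanged: skip
print(needs_run("03_databases/raw_db.sql", "CREATE DATABASE IF NOT EXISTS RAW_DB; -- edited", history))  # changed: re-run
```

An unchanged script is skipped; editing a single character changes the checksum and triggers a re-run, which is why the DDL itself must be idempotent.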
Use baseline mode to mark existing scripts as applied without executing them. This is useful when:
- Adopting this action in an existing environment
- Scripts have already been manually applied
- You want to start tracking from a known state
To enable baseline mode in the workflow:
```yaml
- name: Deploy with baseline
  uses: subhamay-bhattacharyya-gha/snowflake-run-ddl-action@v1
  with:
    baseline: true
    # ... other parameters
```

To run scripts without tracking (not recommended for production):

```yaml
- name: Deploy without tracking
  uses: subhamay-bhattacharyya-gha/snowflake-run-ddl-action@v1
  with:
    track_migrations: false
    # ... other parameters
```

- Use `CREATE OR REPLACE` or `CREATE ... IF NOT EXISTS` for idempotency
- Add meaningful comments to all objects
- Number files for execution order (01_, 02_, etc.)
- Test in dev before promoting to staging/prod
- Never commit credentials or private keys
- Use service accounts for automation
- Implement least privilege access
- Rotate keys regularly
- Use remote state storage for Terraform
- Enable state locking
- Tag all resources consistently
- Use separate environments (dev/staging/prod)
This project uses Conventional Commits for automated changelog generation. Please format your commit messages as follows:
```
<type>: <description>

[optional body]
```
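The format can be checked mechanically. A minimal, illustrative validator (the regex is an assumption based on this README's conventions, not part of the project's tooling):

```python
import re

# Types from this project's commit conventions; scope "(...)" and "!"
# for breaking changes are optional.
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|chore|ci)"
    r"(\([a-z0-9_-]+\))?!?: .+"
)


def is_conventional(message: str) -> bool:
    """Check whether a commit message's subject line follows the convention."""
    return bool(COMMIT_RE.match(message.splitlines()[0]))


print(is_conventional("feat(snowflake): add file format support for Parquet"))  # True
print(is_conventional("updated stuff"))                                         # False
```

A check like this could run as a commit-msg hook or CI step, though this repository relies on git-cliff's parsing rather than any validator of its own.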
| Type | Description | Example |
|---|---|---|
| `feat` | New feature or functionality | `feat: add Azure storage integration support` |
| `fix` | Bug fix | `fix: correct IAM trust policy condition` |
| `docs` | Documentation changes | `docs: update README with setup instructions` |
| `style` | Code style changes (formatting, whitespace) | `style: fix indentation in main.tf` |
| `refactor` | Code refactoring without feature changes | `refactor: simplify locals.tf configuration` |
| `perf` | Performance improvements | `perf: optimize S3 bucket policy lookup` |
| `test` | Adding or updating tests | `test: add validation for warehouse config` |
| `chore` | Maintenance tasks, dependencies | `chore: update Terraform provider versions` |
| `ci` | CI/CD configuration changes | `ci: add changelog generation to workflow` |
```bash
# Feature
git commit -m "feat: add Snowpipe auto-ingest configuration"

# Bug fix
git commit -m "fix: resolve storage integration ARN reference"

# Documentation
git commit -m "docs: add commit message guidelines to README"

# With scope (optional)
git commit -m "feat(snowflake): add file format support for Parquet"

# With breaking change
git commit -m "feat!: change storage integration naming convention"
```

- Commits are automatically categorized in the changelog
- Release notes are generated from commit messages
- Makes it easier to understand project history
- Enables semantic versioning automation
- Create a feature branch from `main`
- Make your changes
- Test in dev environment
- Create a pull request with description
- Wait for approval and automated deployment
See CONTRIBUTING.md for detailed guidelines.
MIT License - See LICENSE for details.
For issues and questions:
- Open an issue in this repository
- Check existing documentation in the `docs/` folder
- Review Snowflake documentation
- Add data quality checks
- Implement dbt integration
- Add monitoring and alerting
- Create CI/CD for data pipelines
- Add Streamlit dashboards
- Implement dynamic tables