A Snowflake lakehouse implementation on AWS, provisioned with Infrastructure as Code (Terraform) and deployed automatically via GitHub Actions.
This repository tracks the build of an end-to-end Snowflake data engineering solution—from source data analysis and ingestion design to layered stage/raw/curated modeling, automation with DAG + GitHub Actions, dynamic tables, and Streamlit dashboards—using Snowpark Python and marketplace datasets.
The project demonstrates a complete data lakehouse implementation with:
- Infrastructure as Code: Terraform configurations for AWS (S3, IAM) and Snowflake resources
- Layered Data Architecture: Stage → Raw → Curated data modeling pattern
- Automated Ingestion: Snowpipe for real-time data loading from S3
- Data Transformation: Snowpark Python for ETL/ELT processing
- Orchestration: DAG-based workflows with GitHub Actions CI/CD
- Dynamic Tables: Incremental data processing with automatic refresh
- Visualization: Streamlit dashboards for data exploration
- Marketplace Integration: Leveraging Snowflake marketplace datasets
```
.
├── infra/                          # Infrastructure as Code (Terraform)
│   ├── platform/tf/                # Root orchestration module (entry point)
│   │   ├── main.tf                 # Orchestrates AWS + Snowflake modules
│   │   ├── locals.tf               # Configuration parsing from JSON
│   │   ├── variables.tf            # Input variables
│   │   ├── outputs.tf              # Module outputs
│   │   ├── versions.tf             # Provider version constraints
│   │   ├── backend.tf              # Terraform Cloud backend
│   │   ├── providers-aws.tf        # AWS provider configuration
│   │   └── providers-snowflake.tf  # Snowflake provider configuration
│   ├── aws/tf/                     # AWS child module
│   │   ├── main.tf                 # S3 bucket + IAM role orchestration
│   │   ├── modules/                # Nested modules (s3, iam, iam_role_final, s3_event_notification)
│   │   └── templates/              # Bucket policy templates
│   └── snowflake/tf/               # Snowflake child module
│       ├── main.tf                 # Warehouses, databases, stages, pipes
│       └── modules/                # Nested modules (warehouse, database, stage, etc.)
├── input-jsons/                    # Configuration files
│   ├── aws/config.json             # AWS resource configuration
│   └── snowflake/config.json       # Snowflake resource configuration
├── snowflake-ddl/                  # Snowflake DDL scripts
│   ├── 00_account/                 # Account-level objects (resource monitors, network policies)
│   ├── 01_security/                # Roles, users, grants
│   ├── 02_warehouses/              # Virtual warehouses
│   ├── 03_databases/               # Database definitions
│   ├── 04_storage/                 # Storage integrations & external stages
│   ├── 05_schemas/                 # Schema-level objects (tables, views)
│   ├── 06_pipes/                   # Snowpipe definitions
│   ├── 07_tasks/                   # Task definitions
│   ├── 08_functions/               # UDFs and UDTFs
│   ├── 09_procedures/              # Stored procedures
│   ├── environments/               # Environment configs (dev/staging/prod)
│   └── scripts/                    # Utility scripts (deploy, rollback, validate)
├── .github/
│   └── workflows/                  # GitHub Actions workflows (ci.yaml, etc.)
├── .devcontainer/                  # Dev container configuration
├── cliff.toml                      # git-cliff changelog configuration
└── utils/                          # Utility scripts
```
This project uses a 4-phase architecture:

1. AWS foundation (Terraform): S3 bucket for data storage; IAM role with a placeholder trust policy
2. Snowflake core (Terraform): warehouses, databases, schemas; file formats and the storage integration
3. Ingestion objects: external stages, tables, and Snowpipes
4. Post-deployment wiring: update the IAM role trust policy with Snowflake's IAM User ARN and External ID; configure S3 bucket notifications to trigger Snowpipe auto-ingest
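The Phase 4 notification wiring can be sketched with boto3's S3 API. The helper below only builds the notification configuration dict; the queue ARN, bucket name, and prefix are placeholders, not values from this repository (in practice the ARN comes from the pipe's `notification_channel` column in `SHOW PIPES`):

```python
def build_snowpipe_notification(queue_arn: str, prefix: str = "") -> dict:
    """Build the S3 event notification config that triggers Snowpipe auto-ingest.

    Snowpipe auto-ingest listens on an SQS queue; S3 must publish
    ObjectCreated events to that queue for new files to be loaded.
    """
    config = {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }
    if prefix:
        # Optionally restrict notifications to a key prefix (e.g. "raw/")
        config["QueueConfigurations"][0]["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
        }
    return config


config = build_snowpipe_notification(
    "arn:aws:sqs:us-east-1:123456789012:sf-snowpipe-example",  # placeholder ARN
    prefix="raw/",
)

# The real call (needs AWS credentials, so it is left disabled here):
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="my-lakehouse-bucket",  # placeholder bucket name
#     NotificationConfiguration=config,
# )
```

In this repository the same wiring is handled by the `s3_event_notification` Terraform module; the sketch just shows what that module configures.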
- Terraform >= 1.0
- Snowflake Account with appropriate permissions
- AWS Account with IAM permissions
- GitHub Repository with Actions enabled
Before using this action, run the following SQL script in Snowflake to create the utility infrastructure (only needs to be run once):
Step 1: Create Utility Infrastructure

```sql
-- =========================================================
-- Snowflake Utility Setup for DDL Migrations
-- =========================================================
-- This script creates:
--   1. A dedicated warehouse for CI/CD metadata operations
--   2. Utility database (UTIL_DB)
--   3. Utility schema (UTIL_SCHEMA)
--   4. DDL migration history table
--
-- Safe to re-run (idempotent)
-- =========================================================

-- -----------------------------------------------------------
-- 1. Create and use a dedicated warehouse
-- -----------------------------------------------------------
CREATE WAREHOUSE IF NOT EXISTS UTIL_WH
  WAREHOUSE_SIZE = 'XSMALL'
  WAREHOUSE_TYPE = 'STANDARD'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE
  COMMENT = 'Warehouse for CI/CD utility operations and DDL migration tracking';

USE WAREHOUSE UTIL_WH;

-- -----------------------------------------------------------
-- 2. Create utility database and schema
-- -----------------------------------------------------------
CREATE DATABASE IF NOT EXISTS UTIL_DB
  COMMENT = 'Utility database for CI/CD metadata and migration tracking';

CREATE SCHEMA IF NOT EXISTS UTIL_DB.UTIL_SCHEMA
  COMMENT = 'Utility schema for migration and operational tables';

-- -----------------------------------------------------------
-- 3. Create DDL migration history table
-- -----------------------------------------------------------
CREATE TABLE IF NOT EXISTS UTIL_DB.UTIL_SCHEMA.DDL_MIGRATION_HISTORY (
  script_name   STRING NOT NULL,
  script_path   STRING NOT NULL,
  checksum      STRING NOT NULL,
  applied_at    TIMESTAMP_LTZ NOT NULL DEFAULT CURRENT_TIMESTAMP(),
  status        STRING NOT NULL,
  error_message STRING,
  run_id        STRING,
  actor         STRING
) COMMENT = 'Tracks executed Snowflake DDL migration scripts for CI/CD pipelines';

-- -----------------------------------------------------------
-- 4. (Optional) Verify creation
-- -----------------------------------------------------------
SELECT
  'UTIL_DB.UTIL_SCHEMA.DDL_MIGRATION_HISTORY created successfully' AS status,
  CURRENT_TIMESTAMP() AS verified_at;
```

Step 2: Grant MANAGE GRANTS Privilege to SYSADMIN
SYSADMIN needs the MANAGE GRANTS privilege to grant permissions to other roles like PUBLIC. Run this as ACCOUNTADMIN:
```sql
USE ROLE ACCOUNTADMIN;

-- Grant MANAGE GRANTS privilege to SYSADMIN
-- This allows SYSADMIN to grant privileges on objects it owns
GRANT MANAGE GRANTS ON ACCOUNT TO ROLE SYSADMIN;

-- Verify the grant
SHOW GRANTS TO ROLE SYSADMIN;
```

Note: With this setup, SYSADMIN can both create objects and manage their permissions, simplifying the deployment process.
Note: If you want to use a different database/schema/table name, you can customize it using the migrations_table input parameter in the GitHub Actions workflow.
As a security best practice, create a dedicated service account for GitHub Actions instead of using your personal account.
On your local machine, generate an RSA key pair:
Option A: Without Passphrase (Recommended for CI/CD)
```bash
# Generate unencrypted PKCS8 private key (no passphrase)
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out snowflake_key.p8 -nocrypt

# Generate public key
openssl rsa -in snowflake_key.p8 -pubout -out snowflake_key.pub
```

Option B: With Passphrase (For enhanced security)

```bash
# Generate encrypted PKCS8 private key (with passphrase)
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out snowflake_key.p8 -v2 aes-256-cbc

# Generate public key
openssl rsa -in snowflake_key.p8 -pubout -out snowflake_key.pub
```

Extract the public key value (for both options):

```bash
# Remove header/footer and newlines for Snowflake
grep -v "BEGIN PUBLIC" snowflake_key.pub | grep -v "END PUBLIC" | tr -d '\n'
```

Save the output; you'll need it for the next step.
Note: If using a passphrase, you'll need to provide SNOWFLAKE_PRIVATE_KEY_PASSPHRASE as an additional secret.
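If you prefer to avoid the shell pipeline, the same extraction can be done in a few lines of Python (a sketch; the sample key below is truncated and fake):

```python
def pem_to_snowflake_value(pem_text: str) -> str:
    """Strip the BEGIN/END lines and newlines from a PEM public key,
    producing the bare base64 value Snowflake expects for RSA_PUBLIC_KEY."""
    lines = [
        line.strip()
        for line in pem_text.splitlines()
        if line.strip() and "PUBLIC KEY" not in line
    ]
    return "".join(lines)


# Truncated, fake key purely for illustration
example_pem = """-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A
MIIBCgKCAQEA7S3v
-----END PUBLIC KEY-----"""
print(pem_to_snowflake_value(example_pem))
```

Run it against the real `snowflake_key.pub` file contents to get the value for the `RSA_PUBLIC_KEY` property in the next step.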
Run this SQL in Snowflake (replace YOUR_PUBLIC_KEY_HERE with the output from Step 1):
```sql
-- =========================================================
-- Create Service Account for GitHub Actions
-- =========================================================

-- Create dedicated service account
CREATE USER IF NOT EXISTS GH_ACTIONS_USER
  RSA_PUBLIC_KEY = 'YOUR_PUBLIC_KEY_HERE'
  DEFAULT_ROLE = SYSADMIN
  DEFAULT_WAREHOUSE = COMPUTE_WH
  MUST_CHANGE_PASSWORD = FALSE
  COMMENT = 'Service account for GitHub Actions CI/CD deployments';

-- Grant SYSADMIN role (for DDL and grant operations)
GRANT ROLE SYSADMIN TO USER GH_ACTIONS_USER;

-- Grant usage on warehouses
GRANT USAGE ON WAREHOUSE UTIL_WH TO ROLE SYSADMIN;
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE SYSADMIN;

-- Grant usage on the utility database
GRANT USAGE ON DATABASE UTIL_DB TO ROLE SYSADMIN;
GRANT USAGE ON SCHEMA UTIL_DB.UTIL_SCHEMA TO ROLE SYSADMIN;

-- Grant create privileges for the migration table
GRANT CREATE TABLE ON SCHEMA UTIL_DB.UTIL_SCHEMA TO ROLE SYSADMIN;

-- Grant all privileges on the migration table (if it already exists)
GRANT ALL PRIVILEGES ON TABLE UTIL_DB.UTIL_SCHEMA.DDL_MIGRATION_HISTORY TO ROLE SYSADMIN;

-- If the user needs to create the database/schema (first run)
GRANT CREATE DATABASE ON ACCOUNT TO ROLE SYSADMIN;

-- Verify the user's settings
DESC USER GH_ACTIONS_USER;

-- See what roles the user has
SHOW GRANTS TO USER GH_ACTIONS_USER;

-- See what the SYSADMIN role can do
SHOW GRANTS TO ROLE SYSADMIN;
```
Security Notes:
- ✅ Use the `SYSADMIN` role for all DDL and grant operations
- ✅ Grant the `MANAGE GRANTS` privilege to SYSADMIN for permission management
- ✅ Key-pair authentication is more secure than passwords
- ✅ Service accounts provide better audit trails
- ✅ Never commit private keys to the repository
Set up GitHub Actions authentication. Navigate to Settings → Secrets and variables → Actions.
| Variable Name | Description | Example |
|---|---|---|
| `SNOWFLAKE_ORGANIZATION_NAME` | Snowflake organization name | `AGXUOKJ` |
| `SNOWFLAKE_ACCOUNT_NAME` | Snowflake account name | `JKC15404` |
| `SNOWFLAKE_USER` | Service account username | `GH_ACTIONS_USER` |
| `SNOWFLAKE_ROLE` | Snowflake role for deployments | `SYSADMIN` |
| `AWS_REGION` | AWS region for resources | `us-east-1` |
| `TF_LINT_VER` | TFLint version (optional) | `v0.50.0` |

| Secret Name | Description |
|---|---|
| `SNOWFLAKE_PRIVATE_KEY` | Content of the `snowflake_key.p8` file (including the `-----BEGIN/END PRIVATE KEY-----` headers) |
| `TF_TOKEN_APP_TERRAFORM_IO` | Terraform Cloud API token for the remote backend |
| `AWS_OIDC_ROLE_ARN` | AWS IAM role ARN for OIDC authentication (e.g., `arn:aws:iam::123456789012:role/github-actions-role`) |
Snowflake Variables:
- Log into Snowflake
- Organization name: found in your account URL (`https://<org>-<account>.snowflakecomputing.com`)
- Account name: found in the same URL
- User/Role: created in the service account setup (Step 1)

Snowflake Private Key:
- Generated in Step 1 (`snowflake_key.p8`)
- Copy the entire file content, including headers

Terraform Cloud Token:
- Go to Terraform Cloud
- Navigate to User Settings → Tokens
- Create a new API token

AWS OIDC Role ARN:
- Set up OIDC in AWS (see AWS OIDC Setup)
- Copy the IAM role ARN
If you're running Terraform from GitHub Codespaces, you need to configure Codespaces secrets for authentication.
Quick setup:
Navigate to: Settings → Secrets and variables → Codespaces
Add these secrets:
Snowflake Authentication:
| Secret Name | Description |
|---|---|
| `TF_VAR_snowflake_organization_name` | Snowflake organization name |
| `TF_VAR_snowflake_account_name` | Snowflake account name |
| `TF_VAR_snowflake_user` | Snowflake username |
| `TF_VAR_snowflake_private_key` | Content of `snowflake_key.p8` |
| `TF_VAR_snowflake_role` | Set to `SYSADMIN` |

AWS Authentication:

| Secret Name | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | From AWS IAM |
| `AWS_SECRET_ACCESS_KEY` | From AWS IAM |
| `AWS_DEFAULT_REGION` | e.g., `us-east-1` |
Note: GitHub Actions secrets and Codespaces secrets are stored separately. You need to configure both, but you can use the same values.
For secure GitHub Actions authentication with AWS without long-lived credentials, set up OIDC (OpenID Connect). This eliminates the need to store AWS access keys in GitHub Secrets.
See detailed setup instructions: infra/aws/README.md
Benefits:
- ✅ No AWS access keys stored in GitHub Secrets
- ✅ Short-lived tokens that expire automatically
- ✅ Improved security posture
- ✅ Recommended by AWS and GitHub
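The linked README covers the full setup. As a rough illustration of what the resulting IAM trust policy looks like, the helper below builds the policy document for GitHub's OIDC provider (`token.actions.githubusercontent.com`); the account ID and repository are placeholders, and the exact conditions in this project's setup may differ:

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build an IAM trust policy allowing GitHub Actions runs from `repo`
    (on the given branch) to assume the role via OIDC, with no stored keys."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    # The OIDC identity provider registered in the AWS account
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/token.actions.githubusercontent.com"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Token must be issued for AWS STS...
                    "StringEquals": {
                        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
                    },
                    # ...and originate from this repository/branch
                    "StringLike": {
                        "token.actions.githubusercontent.com:sub": f"repo:{repo}:ref:refs/heads/{branch}"
                    },
                },
            }
        ],
    }


# Placeholder account and repository, for illustration only
print(json.dumps(github_oidc_trust_policy("123456789012", "my-org/my-repo"), indent=2))
```

Scoping the `sub` condition to a specific repository (and branch) is what prevents arbitrary GitHub workflows from assuming the role.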
Scripts are organized by execution order:
- 00_account: Resource monitors, network policies
- 01_security: Roles, users, grants
- 02_warehouses: Virtual warehouses
- 03_databases: Database creation
- 04_storage: Storage integrations and external stages
- 05_schemas: Tables, views, streams
- 06_pipes: Snowpipe for automated ingestion
- 07_tasks: Scheduled tasks
- 08_functions: User-defined functions
- 09_procedures: Stored procedures
The repository includes sample implementations:
- Warehouse: `COMPUTE_WH` (small, auto-suspend)
- Database: `RAW_DB` with sales, marketing, finance schemas
- Tables:
  - `customer_orders` - Order transactions
  - `customer_master` - Customer data
  - `product_catalog` - Product information
The deployment workflow (snowflake-deploy.yaml) automatically:
- Discovers all SQL files in the repository
- Deploys them in dependency order
- Runs files in parallel within each stage
- Uses the reusable action `subhamay-bhattacharyya-gha/snowflake-run-ddl-action`
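The dependency-ordered, stage-parallel scheme can be sketched in a few lines of Python: group files by the numeric directory prefix, run stages in ascending order, and parallelize within a stage. This is an illustration of the idea, not the action's actual implementation:

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_into_stages(sql_paths):
    """Group SQL file paths by their numeric directory prefix.

    Stages run in ascending prefix order (00_, 01_, 02_, ...);
    files within a stage are independent and can run in parallel.
    """
    stages = defaultdict(list)
    for p in sql_paths:
        top = PurePosixPath(p).parts[0]   # e.g. "01_security"
        prefix = top.split("_", 1)[0]     # e.g. "01"
        stages[int(prefix)].append(p)
    return [sorted(stages[k]) for k in sorted(stages)]


# Hypothetical file list, relative to the DDL root
files = [
    "02_warehouses/compute_wh.sql",
    "01_security/roles.sql",
    "01_security/users.sql",
    "03_databases/raw_db.sql",
]
for stage in group_into_stages(files):
    print(stage)
```

With this input the two `01_security` files land in the same stage (parallelizable), followed by warehouses, then databases.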
Triggers:
- Push to `main` or `develop` branches (when `snowflake/**` files change)
- Pull requests to `main` or `develop`
- Manual workflow dispatch
By default, the action tracks which scripts have been applied using a migrations table. This enables:
- Idempotent execution: Scripts are only run once (based on path + checksum)
- Change detection: If a script's content changes, it will be re-run
- Audit trail: Complete history of what was applied, when, and by whom
The default table `UTIL_DB.UTIL_SCHEMA.DDL_MIGRATION_HISTORY` contains:
- `script_name` - Filename of the script
- `script_path` - Full path to the script
- `checksum` - SHA-256 hash of the script content
- `applied_at` - Timestamp when applied
- `status` - SUCCESS or FAILED
- `error_message` - Error details if failed
- `run_id` - GitHub Actions run ID
- `actor` - GitHub user who triggered the run
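The path + checksum check can be sketched as follows (hypothetical helper names; `applied` stands in for rows read from the history table, and the real action's logic may differ in detail):

```python
import hashlib


def script_checksum(content: str) -> str:
    """SHA-256 hash of a script's content, as stored in the checksum column."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_run(path: str, content: str, applied: dict) -> bool:
    """Return True if the script is new or its content changed since last success.

    `applied` maps script_path -> checksum of the last successful run.
    """
    return applied.get(path) != script_checksum(content)


history = {"03_databases/raw_db.sql": script_checksum("CREATE DATABASE IF NOT EXISTS RAW_DB;")}
print(needs_run("03_databases/raw_db.sql", "CREATE DATABASE IF NOT EXISTS RAW_DB;", history))           # unchanged: skip
print(needs_run("03_databases/raw_db.sql", "CREATE DATABASE IF NOT EXISTS RAW_DB; -- edited", history))  # changed: re-run
```

An unchanged script is skipped; editing a single character changes the checksum and triggers a re-run, which is why the DDL itself must be idempotent.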
Use baseline mode to mark existing scripts as applied without executing them. This is useful when:
- Adopting this action in an existing environment
- Scripts have already been manually applied
- You want to start tracking from a known state
To enable baseline mode in the workflow:
```yaml
- name: Deploy with baseline
  uses: subhamay-bhattacharyya-gha/snowflake-run-ddl-action@v1
  with:
    baseline: true
    # ... other parameters
```

To run scripts without tracking (not recommended for production):

```yaml
- name: Deploy without tracking
  uses: subhamay-bhattacharyya-gha/snowflake-run-ddl-action@v1
  with:
    track_migrations: false
    # ... other parameters
```

- Use `CREATE OR REPLACE` or `CREATE ... IF NOT EXISTS` for idempotency
- Add meaningful comments to all objects
- Number files for execution order (01_, 02_, etc.)
- Test in dev before promoting to staging/prod
- Never commit credentials or private keys
- Use service accounts for automation
- Implement least privilege access
- Rotate keys regularly
- Use remote state storage for Terraform
- Enable state locking
- Tag all resources consistently
- Use separate environments (dev/staging/prod)
This project uses Conventional Commits for automated changelog generation. Please format your commit messages as follows:
```
<type>: <description>

[optional body]
```
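The format can be checked mechanically. A minimal, illustrative validator (the regex is an assumption based on this README's conventions, not part of the project's tooling):

```python
import re

# Types from this project's commit conventions; scope "(...)" and "!"
# for breaking changes are optional.
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|chore|ci)"
    r"(\([a-z0-9_-]+\))?!?: .+"
)


def is_conventional(message: str) -> bool:
    """Check whether a commit message's subject line follows the convention."""
    return bool(COMMIT_RE.match(message.splitlines()[0]))


print(is_conventional("feat(snowflake): add file format support for Parquet"))  # True
print(is_conventional("updated stuff"))                                         # False
```

A check like this could run as a commit-msg hook or CI step, though this repository relies on git-cliff's parsing rather than any validator of its own.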
| Type | Description | Example |
|---|---|---|
| `feat` | New feature or functionality | `feat: add Azure storage integration support` |
| `fix` | Bug fix | `fix: correct IAM trust policy condition` |
| `docs` | Documentation changes | `docs: update README with setup instructions` |
| `style` | Code style changes (formatting, whitespace) | `style: fix indentation in main.tf` |
| `refactor` | Code refactoring without feature changes | `refactor: simplify locals.tf configuration` |
| `perf` | Performance improvements | `perf: optimize S3 bucket policy lookup` |
| `test` | Adding or updating tests | `test: add validation for warehouse config` |
| `chore` | Maintenance tasks, dependencies | `chore: update Terraform provider versions` |
| `ci` | CI/CD configuration changes | `ci: add changelog generation to workflow` |
```bash
# Feature
git commit -m "feat: add Snowpipe auto-ingest configuration"

# Bug fix
git commit -m "fix: resolve storage integration ARN reference"

# Documentation
git commit -m "docs: add commit message guidelines to README"

# With scope (optional)
git commit -m "feat(snowflake): add file format support for Parquet"

# With breaking change
git commit -m "feat!: change storage integration naming convention"
```

- Commits are automatically categorized in the changelog
- Release notes are generated from commit messages
- Makes it easier to understand project history
- Enables semantic versioning automation
- Create a feature branch from `main`
- Make your changes
- Test in dev environment
- Create a pull request with description
- Wait for approval and automated deployment
See CONTRIBUTING.md for detailed guidelines.
MIT License - See LICENSE for details.
For issues and questions:
- Open an issue in this repository
- Check existing documentation in the `docs/` folder
- Review Snowflake documentation
- Add data quality checks
- Implement dbt integration
- Add monitoring and alerting
- Create CI/CD for data pipelines
- Add Streamlit dashboards
- Implement dynamic tables