databricks-solutions · jjaiwant328 · Mar 1, 2026
diff --git a/databricks-skills/README.md b/databricks-skills/README.md
@@ -52,6 +52,7 @@ cp -r ai-dev-kit/databricks-skills/databricks-agent-bricks .claude/skills/
 
 ### 📊 Analytics & Dashboards
 - **databricks-aibi-dashboards** - Databricks AI/BI dashboards (with SQL validation workflow)
+- **databricks-powerbi-migration** - Power BI to Databricks migration (metric views, DAX-to-SQL, ERD generation, schema mapping)
 - **databricks-unity-catalog** - System tables for lineage, audit, billing
 
 ### 🔧 Data Engineering

diff --git a/databricks-skills/databricks-powerbi-migration/1-input-scanning.md b/databricks-skills/databricks-powerbi-migration/1-input-scanning.md
@@ -0,0 +1,168 @@
+# Input Scanning & Model Parsing (Steps 1–2)
+
+Steps 1 and 2 of the migration workflow: classify all input files and parse Power BI models.
+
+---
+
+## Step 1: Scan, Classify, and Confirm All Inputs
+
+**Before doing anything else**, read every file in `input/`. Classify each file by content — not extension.
+
+```bash
+python scripts/scan_inputs.py input/ -o reference/input_manifest.json
+```
+
+The scanner assigns each file one of these types: `pbi_model`, `csv_schema_dump`, `mapping_json`, `dbx_schema`, `sql_ddl`, `sql_query_output`, `csv_data`, `sample_report`, `databricks_config`, `unknown`.
+
+**Present classification to the user and ask:**
+1. "I found these files. Here is what each appears to be: [list]. Is this correct?"
+2. "How should I use each file?"
+3. If no Databricks schema information is found, offer the schema suggestion queries (Step 5 in [2-catalog-resolution.md](2-catalog-resolution.md)).
+
+**Do not proceed until the user confirms.**
+
+### Input File Types
+
+| Type | Format | Description |
+|------|--------|-------------|
+| `pbi_model` | `.pbit`, `.pbix`, `.bim`, TMDL directory, or JSON | Exported semantic model — detected by content, not extension |
+| `csv_schema_dump` | CSV with `table_name`, `column_name`, `data_type` headers | Schema metadata exported from INFORMATION_SCHEMA |
+| `mapping_json` | JSON with `mappings` array | Column-level mappings (Scenario C or D) |
+| `dbx_schema` | JSON, SQL DDL, or query output | Schema information from Databricks |
+| `sample_report` | `.docx`, `.pdf`, `.png`, `.jpg`, `.xlsx`, `.pptx` | Sample report for KPI reverse-engineering |
+| `databricks_config` | YAML with `host`/`token` keys | Workspace URL, PAT, warehouse, catalog, schema |
+| `csv_data` | CSV | Headers can inform schema |
+
+### CSV Schema Dump Detection
+
+A CSV file is classified as `csv_schema_dump` when its header row contains columns matching these patterns (case-insensitive):
+
+- `table_name` / `tableName` / `TABLE_NAME`
+- `column_name` / `columnName` / `COLUMN_NAME`
+- `data_type` / `dataType` / `DATA_TYPE`
+
+At least `table_name` and `column_name` must be present. When a `csv_schema_dump` is detected:
+
+1. Parse the CSV to extract table names, column names, and data types.
+2. Build a schema representation equivalent to `extract_dbx_schema.py` output.
+3. Use this schema for comparison in Step 6.
+
+---
+
+## Step 2: Parse Power BI Models
+
+```bash
+python scripts/parse_pbi_model.py input/<file> -o reference/pbi_model.json
+# or batch mode:
+python scripts/parse_pbi_model.py input/ -o reference/pbi_model.json
+```
+
+The parser ignores the file extension and tries ZIP, JSON, and TMDL detection in sequence.
+
+**Content detection order:** ZIP archive → JSON structure → TMDL text → TMDL directory
+
+### How to Export Power BI Models
+
+**Option 1: PBIT file (recommended)**
+1. Open your report in Power BI Desktop.
+2. File > Export > Power BI Template (`.pbit`).
+3. Place in `input/`.
+
+**Option 2: PBIX file**
+The parser extracts the DataModelSchema from `.pbix` files directly. Place in `input/`.
+
+**Option 3: BIM file**
+1. Open the model in [Tabular Editor](https://tabulareditor.com/).
+2. File > Save As > `model.bim`.
+3. Place in `input/`.
+
+**Option 4: TMDL directory**
+1. Enable TMDL in Power BI Desktop (Options > Preview Features).
+2. File > Save As > TMDL format.
+3. Place the directory in `input/`.
+
+**Option 5: Manual description**
+Provide table names, column names with types, DAX measures, and relationships.
+
+---
+
+## After Parsing: Immediate Catalog Validation
+
+**Immediately after parsing (before proceeding to ERD or KPI steps)**, extract all data source references from the PBI model — server names, database/catalog names, and schema names found in `partitions[].source` (connection strings, M expressions, or `Sql.Database` calls).
+
+Cross-reference these against:
+1. Schema files provided in `input/` (DDL, CSV schema dump, JSON schema)
+2. Databricks config in `input/` (host/token/catalog)
+3. Live MCP access (test with `execute_sql`)
+
+**If a referenced catalog is inaccessible AND no schema dump was provided**, raise a warning immediately:
+
+> "The PBI model references data from `<catalog>.<schema>`, but I have no schema information and cannot access this catalog. Please provide one of:
+> 1. A schema dump (CSV, DDL, or JSON) in the `input/` folder
+> 2. Databricks credentials with access to this catalog
+> 3. Run this query and paste the output: `SELECT table_name, column_name, data_type FROM <catalog>.information_schema.columns WHERE table_schema = '<schema>'`"
+
+**Do not proceed past Step 5 without resolving all catalog gaps.** See [2-catalog-resolution.md](2-catalog-resolution.md) for the full catalog resolution workflow.
+
+### Extracting Data Sources from the Parsed Model
+
+Data source references are in:
+
+1. **Partition source expressions** (`partitions[].source.expression`):
+   Look for `Sql.Database("server", "database")` in M code.
+
+   ```
+   let Source = Sql.Database("myserver.database.windows.net", "my_catalog"),
+       gold = Source{[Schema="gold"]}[Data], ...
+   ```
+
+2. **Connection string annotations** (`model.annotations` or `model.dataSources`):
+   Some models store explicit connection strings with server, database, catalog, and schema.
+
+3. **Table source metadata** (`tables[].partitions[].source`):
+   For DirectQuery tables, the `source` object may contain `schema` and `entity` (table) names.
+
+```python
+import re
+
+def extract_data_sources(model: dict) -> list[dict]:
+    """Collect server/catalog/schema references from Sql.Database calls in partition M code."""
+    sources = []
+    for table in model.get("tables", []):
+        for partition in table.get("partitions", []):
+            expr = partition.get("source", {}).get("expression", "")
+            if isinstance(expr, list):  # some exports store multi-line M as a list of lines
+                expr = "\n".join(expr)
+            match = re.search(r'Sql\.Database\("([^"]+)",\s*"([^"]+)"\)', expr)
+            if not match:
+                continue
+            entry = {"server": match.group(1), "catalog": match.group(2), "table": table.get("name")}
+            # Attach the schema to this partition's entry only, never to an earlier table's.
+            schema_match = re.search(r'\[Schema="([^"]+)"\]', expr)
+            if schema_match:
+                entry["schema"] = schema_match.group(1)
+            sources.append(entry)
+    return sources
+```
+
+---
+
+## Project Structure Reference
+
+```
+project-root/
+├── input/                             # ALL user-provided files
+│   ├── model.pbix
+│   ├── mapping.json                   # Optional
+│   ├── schema_dump.sql                # Optional
+│   ├── schema_export.csv              # Optional
+│   ├── sample_report.pdf              # Optional
+│   └── databricks.yml                 # Optional
+├── reference/
+│   ├── input_manifest.json            # Output of scan_inputs.py
+│   └── pbi_model.json                 # Output of parse_pbi_model.py
+└── temp/                              # Working/throwaway files
+```
+
+Initialize with:
+
+```bash
+bash scripts/init_project.sh
+# With all folders:
+bash scripts/init_project.sh --all
+```
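
The header rule from "CSV Schema Dump Detection" in `1-input-scanning.md` can be sketched as follows. This is a minimal illustration, not the logic of `scripts/scan_inputs.py`, whose source is not part of this diff. The names `is_csv_schema_dump` and `normalize_header`, and the choice to compare headers after lower-casing and dropping underscores, are assumptions.

```python
import csv
import io

# Headers that must be present for a CSV to count as a schema dump (Step 1).
# Stored in normalized form: lower-case, no underscores.
REQUIRED_HEADERS = {"tablename", "columnname"}


def normalize_header(name: str) -> str:
    """Lower-case a header and drop underscores so that table_name,
    tableName, and TABLE_NAME all compare equal."""
    return name.strip().lower().replace("_", "")


def is_csv_schema_dump(csv_text: str) -> bool:
    """Return True when the first row contains table_name and column_name
    headers in any accepted spelling. data_type is not required."""
    reader = csv.reader(io.StringIO(csv_text))
    header_row = next(reader, [])  # empty input yields an empty header row
    headers = {normalize_header(h) for h in header_row}
    return REQUIRED_HEADERS <= headers
```

Requiring only the two headers mirrors the stated rule that at least `table_name` and `column_name` must be present, with `data_type` treated as optional.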
diff --git a/databricks-skills/databricks-powerbi-migration/2-catalog-resolution.md b/databricks-skills/databricks-powerbi-migration/2-catalog-resolution.md
@@ -0,0 +1,200 @@
+# Catalog Resolution & Schema Extraction (Steps 3–5)
+
+Steps 3, 4, and 5 of the migration workflow: validate catalog accessibility, resolve catalog names, and extract Databricks schema.
+
+---
+
+## Step 3: Validate Catalog Accessibility
+
+**Immediately after parsing**, extract all data source references from the PBI model and cross-reference against:
+1. Schema files provided in `input/` (DDL, CSV schema dump, JSON schema)
+2. Databricks config in `input/` (host/token/catalog)
+3. Live MCP access (test with `execute_sql`)
+
+**If MCP is available**, test accessibility for each referenced catalog:
+
+```sql
+SELECT 1 FROM <catalog>.information_schema.tables LIMIT 1;
+```
+
+Launch parallel subagents when multiple catalogs need testing — see [9-subagent-patterns.md](9-subagent-patterns.md).
+
+### Warning Message Template
+
+If a catalog is referenced but neither accessible nor covered by input files:
+
+```
+⚠ Missing catalog access: The PBI model references `<catalog>.<schema>` (used by tables: <table_list>),
+but I have no schema information and cannot access this catalog.
+
+Please provide one of:
+1. A schema dump (CSV, DDL, or JSON) for `<catalog>.<schema>` in the input/ folder
+2. Databricks credentials with access to this catalog
+3. Run this query and paste the output:
+   SELECT table_name, column_name, data_type, is_nullable, comment
+   FROM <catalog>.information_schema.columns
+   WHERE table_schema = '<schema>'
+   ORDER BY table_name, ordinal_position;
+```
+
+Update `reference/catalog_resolution.md` with an accessibility status section:
+
+```markdown
+## Catalog Accessibility Status
+
+| Catalog | Schema | Status | Source |
+|---------|--------|--------|--------|
+| my_catalog | gold | ✅ Accessible | Live MCP query |
+| other_catalog | silver | ✅ Covered | input/other_catalog_schema.csv |
+| missing_catalog | dbo | ❌ Inaccessible | No schema info — user action required |
+```
+
+**Do not proceed past Step 5 without resolving all catalog gaps.** ERD/domain generation (Step 8) can proceed with PBI-only data, but schema comparison and metric view creation require catalog access or schema dumps.
+
+---
+
+## Step 4: Resolve Catalog
+
+First, list all available catalogs:
+
+```sql
+SELECT catalog_name FROM system.information_schema.catalogs ORDER BY catalog_name;
+```
+
+Then probe schemas within the target catalog:
+
+```sql
+SELECT schema_name FROM <catalog>.information_schema.schemata;
+```
+
+Then verify table existence:
+
+```sql
+SELECT table_name
+FROM <catalog>.information_schema.tables
+WHERE table_schema = '<schema>';
+```
+
+**When multiple candidate catalogs exist** (e.g., `analytics`, `fc_analytics`), launch parallel subagents — one per catalog — to probe concurrently. See [9-subagent-patterns.md](9-subagent-patterns.md) Pattern A.
+
+### Handling fc_ Prefix
+
+Some environments prefix catalog names with `fc_`. The agent should:
+
+1. Try the catalog name as provided.
+2. If it is not found, try the name with the `fc_` prefix added.
+3. If it is still not found and the provided name already starts with `fc_`, try the name with the prefix removed.
+4. Document both the primary and the fallback catalog in `reference/catalog_resolution.md`.
+
+### Output: catalog_resolution.md
+
+```markdown
+## Catalog Resolution
+
+- **Primary catalog**: `my_catalog`
+- **Fallback catalog**: `fc_my_catalog` (if applicable)
+- **Target schema**: `gold`
+- **Tables found**: 15 (listed below)
+- **Tables missing**: 2 (listed below)
+
+### Table Inventory
+| Table | Catalog | Schema | Row Count (est.) |
+|-------|---------|--------|------------------|
+| sales_fact | my_catalog | gold | ~10M |
+| product_dim | fc_my_catalog | gold | ~50K |
+```
+
+---
+
+## Step 5: Extract or Ingest Databricks Schema
+
+If no schema was found in `input/`, suggest these queries:
+
+```sql
+-- Full column schema (recommended)
+SELECT table_name, column_name, data_type, is_nullable, comment
+FROM <catalog>.information_schema.columns
+WHERE table_schema = '<schema>'
+ORDER BY table_name, ordinal_position;
+
+-- Cross-schema comparison
+SELECT table_schema, table_name, column_name, data_type
+FROM <catalog>.information_schema.columns
+WHERE table_schema IN ('<schema_a>', '<schema_b>')
+ORDER BY table_schema, table_name, ordinal_position;
+
+-- Per-table detail
+DESCRIBE TABLE EXTENDED <catalog>.<schema>.<table_name>;
+```
+
+Tell the user: *"Paste output in chat, save to input/, or provide catalog.schema for programmatic extraction."*
+
+**Programmatic extraction** via MCP:
+
+```
+CallMcpTool:
+  server: "user-databricks"
+  toolName: "get_table_details"
+  arguments: {"catalog": "<catalog>", "schema": "<schema>", "table_stat_level": "SIMPLE"}
+```
+
+When extracting from **multiple catalogs or schemas**, launch parallel subagents — one per catalog/schema pair. See [9-subagent-patterns.md](9-subagent-patterns.md) Pattern B.
+
+Also accept DDL files and CSV schema dumps as schema sources.
+
+### CSV Schema Dump as Schema Source
+
+A CSV with INFORMATION_SCHEMA-style headers is treated as equivalent to `extract_dbx_schema.py` output:
+
+```csv
+table_name,column_name,data_type,is_nullable,comment
+sales_fact,sale_id,BIGINT,NO,Primary key
+sales_fact,total_amount,DECIMAL(18,2),YES,Order total
+customer_dim,customer_id,BIGINT,NO,Primary key
+```
+
+Detection criteria: the headers must contain at least `table_name` and `column_name`. `data_type` is strongly expected but not strictly required.
+
+---
+
+## Cross-Schema and Multi-Catalog Environments
+
+### Discover All Schemas in a Catalog
+
+```sql
+SELECT schema_name FROM <catalog>.information_schema.schemata;
+```
+
+### Discover All Catalogs
+
+```sql
+SELECT catalog_name FROM system.information_schema.catalogs;
+```
+
+### Cross-Schema Column Comparison
+
+```sql
+SELECT table_schema, table_name, column_name, data_type
+FROM <catalog>.information_schema.columns
+WHERE table_schema IN ('<schema_a>', '<schema_b>')
+ORDER BY table_schema, table_name, ordinal_position;
+```
+
+### When to Use Cross-Schema Probing
+
+- PBI model references tables from multiple schemas
+- Table names exist in multiple schemas (need to disambiguate)
+- Migration involves consolidating schemas
+
+---
+
+## Scripts
+
+### extract_dbx_schema.py
+
+```bash
+python scripts/extract_dbx_schema.py \
+  my_catalog my_schema -o reference/dbx_schema.json [--profile PROD]
+```
+
+**Dependencies:** `databricks-sdk`
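
The lookup order under "Handling fc_ Prefix" in `2-catalog-resolution.md` can be sketched as follows. This is a hedged illustration only: `catalog_candidates` and `resolve_catalog` are hypothetical helpers that do not appear in this diff, and the `available` set is assumed to hold the names returned by the `system.information_schema.catalogs` query from Step 4, compared exactly as given.

```python
from typing import Optional

FC_PREFIX = "fc_"


def catalog_candidates(name: str) -> list[str]:
    """Return catalog names to probe, in order: the name as provided,
    then with the fc_ prefix, then without it."""
    bare = name[len(FC_PREFIX):] if name.startswith(FC_PREFIX) else name
    ordered = [name, FC_PREFIX + bare, bare]
    # Drop duplicates while keeping the first occurrence of each name.
    return list(dict.fromkeys(ordered))


def resolve_catalog(name: str, available: set[str]) -> Optional[str]:
    """Return the first candidate present in the available catalogs, or None."""
    for candidate in catalog_candidates(name):
        if candidate in available:
            return candidate
    return None
```

For example, `resolve_catalog("analytics", {"fc_analytics", "main"})` falls back to the prefixed name, and a `None` result corresponds to the inaccessible case that requires user action in Step 3.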