Feature/improve data gen by dustinvannoy-db · Pull Request #184 · databricks-solutions/ai-dev-kit

dustinvannoy-db · 2026-02-25T19:52:44Z

No description provided.

The serverless() method requires databricks-connect 15.1.0+, but version 17.x only supports Python 3.12. Updated documentation to specify:
- Python 3.10/3.11: use >=15.1,<16.2
- Python 3.12: use >=16.2

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
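As a rough illustration of the pinning described in this commit note, the mapping from Python version to requirement string could be written as a small helper. The function name `connect_requirement` is hypothetical; only the version ranges come from the note, and later commits in this PR tighten the requirement further.

```python
import sys


def connect_requirement(py_version=None):
    """Return a databricks-connect requirement string for a Python version.

    Hypothetical helper: the ranges mirror the commit note, nothing more.
    """
    major, minor = (py_version or sys.version_info)[:2]
    if (major, minor) in ((3, 10), (3, 11)):
        # Python 3.10/3.11: stay below 16.2
        return "databricks-connect>=15.1,<16.2"
    if (major, minor) == (3, 12):
        # Python 3.12: 16.2 or newer
        return "databricks-connect>=16.2"
    raise ValueError(f"No documented range for Python {major}.{minor}")


if __name__ == "__main__":
    try:
        print(connect_requirement())
    except ValueError as exc:
        print(exc)
```

The returned string is in the form that `pip install` or `uv add` accepts as a requirement.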

…alog management
- Strongly recommend Spark + Faker for all data generation (default approach)
- Only use Polars for <10K rows if user explicitly prefers local generation
- Add volume upload instructions using databricks fs commands
- Remove CREATE CATALOG statements - assume catalogs already exist
- Update decision guides and examples to reflect Spark-first approach
- Consolidate and simplify execution options and installation instructions
- Update best practices and common issues sections

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Rename databricks-synthetic-data-generation to databricks-synthetic-data-gen across all install scripts, documentation, and cross-references to match the actual skill directory name
- Add missing skills (databricks-iceberg, databricks-parsing) to install.sh and install.ps1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Bugs:
- Remove .cache()/.unpersist() in generate_synthetic_data.py (serverless incompatible)
- Fix .gitignore formatting (restore blank line separator)

Design:
- Refactor ground_truth.yaml to use external response files (1127 → 347 lines)
- Change timeout from 480s to 240s with explanatory comment
- Add Windows timeout warning in mlflow_eval.py

Nits:
- Fix hardcoded catalog name (dustin_vannoy_catalog → my_catalog)
- Fix DatabricksEnv import path (databricks.connect.session → databricks.connect)
- Add EOF newline to 1-setup-and-execution.md
- Remove unused imports in evaluate.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove new_cluster section and use environment_key at task level for cleaner serverless job definition. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
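A minimal sketch of what such a serverless job definition might look like, built as a Python dict. The helper name, task key, environment key, file path, and dependency list are all hypothetical, and the field names reflect my reading of the Databricks Jobs API rather than the PR's actual test response, so check them against the current API reference.

```python
import json


def serverless_python_job(name, python_file, dependencies):
    """Build a job payload with no new_cluster block (hypothetical helper)."""
    env_key = "default"  # assumed key; any string works if both places match
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "generate_data",
                "spark_python_task": {"python_file": python_file},
                # The task references a job-level environment
                # instead of declaring its own cluster.
                "environment_key": env_key,
            }
        ],
        "environments": [
            {
                "environment_key": env_key,
                # "client" is an assumption about the environment spec format.
                "spec": {"client": "1", "dependencies": list(dependencies)},
            }
        ],
    }


if __name__ == "__main__":
    job = serverless_python_job(
        "synthetic-data-gen",
        "/Workspace/Users/someone@example.com/generate_synthetic_data.py",
        ["faker"],
    )
    print(json.dumps(job, indent=2))
```

Keeping the dependencies in one job-level environment means every task that shares the key gets the same libraries without repeating cluster settings.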

DatabricksEnv requires databricks-connect>=16.4 which requires Python 3.12+. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Expand version constraint from >=16.4,<17.0 to >=16.4,<17.4 to support databricks-connect 17.x versions
- Fix get_databricks_connect_version() to use importlib.metadata.version() instead of non-existent databricks.connect.__version__ attribute

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
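The version lookup described in this commit note could be sketched roughly as follows. Only the function name `get_databricks_connect_version`, the use of `importlib.metadata.version()`, and the `>=16.4,<17.4` range come from the note; the function body, the `release_tuple` and `is_supported` helpers, and the fallback to `None` are assumptions for illustration.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def get_databricks_connect_version():
    """Return the installed databricks-connect version, or None if absent."""
    try:
        # Reads package metadata, so no __version__ attribute is needed.
        return version("databricks-connect")
    except PackageNotFoundError:
        return None


def release_tuple(version_string):
    """Extract (major, minor) as integers, padding missing parts with 0."""
    numbers = [int(n) for n in re.findall(r"\d+", version_string)[:2]]
    numbers += [0] * (2 - len(numbers))
    return tuple(numbers)


def is_supported(version_string):
    """Check the >=16.4,<17.4 range from the commit note (hypothetical helper)."""
    return (16, 4) <= release_tuple(version_string) < (17, 4)


if __name__ == "__main__":
    installed = get_databricks_connect_version()
    if installed is None:
        print("databricks-connect is not installed")
    else:
        print(installed, "supported" if is_supported(installed) else "unsupported")
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "16.10" as lower than "16.4".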

calreynolds

🔥

malcolndandaro

Looks great, did some deployments with multiple parallel agents and everything worked well.

dustinvannoy-db and others added 14 commits February 15, 2026 18:09

Rewrite synthetic-data-generation for improved performance and features

3738572

Merge branch 'main' into feature/improve_data_gen

58f92d8

Cleanup data gen skill

fccf575

Add stronger guidance to use Databricks Connect

eb82b21

Update data gen for different run modes

c9ec683

Small updates to databricks-connect and environments

728e454

Updates to improve serverless dbconnect and polars local for data gen

3f2c9e0

Add guidance on cache with serverless

c15572f

Update data gen for better cluster/job guidance

bdb3ab6

Update classic library install

0b9c9b3

Suggest uv and improve python task job payload

d177f62

Merge branch 'main' into feature/improve_data_gen

0e61f04

dustinvannoy-db linked an issue Feb 25, 2026 that may be closed by this pull request

Skill Testing: Data Gen #105

Closed

dustinvannoy-db and others added 7 commits February 26, 2026 16:19

Add new data gen tests (first 3)

84ae64f

Update data gen ground_truth and baseline

ded1cf2

Remove default catalog setting

c680269

Add window syntax common issue

09a9cd8

Rename and overhaul data gen skill and tests timeouts

c7e335a

Merge branch 'main' into feature/improve_data_gen

d7b3c07

calreynolds self-requested a review March 3, 2026 19:26

dustinvannoy-db and others added 5 commits March 3, 2026 13:43

Simplify serverless job config in test response

d1a8660

Add Python 3.12+ requirement to run instructions

9c74e61

Remove commented out lines from manifest.yaml

aa4d8c9

dustinvannoy-db marked this pull request as ready for review March 3, 2026 23:06

calreynolds previously approved these changes Mar 3, 2026

Reduce guidelines for faster tests with mlflow

8265a9b

dustinvannoy-db dismissed calreynolds’s stale review via 8265a9b March 3, 2026 23:50

malcolndandaro reviewed Mar 4, 2026

calreynolds approved these changes Mar 4, 2026

calreynolds merged commit 0228fe8 into main Mar 4, 2026
2 checks passed

calreynolds merged 27 commits into main from feature/improve_data_gen

3 participants
