Feature/improve data gen #184

Merged
calreynolds merged 27 commits into main from feature/improve_data_gen on Mar 4, 2026
Conversation

@dustinvannoy-db
Collaborator

No description provided.

dustinvannoy-db and others added 14 commits February 15, 2026 18:09
The serverless() method requires databricks-connect 15.1.0+, but version
17.x only supports Python 3.12. Updated documentation to specify:
- Python 3.10/3.11: use >=15.1,<16.2
- Python 3.12: use >=16.2

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
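
The version split described in the commit message above could be expressed as a small helper along these lines. This is only a sketch: `connect_requirement` is a hypothetical name that does not come from the PR, and the ranges are copied from this commit message alone (a later commit in this PR tightens the range for `DatabricksEnv`).

```python
import sys


def connect_requirement(py_version=None):
    """Return a pip requirement string for databricks-connect.

    Hypothetical helper; the ranges mirror the commit message above and
    are not taken from the repository's actual documentation.
    """
    major, minor = (py_version or sys.version_info)[:2]
    if (major, minor) >= (3, 12):
        # Python 3.12: use >=16.2
        return "databricks-connect>=16.2"
    if (major, minor) in {(3, 10), (3, 11)}:
        # Python 3.10/3.11: use >=15.1,<16.2
        return "databricks-connect>=15.1,<16.2"
    raise ValueError(f"No range documented for Python {major}.{minor}")


if __name__ == "__main__":
    print(connect_requirement())
```

Passing an explicit tuple such as `(3, 11)` makes the helper easy to check without switching interpreters.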
…alog management

- Strongly recommend Spark + Faker for all data generation (default approach)
- Only use Polars for <10K rows if user explicitly prefers local generation
- Add volume upload instructions using databricks fs commands
- Remove CREATE CATALOG statements - assume catalogs already exist
- Update decision guides and examples to reflect Spark-first approach
- Consolidate and simplify execution options and installation instructions
- Update best practices and common issues sections

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
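
The Spark-first rule in the commit message above can be read as a two-condition check. A minimal sketch follows, assuming a hypothetical `choose_engine` function and illustrative return labels; neither name appears in the PR.

```python
POLARS_ROW_LIMIT = 10_000  # threshold quoted in the commit message


def choose_engine(row_count, prefers_local=False):
    """Pick a data generation approach (hypothetical helper).

    Spark + Faker is the default. Polars is chosen only when the user
    explicitly prefers local generation AND the dataset is under 10K rows.
    """
    if prefers_local and row_count < POLARS_ROW_LIMIT:
        return "polars"
    return "spark+faker"


if __name__ == "__main__":
    print(choose_engine(5_000))                      # default stays Spark
    print(choose_engine(5_000, prefers_local=True))  # small and local
```

Note that a small row count alone does not switch engines in this reading; the explicit user preference is required as well.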
@dustinvannoy-db linked an issue on Feb 25, 2026 that may be closed by this pull request
dustinvannoy-db and others added 7 commits February 26, 2026 16:19
- Rename databricks-synthetic-data-generation to databricks-synthetic-data-gen
  across all install scripts, documentation, and cross-references to match
  the actual skill directory name
- Add missing skills (databricks-iceberg, databricks-parsing) to install.sh
  and install.ps1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@calreynolds self-requested a review on March 3, 2026 19:26
dustinvannoy-db and others added 5 commits March 3, 2026 13:43
Bugs:
- Remove .cache()/.unpersist() in generate_synthetic_data.py (serverless incompatible)
- Fix .gitignore formatting (restore blank line separator)

Design:
- Refactor ground_truth.yaml to use external response files (1127 → 347 lines)
- Change timeout from 480s to 240s with explanatory comment
- Add Windows timeout warning in mlflow_eval.py

Nits:
- Fix hardcoded catalog name (dustin_vannoy_catalog → my_catalog)
- Fix DatabricksEnv import path (databricks.connect.session → databricks.connect)
- Add EOF newline to 1-setup-and-execution.md
- Remove unused imports in evaluate.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove new_cluster section and use environment_key at task level
for cleaner serverless job definition.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
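
For readers unfamiliar with the shape this commit describes, a serverless job payload with `environment_key` on the task and no `new_cluster` block might look roughly like the following. This is a sketch, not the PR's actual job definition: the job name, task key, file path, and dependency list are all placeholders, and the `environments` layout is an assumption based on the Databricks Jobs API that should be checked against the current API reference.

```python
import json

# Hypothetical serverless job settings. Every name and path here is a placeholder.
job_settings = {
    "name": "generate-synthetic-data",  # placeholder job name
    "tasks": [
        {
            "task_key": "generate",  # placeholder task key
            "spark_python_task": {
                # placeholder workspace path
                "python_file": "/Workspace/path/to/generate_synthetic_data.py",
            },
            # No "new_cluster" block: the task points at a job-level
            # environment instead.
            "environment_key": "default",
        }
    ],
    "environments": [
        {
            "environment_key": "default",
            # Assumed spec layout; the exact fields may differ by API version.
            "spec": {"client": "1", "dependencies": ["faker"]},
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(job_settings, indent=2))
```

The task's `environment_key` has to match one of the keys declared under `environments`, which is what ties the task to its dependencies.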
DatabricksEnv requires databricks-connect>=16.4 which requires Python 3.12+.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Expand version constraint from >=16.4,<17.0 to >=16.4,<17.4 to
  support databricks-connect 17.x versions
- Fix get_databricks_connect_version() to use importlib.metadata.version()
  instead of non-existent databricks.connect.__version__ attribute

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
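
The fix described above replaces an attribute lookup with a package metadata lookup. A minimal sketch of that idea follows, assuming the distribution is named `databricks-connect`; the function body, `release_tuple`, and `is_supported` are illustrative and are not copied from the PR, and pre-release suffixes are handled only crudely.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def get_databricks_connect_version():
    """Return the installed databricks-connect version, or None if absent."""
    try:
        return version("databricks-connect")
    except PackageNotFoundError:
        return None


def release_tuple(version_string):
    """Turn '16.4.0' into (16, 4, 0), keeping only leading digits per part."""
    parts = []
    for piece in version_string.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_supported(version_string, lower=(16, 4), upper=(17, 4)):
    """Check the >=16.4,<17.4 constraint quoted in the commit message."""
    return lower <= release_tuple(version_string) < upper


if __name__ == "__main__":
    installed = get_databricks_connect_version()
    if installed is None:
        print("databricks-connect is not installed")
    else:
        print(installed, "supported" if is_supported(installed) else "unsupported")
```

Reading the version from package metadata works even when the module itself exposes no `__version__` attribute, which is the failure the commit message reports.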
@dustinvannoy-db marked this pull request as ready for review on March 3, 2026 23:06
calreynolds previously approved these changes Mar 3, 2026
Collaborator

@calreynolds left a comment

🔥

Collaborator

@malcolndandaro left a comment

Looks great. I did some deployments with multiple parallel agents and everything worked well.

@calreynolds merged commit 0228fe8 into main on Mar 4, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Skill Testing: Data Gen

3 participants