All notable changes to this project will be documented in this file.
- allowing for a custom table name (takes priority over the class name)
- added options to filter dependencies and the target table based on column-value pairs
- target table can now selectively write based on secondary virtual partitions
- table reader can also filter based on given column-value pairs
- table reader optimization
- Separated the writer from the runner; the schema is sorted to align with the written table
- Added a debug run option that returns the dataframe without writing anything
- Feature type normalization to Double and Long
- Replaces 2.1.0 release
- Updated python version to 3.12 and pyspark to 4.0
- Migrated from poetry to UV
- Added merge_schema manual override option
- optimized the number of internal read operations when setting up PysparkFeatureLoader
- removed unnecessary count() call from logger function
- fixed module register not offloading dependencies when decorators are disabled
- PysparkFeatureLoader.feature_schema now also accepts a list of schemas
- the features can now be split into multiple schemas (names of groups must remain unique across all schemas)
- loader now also accepts storage table names interchangeably with feature group names
- this assumes that feature group names contain uppercase letters
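The interchangeable-name behaviour above can be sketched as a small normalisation helper. The function name and the CamelCase-to-snake_case rule here are illustrative assumptions, not Rialto's actual implementation:

```python
import re

def to_storage_table_name(identifier: str) -> str:
    # Illustrative only: an identifier containing uppercase letters is
    # treated as a feature group name and converted to a lowercase
    # snake_case storage table name; an all-lowercase identifier is
    # assumed to already be a storage table name and passes through.
    if identifier == identifier.lower():
        return identifier
    return re.sub(r"(?<!^)(?=[A-Z])", "_", identifier).lower()
```

Under this assumption, `CustomerFeatures` and `customer_features` resolve to the same storage table.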
- runner no longer checks the number of generated records before attempting to write them; instead, it retrieves the information from storage
- expanded config overrides logging
- added config validation
- added owner attribute to GroupMetadata class
- email reporting is now optional
- refactored BookKeeper methods
- bookkeeping table is no longer being overwritten, new records are appended instead
- @jobs can now request job_metadata
- job_metadata contains:
- name of the job (e.g. model predict)
- information about the job package: distribution_name and version (e.g. test-model, v 0.2.1)
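A minimal sketch of what such a metadata record could look like; the class and field names are hypothetical, chosen only to mirror the bullet points above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobMetadata:
    # Hypothetical shape of the job_metadata object described above;
    # field names are illustrative, not Rialto's actual API.
    job_name: str            # e.g. "model predict"
    distribution_name: str   # e.g. "test-model"
    version: str             # e.g. "0.2.1"
```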
- added a run-time timestamp to the record
- new bookkeeping functionality
- saves reporting information to a dataframe
- added bookkeeping config option to runner config
- compound keys now work properly in sequential features
- module register now resets on reload while testing
- runner config now accepts environment variables
- restructured runner config
- added metadata and feature loader sections
- target moved to pipeline
- dependency date_col is now mandatory
- custom extras config is available in each pipeline and will be passed as a dictionary available under pipeline_config.extras
- general section is renamed to runner
- info_date_shift is always a list
- transformation header changed
- added argument to skip dependency checking
- added overrides parameter to allow for dynamic overriding of config values
- removed date_from and date_to from arguments, use overrides instead
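One plausible shape for such overrides is dotted-path keys applied over the nested runner config. The sketch below assumes that convention; it is not Rialto's actual merge logic:

```python
def apply_overrides(config: dict, overrides: dict) -> dict:
    # Illustrative sketch: applies dotted-path overrides such as
    # {"runner.date_from": "2024-02-01"} onto a nested config dict,
    # leaving the original config untouched.
    result = {k: (v.copy() if isinstance(v, dict) else v) for k, v in config.items()}
    for path, value in overrides.items():
        node = result
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result
```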
- jobs are now the main way to create all pipelines
- config holder removed from jobs
- metadata_manager and feature_loader are now available arguments, depending on configuration
- added @config decorator, similar use case to @datasource, for parsing configuration
- reworked Resolver + Added ModuleRegister
- datasources are no longer registered just by importing, and thus are no longer available to all jobs
- register_dependency_callable and register_dependency_module added to register datasources
- together, it's now possible to have two datasources with the same name but different implementations for two jobs
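The per-job registration described above can be sketched as a registry keyed by job name; this is a minimal illustration, not Rialto's actual ModuleRegister:

```python
from collections import defaultdict

class ModuleRegister:
    """Minimal sketch of per-job dependency registration: each job gets
    its own namespace, so two jobs can register a datasource with the
    same name but different implementations."""

    def __init__(self):
        self._by_job = defaultdict(dict)

    def register_dependency_callable(self, job_name, fn):
        # Register a datasource function under the job's own namespace.
        self._by_job[job_name][fn.__name__] = fn

    def resolve(self, job_name, name):
        # Look up a datasource by name within the requesting job's scope.
        return self._by_job[job_name][name]
```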
- function signatures changed
- until -> date_until
- info_date_from -> date_from, info_date_to -> date_to
- date_column is now mandatory
- removed TableReader's ability to infer schema from partitions or properties
- removed DataLoader class, now only PysparkFeatureLoader is needed with additional parameters
- passing dependencies from runner to a Transformation
- optional dependency names in the config that could be recalled via dictionary to access paths and date columns
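The named-dependency lookup described above might look like the following; `DependencyInfo` and its fields are hypothetical, mirroring only the path and date-column attributes the entry mentions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DependencyInfo:
    # Hypothetical record; the changelog only states that named
    # dependencies expose their path and date column.
    path: str
    date_col: str

# Inside a transformation, dependencies could then be recalled by name:
dependencies = {"sales": DependencyInfo("catalog.schema.sales", "info_date")}
```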
- Rialto now adds rialto_date_column property to written tables
- signature of Transformation
- Allowed future dependencies