- OpenSearch Transform: Enables keyword and vector-based search capabilities using OpenSearch
- If the table includes an embeddings column, the transform sets up a k-NN vector index for similarity searches.
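A k-NN-enabled index in OpenSearch is created with a mapping along these lines (a minimal sketch; the field names and dimension here are illustrative, not necessarily the transform's actual defaults):

```json
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "embeddings": { "type": "knn_vector", "dimension": 384 }
    }
  }
}
```

Documents indexed under such a mapping support both keyword queries on the text field and approximate nearest-neighbor queries on the embeddings field.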
- Image Transform Modality: Introduced three new transforms for processing image data:
- Faces: Detects people and faces using a pre-trained face detection model.
- NSFW: Scores content for Not Safe For Work using Hugging Face image-classification pipeline.
- People: Counts faces and supports face blurring for privacy.
- Docling2Parquet: Updated options to extract binary image data (images/pages) into a dedicated column (image_bins) in Parquet output.
- Tekton deployment yamls: Introduced Kubernetes deployment YAMLs for Tekton, simplifying pipeline composition without relying on Kubeflow Pipelines (KFP) infrastructure.
- Input Handling: Expanded runtime input support to include ZIP, NDJSON, and JSON formats in addition to Parquet.
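Reading the non-Parquet formats reduces to standard-library parsing; the sketch below shows the NDJSON and ZIP cases (function names are illustrative, not DPK's actual runtime API):

```python
import io
import json
import zipfile


def read_ndjson(text):
    """Parse newline-delimited JSON (NDJSON) into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_zip_members(data):
    """Yield (name, bytes) for each member of a ZIP archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            yield name, zf.read(name)
```

Each ZIP member can then be routed to the parser matching its extension, while NDJSON and JSON inputs are converted directly to tables.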
- Logging System: Implemented a new JSON-based logging system that consolidates all DPK logs into a single logger.
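JSON-based logging of this kind can be built on the standard `logging` module with a custom formatter; the snippet below is a minimal sketch of the pattern (field names are illustrative, not DPK's actual log schema):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })


# Route all records through one logger with the JSON formatter attached.
logger = logging.getLogger("dpk")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
```

Because every record is one JSON object per line, the consolidated log is easy to filter and aggregate with standard tooling.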
- Granite Docling Integration: Enabled document parsing via docling2parquet, with options for VLM pipeline compatibility.
- PII Redactor: Added support for cryptographic redaction.
- GneissWeb: Enhanced multithreading and optimized model loading for better performance.
- Filter Transform: Added safeguard to check if filter_criteria is None, preventing crashes when criteria are unset or empty.
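The safeguard amounts to treating unset or empty criteria as a pass-through; a minimal sketch of the idea (not DPK's actual code, and criteria here are modeled as predicate callables):

```python
def apply_filter(rows, filter_criteria=None):
    """Return rows unchanged when criteria are unset or empty,
    instead of crashing on None."""
    if not filter_criteria:  # covers both None and []
        return list(rows)
    return [row for row in rows if all(pred(row) for pred in filter_criteria)]
```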
- Python Multiprocessing: Introduced multiprocessing job support and resolved boto pickling errors for transform runtimes.
- Improved logging to remove access and secret keys from config when present for legacy runs.
- Resolved issues with handling additional secrets when more than one secret is added to the config for KFP.
- Added support for binary transforms and binary data in chained operations (new examples and test coverage provided).
- Updated filter transform to return an empty table while preserving the original schema when filtering results in empty table
- Updated tokenization2arrow to correctly process lists of texts.
- Avoided using polars version 1.33 due to breaking changes.
- Removed lower bound constraint on boto3 dependency.
- Fixed a bug with Markdown (MD) files as input for docling2parquet
- Adjusted tagged dependencies to ensure notebooks work in the Google Colab environment
- Prepared the post1 release with patches
- Parsed metadata.json at the end of the run to flag exceptions, fixing cases where logs showed failure but KFP reported success
- Updated model_loader to utilize data_access_s3, enabling s3 I/O from different COS locations
- Removed a non-required torch dependency
- Updated validation for data_access_local, allowing empty input_folder and/or output_folder (defaults to current directory)
- Fixed bug with dividing by 0 in fine web quality annotator
- Patched the filter failing when the transform is used with a default/empty configuration
- Patched PII requirements for pydantic to allow testing with Prefect
- Restructured data-access package to allow adding user specific connectors as external packages (e.g. lakehouse connector)
- Removed credentials from transform/data-access arguments; they are now passed as environment variables
- Added runtime code-location environment variables to Dockerfiles to display real build information
- Added in-memory data access for caching reads/writes in DataAccessLocal, and in new DataAccessMemory class.
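An in-memory data access backed by a dict is enough for caching and tests; the class below is a hypothetical sketch of the idea, not the actual DataAccessMemory API:

```python
class DataAccessMemory:
    """Dict-backed data access for caching reads/writes in memory
    (hypothetical sketch; the real class's methods may differ)."""

    def __init__(self):
        self._store = {}

    def save_file(self, path, data):
        self._store[path] = bytes(data)

    def get_file(self, path):
        return self._store.get(path)

    def list_files(self, prefix=""):
        return sorted(p for p in self._store if p.startswith(prefix))
```

Because it implements the same read/write surface as file-based access, it can be swapped in wherever a transform only needs bytes in and bytes out.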
- Added file batch processing for data access
- Added transform chain module for running one or more transforms in sequence, with support for parallel micro-batch execution
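Chaining works by feeding each transform's output tables to the next transform in the sequence; a minimal sequential sketch (the real chain module adds parallel micro-batch execution, which is omitted here):

```python
def run_chain(transforms, tables):
    """Run a sequence of transforms, where each transform maps one
    input table to zero or more output tables."""
    for transform in transforms:
        next_tables = []
        for table in tables:
            next_tables.extend(transform(table))
        tables = next_tables
    return tables
```

A transform returning an empty list drops its input; returning several tables fans the data out for the rest of the chain.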
- Added fineweb_quality_annotator and gopher_repetition_annotator transforms
- Added model loader util for transforms utilizing models, enabling loading from COS, HuggingFace, and locally
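Supporting COS, Hugging Face, and local sources comes down to dispatching on the shape of the model path; an illustrative sketch (the function name and heuristics are assumptions, not DPK's actual loader API):

```python
import os


def resolve_model_source(model_path):
    """Pick a model source by path shape: a COS/S3 URI, a Hugging Face
    repo id, or a local path."""
    if model_path.startswith(("s3://", "cos://")):
        return "cos"
    # A slash-separated id that is not an existing local path is treated
    # as a Hugging Face Hub repo id.
    if "/" in model_path and not os.path.exists(model_path):
        return "huggingface"
    return "local"
```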
- Updated KFP workflows to stop setting the runtime code location and to set credentials via environment variables
- Moved to the Linux Foundation AI & Data organization and re-assigned roles for maintainers and contributors
- Added a logo, updated documentation, and enforced signed contributions
- Added support for Hugging Face credentials when running verification workflows
- Added new transforms and bug fixes
- Added Multi-lingual ML Filter and Enrichment (quality annotation) transforms
- Added Blocklist and Collapse (column concatenation) transforms
- Added support for comment-based semantic categories
- Ededup: added a parameter for an optional removed field
- Refactored all code transforms and published them as wheels on PyPI
- Refactored, re-organized and continued improving notebooks and recipes
- Updated tutorials and documentation
- Added GneissWeb Transforms, GneissWeb recipe and support for XML ingest in Docling
- Bug fixes for Windows Support, KFP workflow pipelines, and CI/CD workflow
- Updated RAG Notebooks for PDF and HTML
- New GneissWeb Notebook showcasing advanced data prep operations for improved model performance
- New Agentic Notebook showcasing integration with Langchain and Llama-index
- GneissWeb transforms: extreme tokenized, readability, gneissweb classification, Rep Removal, Tokenization2Arrow, Bloom
- Code Profiler: Added Support for CSharp
- Header Cleanser: Enhanced with multi-processing support
- Fuzzy Dedup: Support for Windows Folder names
- PDF to Parquet: Updated Docling to 2.25 for ingesting XML/JATS
- HAP: Assigns a 0 score for empty content
- Disabled fcntl on Windows
- Updated the super pipeline for KFP v2
- Refactored all language transforms and implemented simplified APIs for the refactored transforms
- Added notebook examples for each of the transforms
- Streamlined documentation and added tutorial for developers who want to build new transforms
- Other minor enhancements and bug fixes for transforms, workflow pipelines, and CI/CD makefiles
- Added new similarity transform (for detecting confidentiality, copyright, and/or plagiarism in documents)
- New algorithm for the Fuzzy Dedup transform
- Sample notebooks for some of the language transforms
- Integrated the Semantic Profiler and report generation into the Code Profiler transform
- Increased the Ray agent limit to 10,000 (default was 100)
- New Fuzzy Dedup algorithm for Python, Ray, and Spark
- Updated the RAG example to use a Granite model
- Updated transforms with Docling 2
- Added single package for dpk with extra for [spark] and [ray]
- Added single package for transforms with extra for [all] or [individual-transform-name]
- Fixed metadata logging even when actors crash
- Added a multilock for Ray worker downloads/cleanup
- Multiple updates to the Spark runtime
- Added support for Python 3.12
- Refactored data access code
- Modified superpipeline parameter types (str/JSON)
- Set the KubeRay API server version
- Added a super pipeline for code transforms
- Enhanced docling2parquet with Docling 2 support for extracting HTML, DOCX, etc.
- Added web2parquet transform
- Added HAP transform
- Enhanced parameters/configuration allow users to customize crawler settings
- Implemented a subdomain-focus feature in data-prep-connector
- Bug fixes across the repo
- Minor enhancements and experimentation with single packaging techniques using [extra]
- Decoupled the release process for each component so we can be more responsive to stakeholder needs
- The minor version digit is incremented and the patch digit reset to 0 for all new releases of the data-prep-toolkit
- The patch digit of any one component can be increased independently of other components' patch numbers
- Released the first version of the data-prep-toolkit-connector for crawling websites and downloading HTML and PDF files for ingestion by the pipeline
- Bug fixes across the repo
- Added AI Alliance RAG demo, tutorials and notebooks and tips for running on google colab
- Added new transforms and single package for transforms published to pypi
- Improved CI/CD with targeted workflow triggered on specific changes to specific modules
- New enhancements for cutting a release
- Restructured the repository to distinguish/separate runtime libraries
- Split data-processing-lib/ray into python and ray
- Spark runtime
- Updated pyarrow version
- Defined the required transform() method as abstract in AbstractTableTransform
- Enabled makefile configuration to use src or PyPI for data-prep-kit library dependencies
- Added a configurable timeout before destroying the deployed Ray cluster
- Added seven new transforms, including: language identification, profiler, repo-level ordering, doc quality, docling2parquet, HTML2Parquet, and PII
- Added ededup python implementation and incremental ededup
- Added fuzzy floating point comparison
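Fuzzy floating-point comparison replaces exact equality, which fails on accumulated rounding error (e.g. 0.1 + 0.2 is not exactly 0.3). A minimal sketch using the standard library (the tolerances shown are illustrative, not DPK's chosen values):

```python
import math


def almost_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare floats within a tolerance instead of with ==."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
```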
- Many bug fixes across the repo, plus the following specifics.
- Enhanced CI/CD and makefiles, including definition of top-level targets (clean, set-versions, build, publish, test)
- Automation of release process branch/tag management
- Documentation improvements
- Split libraries into 3 runtime-specific implementations
- Fixed the missing final count of processed items and added percentages
- Improved fault tolerance in python and ray runtimes
- Report global DataAccess retry metric
- Support for binary data transforms
- Updated Ray to version 2.24
- Updated to PyArrow version 16.1.0
- Add KFP V2 support
- Create a distinct (timestamped) execution.log file for each retry
- Support for multiple inputs/outputs
- Added language/lang_id - detects language in documents
- Added universal/profiler - counts words/tokens in documents
- Converted ingest2parquet tool to transform named code2parquet
- Split transforms, as appropriate, into python, ray and/or spark.
- Added spark implementations of filter, doc_id and noop transforms.
- Switch from using requirements.txt to pyproject.toml file for each transform runtime
- Repository restructured to move kfp workflow definitions to associated transform project directory