Data Prep Kit Release notes

Release 1.1.6 - 11/13/2025

Transforms

  1. OpenSearch Transform: Enables keyword and vector-based search capabilities using OpenSearch
    • If the table includes an embeddings column, the transform sets up a k-NN vector index for similarity searches.
  2. Image Transform Modality: Introduced three new transforms for processing image data:
    • Faces: Detects people and faces using a pre-trained face detection model.
    • NSFW: Scores content for Not Safe For Work material using a Hugging Face image-classification pipeline.
    • People: Counts faces and supports face blurring for privacy.
  3. Docling2Parquet: Updated options to extract binary image data (images/pages) into a dedicated column (image_bins) in Parquet output.
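The embeddings handling in the OpenSearch transform can be sketched as an index-body builder: when an embeddings column is present, the index enables k-NN and maps the column as a vector field. The field names ("text", "embeddings") and the default dimension below are illustrative assumptions, not the transform's actual parameters.

```python
def build_index_body(has_embeddings: bool, dimension: int = 384) -> dict:
    """Build an OpenSearch index body; enable k-NN only when embeddings exist.

    Field names and the dimension default are illustrative assumptions.
    """
    body = {
        "settings": {"index": {"knn": has_embeddings}},
        "mappings": {"properties": {"text": {"type": "text"}}},
    }
    if has_embeddings:
        # "knn_vector" is OpenSearch's vector field type for similarity search.
        body["mappings"]["properties"]["embeddings"] = {
            "type": "knn_vector",
            "dimension": dimension,
        }
    return body
```

A keyword-only table would get the same mapping minus the vector field, so both cases share one code path.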

General

  1. Tekton deployment yamls: Introduced Kubernetes deployment YAMLs for Tekton, simplifying pipeline composition without relying on Kubeflow Pipelines (KFP) infrastructure.
  2. Input Handling: Expanded runtime input support to include ZIP, NDJSON, and JSON formats in addition to Parquet.
  3. Logging System: Implemented a new JSON-based logging system that consolidates all DPK logs into a single logger.
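A consolidated JSON logger like the one described can be sketched with the standard library: one shared logger emits each record as a single JSON object. The logger name ("dpk") and the field set are assumptions, not DPK's actual log schema.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })

# One shared logger for all components; the name "dpk" is an assumption.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("dpk")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("transform started")
record = json.loads(stream.getvalue())
```

Because every component logs through the same named logger, all output lands in one handler and stays machine-parseable.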

Release 1.1.5 - 10/2/2025

Transforms

  1. Granite Docling Integration: Enabled document parsing via docling2parquet, with options for VLM pipeline compatibility.

  2. PII Redactor: Added support for cryptographic redaction.

  3. GneissWeb: Enhanced multithreading and optimized model loading for better performance.

  4. Filter Transform: Added safeguard to check if filter_criteria is None, preventing crashes when criteria are unset or empty.
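The filter_criteria safeguard above can be sketched as an early return when no criteria are set. The function and parameter names are illustrative, not the transform's real API, and plain lists of dicts stand in for Arrow tables.

```python
def apply_filter(rows: list, filter_criteria) -> list:
    """Filter rows by a list of predicates; pass through when criteria unset.

    Names are illustrative; lists of dicts stand in for Arrow tables.
    """
    # Guard: None or an empty list means "no filtering", not a crash.
    if not filter_criteria:
        return rows
    return [row for row in rows if all(pred(row) for pred in filter_criteria)]
```

The single `if not filter_criteria` check covers both the None and the empty-list cases, which is exactly the situation the safeguard targets.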

General

  1. Python Multiprocessing: Introduced multiprocessing job support and resolved boto pickling errors for transform runtimes.

Release 1.1.4 - 9/15/2025

General

  1. Improved logging to remove access and secret keys from the config, when present, for legacy runs.
  2. Resolved issues with handling additional secrets when more than one secret is added to the config for KFP.

Transforms

  1. Added support for binary transforms and binary data in chained operations (new examples and test coverage provided).
  2. Updated the filter transform to return an empty table, preserving the original schema, when filtering results in an empty table.
  3. Updated tokenization2arrow to correctly process lists of texts.

Dependency updates

  1. Avoided using polars version 1.33 due to breaking changes.
  2. Removed lower bound constraint on boto3 dependency.

Release 1.1.3 - 8/18/2025

General

  1. Fixed a bug with Markdown (MD) files as input for docling2parquet
  2. Adjusted tagged dependencies to ensure notebooks work in the Google Colab environment
  3. Prepared the post1 release with patches
  4. Parsed metadata.json at the end of the run and flagged exceptions, removing cases where logs would show failure but KFP would show success
  5. Updated model_loader to use data_access_s3, enabling S3 I/O from different COS locations
  6. Removed a non-required torch dependency
  7. Updated validation for data_access_local, allowing an empty input_folder and/or output_folder (defaults to the current directory)
  8. Fixed a division-by-zero bug in the fineweb quality annotator

Release 1.1.2.post1 - 7/3/2025

General

  1. Patched the filter transform failing when used with a default/empty configuration
  2. Patched PII pydantic requirements to allow testing with Prefect

Release 1.1.2 - 7/3/2025

General

  1. Restructured the data-access package to allow adding user-specific connectors as external packages (e.g., a lakehouse connector)
  2. Removed credentials as transform/data access arguments; they are now set via environment variables
  3. Added runtime code location environment variables to Docker files to display real build information
  4. Added in-memory data access for caching reads/writes in DataAccessLocal and in a new DataAccessMemory class
  5. Added file batch processing for data access
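An in-memory data-access cache like the DataAccessMemory class described can be sketched as a dict-backed store keyed by path. The method names below mirror the read/write idea but are assumptions, not the class's actual interface.

```python
class DataAccessMemory:
    """Dict-backed store caching file bytes by path (illustrative sketch).

    Method names are assumptions, not the real DPK interface.
    """
    def __init__(self) -> None:
        self._files: dict = {}

    def save_file(self, path: str, data: bytes) -> None:
        # Writes go to memory instead of disk or object storage.
        self._files[path] = data

    def get_file(self, path: str) -> bytes:
        return self._files[path]

    def get_folder_files(self, prefix: str) -> dict:
        # Emulate listing a folder by prefix-matching stored paths.
        return {p: d for p, d in self._files.items() if p.startswith(prefix)}
```

Because the store exposes the same read/write shape as a file-backed accessor, tests and cached pipelines can swap it in without touching transform code.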

Transforms

  1. Added a transform chain module for running one or more transforms in sequence, with support for parallel micro-batch execution
  2. Added fineweb_quality_annotator and gopher_repetition_annotator transforms
  3. Added a model loader utility for transforms that use models, enabling loading from COS, Hugging Face, or local storage
  4. Updated KFP workflows to remove setting the runtime code location, and to set credentials via environment variables
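The chaining idea can be sketched as sequential application of transforms, each a callable from table to table. Lists of dicts stand in for Arrow tables, and the function names are illustrative; a real implementation would additionally split rows into micro-batches processed in parallel.

```python
def run_chain(rows, transforms):
    """Apply each transform to the output of the previous one (sketch)."""
    for transform in transforms:
        rows = transform(rows)
    return rows

# Two toy transforms; names and behavior are illustrative assumptions.
def drop_empty(rows):
    # Remove rows whose "text" field is missing or empty.
    return [r for r in rows if r.get("text")]

def add_doc_id(rows):
    # Attach a sequential doc_id to each surviving row.
    return [dict(r, doc_id=i) for i, r in enumerate(rows)]

result = run_chain([{"text": "a"}, {"text": ""}], [drop_empty, add_doc_id])
```

Ordering matters: dropping empty rows before assigning IDs keeps the IDs contiguous.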

Release 1.1.1 - 4/9/2025

General

  1. Moved to the Linux Foundation AI & Data organization and re-assigned roles for maintainers and contributors
  2. Added a logo, updated documentation, and enforced signed contributions
  3. Added support for Hugging Face credentials when running verification workflows
  4. Added new transforms and bug fixes

Transforms

  1. Added Multi-lingual ML Filter and Enrichment (quality annotation) transforms
  2. Added Blocklist and Collapse (column concatenation) transforms
  3. Added support for comment-based semantic categories
  4. Ededup: added a parameter for an optional removed field
  5. Refactored all code transforms and published them as wheels on PyPI

Recipes

  1. Refactored, re-organized and continued improving notebooks and recipes

Release 1.1.0 - 3/10/2025

General

  1. Updated tutorials and documentation
  2. Added GneissWeb Transforms, GneissWeb recipe and support for XML ingest in Docling
  3. Bug fixes for Windows Support, KFP workflow pipelines, and CI/CD workflow

Recipes

  1. Updated RAG Notebooks for PDF and HTML
  2. New GneissWeb Notebook showcasing advanced data prep operations for improved model performance
  3. New Agentic Notebook showcasing integration with Langchain and Llama-index

Transforms

  1. GneissWeb transforms: extreme tokenized, readability, GneissWeb classification, Rep Removal, Tokenization2Arrow, Bloom
  2. Code Profiler: Added support for CSharp
  3. Header Cleanser: Enhanced with multi-processing support
  4. Fuzzy Dedup: Added support for Windows folder names
  5. PDF to Parquet: Updated docling to 2.25 for ingesting XML/JATS
  6. HAP: Assigned a 0 score for empty content

data-prep-toolkit libraries (python, ray, spark)

  1. Disabled fcntl on Windows

KFP Pipelines

  1. Updated super pipeline KFPv2

Release 1.0.0 - 1/24/2025

General

  1. Refactored all language transforms and implemented simplified APIs for the refactored transforms
  2. Added notebook examples for each of the transforms
  3. Streamlined documentation and added tutorial for developers who want to build new transforms
  4. Other minor enhancements and bug fixes were done for transforms, workflow pipelines, and CI/CD makefiles

Transforms

  1. Added new similarity transform (for detecting confidentiality, copyright, and/or plagiarism in documents)

Release 0.2.3 - 12/15/2024

General

  1. New algorithm for the Fuzzy dedup transform
  2. Sample notebooks for some of the language transforms
  3. Integrated Semantic profiler and report generation for the code profiler transform

data-prep-toolkit libraries (python, ray, spark)

  1. Increase ray agent limit to 10,000 (default was 100)

Transforms

  1. Fuzzy dedup new algorithm for Python, Ray and Spark

Release 0.2.2 - 11/25/2024

General

  1. Updated the RAG example to use a Granite model
  2. Updated transforms with Docling 2
  3. Added a single package for DPK with extras for [spark] and [ray]
  4. Added a single package for transforms with extras for [all] or [individual-transform-name]

data-prep-toolkit libraries (python, ray, spark)

  1. Fixed metadata logging even when actors crash
  2. Added a multilock for Ray worker downloads/cleanup
  3. Multiple updates to the Spark runtime
  4. Added support for Python 3.12
  5. Refactored data access code

KFP Workloads

  1. Modified superpipeline parameter types (Str/JSON)
  2. Set the KubeRay API server version
  3. Added a super pipeline for code transforms

Transforms

  1. Enhanced docling2parquet with Docling 2 support for extracting HTML, DOCX, etc.
  2. Added web2parquet transform
  3. Added HAP transform

HTTP Connector 0.2.3

  1. Enhanced parameters/configuration allow the user to customize crawler settings
  2. Implemented a subdomain focus feature in data-prep-connector

Release 0.2.2- HTTP Connector Module - 10/23/2024

General

  1. Bug fixes across the repo
  2. Minor enhancements and experimentation with single packaging techniques using [extra]
  3. Decoupled the release process for each component so we can be more responsive to the needs of our stakeholders
  4. The minor digit of the release is incremented and the patch digit reset to 0 for all new releases of the data-prep-toolkit
  5. The patch digit of any one component's release can be increased independently of other components' patch numbers

data-prep-toolkit-Connector

  1. Released first version of the data-prep-toolkit-connector for crawling web sites and downloading HTML and PDF files for ingestion by the pipeline

Release 0.2.1 - 9/24/2024

General

  1. Bug fixes across the repo
  2. Added an AI Alliance RAG demo, tutorials, notebooks, and tips for running on Google Colab
  3. Added new transforms and a single package for transforms published to PyPI
  4. Improved CI/CD with targeted workflow triggered on specific changes to specific modules
  5. New enhancements for cutting a release

data-prep-toolkit libraries (python, ray, spark)

  1. Restructured the repository to distinguish/separate runtime libraries
  2. Split data-processing-lib/ray into python and ray
  3. Added a Spark runtime
  4. Updated the pyarrow version
  5. Defined the required transform() method as abstract in AbstractTableTransform
  6. Enabled configuration of the makefile to use src or PyPI for data-prep-kit library dependencies

KFP Workloads

  1. Add a configurable timeout before destroying the deployed Ray cluster.

Transforms

  1. Added 7 new transforms, including: language identification, profiler, repo level ordering, doc quality, docling2parquet, HTML2Parquet, and PII Transform
  2. Added ededup python implementation and incremental ededup
  3. Added fuzzy floating point comparison

Release 0.2.0 - 6/27/2024

General

  1. Many bug fixes across the repo, plus the following specifics.
  2. Enhanced CI/CD and makefile improvements, including definition of top-level targets (clean, set-versions, build, publish, test)
  3. Automation of release process branch/tag management
  4. Documentation improvements

data-prep-toolkit libraries (python, ray, spark)

  1. Split libraries into 3 runtime-specific implementations
  2. Fix missing final count of processed and add percentages
  3. Improved fault tolerance in python and ray runtimes
  4. Report global DataAccess retry metric
  5. Support for binary data transforms
  6. Updated Ray to version 2.24
  7. Updated PyArrow to version 16.1.0

KFP Workloads

  1. Add KFP V2 support
  2. Create a distinct (timestamped) execution.log file for each retry
  3. Support for multiple inputs/outputs

Transforms

  1. Added language/lang_id - detects language in documents
  2. Added universal/profiler - counts words/tokens in documents
  3. Converted ingest2parquet tool to transform named code2parquet
  4. Split transforms, as appropriate, into python, ray and/or spark.
  5. Added spark implementations of filter, doc_id and noop transforms.
  6. Switch from using requirements.txt to pyproject.toml file for each transform runtime
  7. Repository restructured to move kfp workflow definitions to associated transform project directory

Release 0.1.1 - 5/24/2024

Release 0.1.0 - 5/15/2024

Release 0.1.0 - 5/08/2024