- OpenSearch Transform: Enables keyword and vector-based search capabilities using OpenSearch
- If the table includes an embeddings column, the transform sets up a k-NN vector index for similarity searches.
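A k-NN-enabled index in OpenSearch is created with a mapping along these lines (a minimal sketch; the field names and dimension here are illustrative, not necessarily the transform's actual defaults):

```json
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "embeddings": { "type": "knn_vector", "dimension": 384 }
    }
  }
}
```

Documents indexed under such a mapping support both keyword queries on the text field and approximate nearest-neighbor queries on the embeddings field.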
- Image Transform Modality: Introduced three new transforms for processing image data:
- Faces: Detects people and faces using a pre-trained face detection model.
- NSFW: Scores content for Not Safe For Work using Hugging Face image-classification pipeline.
- People: Counts faces and supports face blurring for privacy.
- Docling2Parquet: Updated options to extract binary image data (images/pages) into a dedicated column (image_bins) in Parquet output.
- Tekton deployment yamls: Introduced Kubernetes deployment YAMLs for Tekton, simplifying pipeline composition without relying on Kubeflow Pipelines (KFP) infrastructure.
- Input Handling: Expanded runtime input support to include ZIP, NDJSON, and JSON formats in addition to Parquet.
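Reading the non-Parquet formats reduces to standard-library parsing; the sketch below shows the NDJSON and ZIP cases (function names are illustrative, not DPK's actual runtime API):

```python
import io
import json
import zipfile


def read_ndjson(text):
    """Parse newline-delimited JSON (NDJSON) into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_zip_members(data):
    """Yield (name, bytes) for each member of a ZIP archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            yield name, zf.read(name)
```

Each ZIP member can then be routed to the parser matching its extension, while NDJSON and JSON inputs are converted directly to tables.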
- Logging System: Implemented a new JSON-based logging system that consolidates all DPK logs into a single logger.
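JSON-based logging of this kind can be built on the standard `logging` module with a custom formatter; the snippet below is a minimal sketch of the pattern (field names are illustrative, not DPK's actual log schema):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })


# Route all records through one logger with the JSON formatter attached.
logger = logging.getLogger("dpk")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
```

Because every record is one JSON object per line, the consolidated log is easy to filter and aggregate with standard tooling.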
- Granite Docling Integration: Enabled document parsing via docling2parquet, with options for VLM pipeline compatibility.
- PII Redactor: Added support for cryptographic redaction.
- GneissWeb: Enhanced multithreading and optimized model loading for better performance.
- Filter Transform: Added safeguard to check if filter_criteria is None, preventing crashes when criteria are unset or empty.
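The safeguard amounts to treating unset or empty criteria as a pass-through; a minimal sketch of the idea (not DPK's actual code, and criteria here are modeled as predicate callables):

```python
def apply_filter(rows, filter_criteria=None):
    """Return rows unchanged when criteria are unset or empty,
    instead of crashing on None."""
    if not filter_criteria:  # covers both None and []
        return list(rows)
    return [row for row in rows if all(pred(row) for pred in filter_criteria)]
```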
- Python Multiprocessing: Introduced multiprocessing job support and resolved boto pickling errors for transform runtimes.
- Improved logging to remove access and secret keys from config when present for legacy runs.
- Resolved issues with handling additional secrets when more than one secret is added to the config for KFP.
- Added support for binary transforms and binary data in chained operations (new examples and test coverage provided).
- Updated filter transform to return an empty table while preserving the original schema when filtering results in empty table
- Updated tokenization2arrow to correctly process lists of texts.
- Avoided using polars version 1.33 due to breaking changes.
- Removed lower bound constraint on boto3 dependency.
- Fixed a bug with Markdown (MD) files as input for docling2parquet
- Adjusted tagged dependencies to ensure notebooks work in the Google Colab environment
- Prepared the post1 release with patches
- Parsed metadata.json at the end of the run to flag exceptions, fixing cases where logs showed failure but KFP reported success
- Updated model_loader to utilize data_access_s3, enabling s3 I/O from different COS locations
- Removed a non-required torch dependency
- Updated validation for data_access_local, allowing empty input_folder and/or output_folder (defaults to current directory)
- Fixed bug with dividing by 0 in fine web quality annotator
- Patched the filter failing when the transform is used with a default/empty configuration
- Patched PII requirements for pydantic to allow testing with Prefect
- Restructured data-access package to allow adding user specific connectors as external packages (e.g. lakehouse connector)
- Removed credentials from transform/data-access arguments; they are now passed as environment variables
- Added runtime code-location environment variables to Dockerfiles to display real build information
- Added in-memory data access for caching reads/writes in DataAccessLocal, and in new DataAccessMemory class.
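An in-memory data access backed by a dict is enough for caching and tests; the class below is a hypothetical sketch of the idea, not the actual DataAccessMemory API:

```python
class DataAccessMemory:
    """Dict-backed data access for caching reads/writes in memory
    (hypothetical sketch; the real class's methods may differ)."""

    def __init__(self):
        self._store = {}

    def save_file(self, path, data):
        self._store[path] = bytes(data)

    def get_file(self, path):
        return self._store.get(path)

    def list_files(self, prefix=""):
        return sorted(p for p in self._store if p.startswith(prefix))
```

Because it implements the same read/write surface as file-based access, it can be swapped in wherever a transform only needs bytes in and bytes out.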
- Added file batch processing for data access
- Added transform chain module for running one or more transforms in sequence, with support for parallel micro-batch execution
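Chaining works by feeding each transform's output tables to the next transform in the sequence; a minimal sequential sketch (the real chain module adds parallel micro-batch execution, which is omitted here):

```python
def run_chain(transforms, tables):
    """Run a sequence of transforms, where each transform maps one
    input table to zero or more output tables."""
    for transform in transforms:
        next_tables = []
        for table in tables:
            next_tables.extend(transform(table))
        tables = next_tables
    return tables
```

A transform returning an empty list drops its input; returning several tables fans the data out for the rest of the chain.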
- Added fineweb_quality_annotator and gopher_repetition_annotator transforms
- Added model loader util for transforms utilizing models, enabling loading from COS, HuggingFace, and locally
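Supporting COS, Hugging Face, and local sources comes down to dispatching on the shape of the model path; an illustrative sketch (the function name and heuristics are assumptions, not DPK's actual loader API):

```python
import os


def resolve_model_source(model_path):
    """Pick a model source by path shape: a COS/S3 URI, a Hugging Face
    repo id, or a local path."""
    if model_path.startswith(("s3://", "cos://")):
        return "cos"
    # A slash-separated id that is not an existing local path is treated
    # as a Hugging Face Hub repo id.
    if "/" in model_path and not os.path.exists(model_path):
        return "huggingface"
    return "local"
```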
- Updated KFP workflows to stop setting the runtime code location and to set credentials via environment variables
- Moved to the Linux Foundation AI & Data organization and re-assigned roles for maintainers and contributors
- Added a logo, updated documentation, and enforced signed contributions
- Added support for Hugging Face credentials when running verification workflows
- Added new transforms and bug fixes
- Added Multi-lingual ML Filter and Enrichment (quality annotation) transforms
- Added Blocklist and Collapse (column concatenation) transforms
- Added support for comment-based semantic categories
- Ededup: added a parameter for an optional removed field
- Refactored all code transforms and published them as wheels on PyPI
- Refactored, re-organized and continued improving notebooks and recipes
- Updated tutorials and documentation
- Added GneissWeb Transforms, GneissWeb recipe and support for XML ingest in Docling
- Bug fixes for Windows Support, KFP workflow pipelines, and CI/CD workflow
- Updated RAG Notebooks for PDF and HTML
- New GneissWeb Notebook showcasing advanced data prep operations for improved model performance
- New Agentic Notebook showcasing integration with Langchain and Llama-index
- GneissWeb transforms: extreme tokenized, readability, gneissweb classification, Rep Removal, Tokenization2Arrow, Bloom
- Code Profiler: Added Support for CSharp
- Header Cleanser: Enhanced with multi-processing support
- Fuzzy Dedup: Support for Windows Folder names
- PDF to Parquet: Updated Docling to 2.25 for ingesting XML/JATS
- HAP: Assigns a 0 score for empty content
- Disabled fcntl on Windows
- Updated the super pipeline for KFP v2
- Refactored all language transforms and implemented simplified APIs for the refactored transforms
- Added notebook examples for each of the transforms
- Streamlined documentation and added tutorial for developers who want to build new transforms
- Other minor enhancements and bug fixes for transforms, workflow pipelines, and CI/CD makefiles
- Added new similarity transform (for detecting confidentiality, copyright, and/or plagiarism in documents)
- New algorithm for the Fuzzy Dedup transform
- Sample notebooks for some of the language transforms
- Integrated the Semantic Profiler and report generation into the Code Profiler transform
- Increased the Ray agent limit to 10,000 (default was 100)
- New Fuzzy Dedup algorithm for Python, Ray, and Spark
- Updated the RAG example to use a Granite model
- Updated transforms with Docling 2
- Added single package for dpk with extra for [spark] and [ray]
- Added single package for transforms with extra for [all] or [individual-transform-name]
- Fixed metadata logging even when actors crash
- Added a multilock for Ray worker downloads/cleanup
- Multiple updates to the Spark runtime
- Added support for Python 3.12
- Refactored data access code
- Modified superpipeline parameter types (str/JSON)
- Set the KubeRay API server version
- Added a super pipeline for code transforms
- Enhanced docling2parquet with Docling 2 support for extracting HTML, DOCX, etc.
- Added web2parquet transform
- Added HAP transform
- Enhanced parameters/configuration allow users to customize crawler settings
- Implemented a subdomain-focus feature in data-prep-connector
- Bug fixes across the repo
- Minor enhancements and experimentation with single packaging techniques using [extra]
- Decoupled the release process for each component so we can be more responsive to stakeholder needs
- The minor version digit is incremented and the patch digit reset to 0 for all new releases of the data-prep-toolkit
- The patch digit of any one component can be increased independently of other components' patch numbers
- Released the first version of the data-prep-toolkit-connector for crawling websites and downloading HTML and PDF files for ingestion by the pipeline
- Bug fixes across the repo
- Added AI Alliance RAG demo, tutorials and notebooks and tips for running on google colab
- Added new transforms and single package for transforms published to pypi
- Improved CI/CD with targeted workflow triggered on specific changes to specific modules
- New enhancements for cutting a release
- Restructured the repository to distinguish/separate runtime libraries
- Split data-processing-lib/ray into python and ray
- Spark runtime
- Updated pyarrow version
- Defined the required transform() method as abstract in AbstractTableTransform
- Enabled makefile configuration to use src or PyPI for data-prep-kit library dependencies
- Added a configurable timeout before destroying the deployed Ray cluster
- Added seven new transforms, including: language identification, profiler, repo-level ordering, doc quality, docling2parquet, HTML2Parquet, and PII
- Added ededup python implementation and incremental ededup
- Added fuzzy floating point comparison
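Fuzzy floating-point comparison replaces exact equality, which fails on accumulated rounding error (e.g. 0.1 + 0.2 is not exactly 0.3). A minimal sketch using the standard library (the tolerances shown are illustrative, not DPK's chosen values):

```python
import math


def almost_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare floats within a tolerance instead of with ==."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
```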
- Many bug fixes across the repo, plus the following specifics.
- Enhanced CI/CD and makefiles, including definition of top-level targets (clean, set-versions, build, publish, test)
- Automation of release process branch/tag management
- Documentation improvements
- Split libraries into 3 runtime-specific implementations
- Fixed the missing final count of processed items and added percentages
- Improved fault tolerance in python and ray runtimes
- Report global DataAccess retry metric
- Support for binary data transforms
- Updated Ray to version 2.24
- Updated to PyArrow version 16.1.0
- Add KFP V2 support
- Create a distinct (timestamped) execution.log file for each retry
- Support for multiple inputs/outputs
- Added language/lang_id - detects language in documents
- Added universal/profiler - counts words/tokens in documents
- Converted ingest2parquet tool to transform named code2parquet
- Split transforms, as appropriate, into python, ray and/or spark.
- Added spark implementations of filter, doc_id and noop transforms.
- Switch from using requirements.txt to pyproject.toml file for each transform runtime
- Repository restructured to move kfp workflow definitions to associated transform project directory