dibio: enforce minimum sample size for dictionary training#4618
dibio: enforce minimum sample size for dictionary training#4618BhavyaBibra wants to merge 1 commit intofacebook:devfrom
Conversation
Cyan4973
left a comment
There was a problem hiding this comment.
While the core idea seems sound—enforcing a minimum sample size for dictionary training is a good safeguard, and 8 bytes is indeed extremely small. I also wouldn’t be surprised if a similar constraint is already enforced elsewhere in the pipeline.
However, a tactically detrimental aspect of this PR is that it bundles an unrelated change in source-code indentation style. This causes a large number of purely mechanical diffs across multiple files, which makes it difficult (if not impossible) to determine what functional behavior is actually changing and what the real impact is.
Given the large attack surface exposed through Zstandard, we cannot accept external contributions unless the changes are fully understood.
Please refactor this PR to include only the minimal, essential functional changes, so the review can be completed safely and with high confidence.
2e227f7 to
23ab7d1
Compare
|
Thank you for the review and for pointing that out! I apologize for the noise—my editor's auto-formatter was applying unintended whitespace changes on save. I've updated the PR and force-pushed a clean version. The diff now contains strictly the minimal, essential functional changes to enforce the 8-byte minimum sample size constraint without any unrelated formatting changes. If you compare the changes against the dev branch (e.g., git diff origin/dev..HEAD programs/dibio.c), you'll see it is now completely isolated to just those specific logical additions. Let me know if this looks good to move forward with the review! |
Summary
Enforce a minimum sample size of 8 bytes when gathering training data for dictionary building. Files smaller than this threshold provide negligible training value and may cause edge-case issues in the dictionary builder.
This resolves two TODO comments in dibio.c:
/* TODO: is there a minimum sample size? What if the file is 1-byte? *//* TODO: is there a minimum sample size? Can we have a 1-byte sample? */Changes
programs/dibio.c
In DiB_fileStats() — file stats gathering (line ~282):
DISPLAYLEVEL(3)warningnbSamplescount without meaningful dataIn DiB_fileStats() — chunked sample path (line ~297):
Design Decisions
displayLevel >= 3, not a fatal error. This preserves backwards compatibility for scripts that may pass heterogeneous file listsTesting
make -C programs zstd # ✅ builds successfully with changes