Skip to content

dibio: enforce minimum sample size for dictionary training#4618

Open
BhavyaBibra wants to merge 1 commit intofacebook:devfrom
BhavyaBibra:contrib/min-sample-size
Open

dibio: enforce minimum sample size for dictionary training#4618
BhavyaBibra wants to merge 1 commit intofacebook:devfrom
BhavyaBibra:contrib/min-sample-size

Conversation

@BhavyaBibra
Copy link

Summary

Enforce a minimum sample size of 8 bytes when gathering training data for dictionary building. Files smaller than this threshold provide negligible training value and may cause edge-case issues in the dictionary builder.

This resolves two TODO comments in dibio.c:

  • Line 282: /* TODO: is there a minimum sample size? What if the file is 1-byte? */
  • Line 297: /* TODO: is there a minimum sample size? Can we have a 1-byte sample? */

Changes

programs/dibio.c

In DiB_fileStats() — file stats gathering (line ~282):

  • Files with size > 0 but < 8 bytes are now skipped with a DISPLAYLEVEL(3) warning
  • Prevents tiny files from inflating nbSamples count without meaningful data

In DiB_fileStats() — chunked sample path (line ~297):

  • Same 8-byte minimum applied when files are split into sample chunks
  • Skips with warning instead of hard error (backwards-compatible)

Design Decisions

  • Threshold of 8 bytes was chosen as a conservative minimum — large enough to filter noise, small enough to avoid rejecting legitimate edge cases
  • Warning, not error — files are skipped with a log message at displayLevel >= 3, not a fatal error. This preserves backwards compatibility for scripts that may pass heterogeneous file lists
  • No new CLI flag — the threshold is not configurable. If maintainers prefer a configurable option, this can be extended

Testing

make -C programs zstd  # ✅ builds successfully with changes

@meta-cla meta-cla bot added the CLA Signed label Mar 9, 2026
Copy link
Contributor

@Cyan4973 Cyan4973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the core idea seems sound—enforcing a minimum sample size for dictionary training is a good safeguard, and 8 bytes is indeed extremely small. I also wouldn’t be surprised if a similar constraint is already enforced elsewhere in the pipeline.

However, a tactically detrimental aspect of this PR is that it bundles an unrelated change in source-code indentation style. This causes a large number of purely mechanical diffs across multiple files, which makes it difficult (if not impossible) to determine what functional behavior is actually changing and what the real impact is.

Given the large attack surface exposed through Zstandard, we cannot accept external contributions unless the changes are fully understood.

Please refactor this PR to include only the minimal, essential functional changes, so the review can be completed safely and with high confidence.

@BhavyaBibra BhavyaBibra force-pushed the contrib/min-sample-size branch from 2e227f7 to 23ab7d1 Compare March 13, 2026 15:36
@BhavyaBibra
Copy link
Author

Thank you for the review and for pointing that out! I apologize for the noise—my editor's auto-formatter was applying unintended whitespace changes on save.

I've updated the PR and force-pushed a clean version. The diff now contains strictly the minimal, essential functional changes to enforce the 8-byte minimum sample size constraint without any unrelated formatting changes.

If you compare the changes against the dev branch (e.g., git diff origin/dev..HEAD programs/dibio.c), you'll see it is now completely isolated to just those specific logical additions.

Let me know if this looks good to move forward with the review!

@BhavyaBibra BhavyaBibra requested a review from Cyan4973 March 13, 2026 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants