⚡ Bolt: lazy hash computation in DuplicateFinder #5
RazorBackRoar wants to merge 2 commits into main from
Conversation
Co-authored-by: RazorBackRoar <184320956+RazorBackRoar@users.noreply.github.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ba276ee3f3
```python
for prev_file in self._files_by_size[file_size]:
    if prev_file not in self._hashed_files:
```
Avoid re-scanning every same-size file on each lookup
This loop runs on every `is_duplicate` call for an existing size bucket, even when all prior files are already hashed, so a run with many same-size files performs 1 + 2 + ... + N bucket scans (a quadratic number of path/set checks). In large media batches, where file sizes frequently match, this can make duplicate detection a major hotspot and regress runtime versus the previous linear behavior; each bucket should be finalized once (or tracked incrementally) instead of re-iterated for every new file.
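A minimal sketch of the incremental alternative described above. The class and method names mirror the PR's identifiers (`DuplicateFinder`, `is_duplicate`, `_files_by_size`), but the internals (`_collided_sizes`, `_hashes_seen`, `_sha256`) are hypothetical illustrations, not the actual implementation:

```python
import hashlib
import os
from collections import defaultdict

class DuplicateFinder:
    """Lazy hashing without quadratic re-scans: each file is hashed at most once."""

    def __init__(self):
        self._files_by_size = defaultdict(list)  # size -> files not yet hashed
        self._collided_sizes = set()             # sizes whose bucket was finalized
        self._hashes_seen = {}                   # digest -> first path with that digest

    @staticmethod
    def _sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def is_duplicate(self, path):
        size = os.path.getsize(path)
        pending = self._files_by_size[size]
        if not pending and size not in self._collided_sizes:
            # First file of this size: defer hashing entirely.
            pending.append(path)
            return False
        # Size collision: hash any deferred files exactly once, then empty
        # the bucket so later lookups never re-scan it.
        for prev in pending:
            self._hashes_seen.setdefault(self._sha256(prev), prev)
        pending.clear()
        self._collided_sizes.add(size)
        digest = self._sha256(path)
        if digest in self._hashes_seen:
            return True
        self._hashes_seen[digest] = path
        return False
```

Because a finalized bucket is cleared and remembered in `_collided_sizes`, repeated same-size lookups do constant work per new file instead of re-iterating all prior files.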
💡 What: Refactored `DuplicateFinder` to perform "lazy hashing." Instead of immediately reading each file from disk and computing a SHA-256 hash, it first groups files by size.
🎯 Why: In the vast majority of cases, a file with a unique size cannot be a duplicate, so hashing every file wastes disk I/O and CPU time.
📊 Impact: Completely eliminates the cost of reading and hashing files with unique sizes, removing all file I/O for those files and offering large performance gains during bulk processing of large video files.
🔬 Measurement: Run the duplicate finder on a directory of heterogeneous files and monitor disk I/O (e.g., with `iotop` or Activity Monitor's disk reads); read operations should drop significantly, since unique-size files are skipped entirely.
PR created automatically by Jules for task 12392763489017902986 started by @RazorBackRoar