⚡ Bolt: Optimize JSON serialization ~18x for ASCII #2398
SatoryKono wants to merge 2 commits into main
…scaping

Optimized `bioetl.domain.serialization` by replacing manual Python string iteration with a `str.isascii()` check and a `json.dumps` fallback, resulting in ~18x speedup for ASCII data and ~8x speedup for non-ASCII data compared to the previous manual implementation.

Fixed a correctness bug in `OrjsonEncoder` where `ensure_ascii=True` produced invalid JSON escapes (`\xXX`) by using `json.dumps` as a correct fallback.

Preserved `dumps_canonical` behavior to maintain hash compatibility.

Co-authored-by: SatoryKono <13055362+SatoryKono@users.noreply.github.com>
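The fast path described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `dumps_ensure_ascii` is a hypothetical name, and the standard-library `json` module stands in for the fast encoder (the PR's `OrjsonEncoder` presumably wraps orjson) so that the sketch stays self-contained.

```python
import json
from typing import Any


def dumps_ensure_ascii(obj: Any) -> str:
    """Hypothetical helper: serialize obj to JSON that contains only ASCII."""
    # Stand-in for the fast encoder; it leaves non-ASCII characters unescaped.
    fast = json.dumps(obj, ensure_ascii=False, separators=(",", ":"))

    # Fast path: str.isascii() runs in C, so no Python-level loop is needed
    # to confirm that the output is already ASCII-only.
    if fast.isascii():
        return fast

    # Fallback: let json.dumps emit valid \uXXXX escapes for non-ASCII text.
    return json.dumps(obj, ensure_ascii=True, separators=(",", ":"))


if __name__ == "__main__":
    print(dumps_ensure_ascii({"name": "aspirin"}))
    print(dumps_ensure_ascii({"name": "naïve"}))
```

The non-ASCII case pays for a second serialization pass in this sketch, which may be part of why the reported speedup is smaller for non-ASCII data.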
Optimized `bioetl.domain.serialization` by replacing manual Python string iteration with a `str.isascii()` check and a `json.dumps` fallback, resulting in ~18x speedup for ASCII data and ~8x speedup for non-ASCII data.

Fixed a correctness bug in `OrjsonEncoder` where `ensure_ascii=True` produced invalid JSON escapes (`\xXX`) by using `json.dumps` as a correct fallback.

Addressed CI failures:
- Fixed ambiguous variable names (`l` -> `label`) and unused variables in `src/tools/differentiate_linkstyle.py`.
- Formatted `tests/unit/domain/test_serialization.py` to satisfy architecture tests.

Co-authored-by: SatoryKono <13055362+SatoryKono@users.noreply.github.com>
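To illustrate the escaping bug mentioned above, here is a small sketch that uses only the standard library. The helper `is_valid_json` and the way the string literal is assembled are assumptions made for demonstration; the actual pre-fix code in `OrjsonEncoder` is not shown in this PR page.

```python
import json


def is_valid_json(document: str) -> bool:
    """Hypothetical helper: report whether a string parses as JSON."""
    try:
        json.loads(document)
    except json.JSONDecodeError:
        return False
    return True


text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Python's unicode_escape codec writes code points below 256 as \xXX,
# which is a Python escape, not one of the escapes JSON allows.
bad = '"' + text.encode("unicode_escape").decode("ascii") + '"'

# json.dumps with ensure_ascii=True writes the JSON escape \uXXXX instead.
good = json.dumps(text, ensure_ascii=True)

if __name__ == "__main__":
    print(bad, is_valid_json(bad))
    print(good, is_valid_json(good))
```

JSON permits only `\"`, `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, `\t`, and `\uXXXX` as escape sequences, so a parser rejects the first form.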
Superseded by #2423.
⚡ Bolt: Optimize JSON serialization for ASCII and fix invalid escaping
💡 What:
- Replaced manual string iteration in `bioetl.domain.serialization` with an optimized `str.isascii()` check.
- Added a `json.dumps` fallback for non-ASCII data.
- Fixed `OrjsonEncoder.dumps` to correctly handle `ensure_ascii=True` by falling back to `json.dumps` instead of using `unicode_escape` (which produces invalid JSON `\xXX`).

🎯 Why:
- `_has_non_ascii` and `_escape_non_ascii` were implemented using Python loops, which are O(N) with interpreter overhead on every character and therefore extremely slow.
- `OrjsonEncoder` was producing invalid JSON when `ensure_ascii=True` was requested.
- `str.isascii()` is implemented in C and is orders of magnitude faster.
- `json.dumps` is implemented in C and handles escaping correctly and efficiently.

📊 Impact:
- ~18x speedup for ASCII data and ~8x speedup for non-ASCII data compared to the previous manual implementation.

🔬 Measurement:
- Benchmarked with `timeit` on 10k iterations of ASCII and non-ASCII datasets.
- Verified with `tests/unit/domain/test_serialization.py` and `tests/unit/infrastructure/serialization/test_json_encoders.py`.

PR created automatically by Jules for task 10380415512159935346 started by @SatoryKono
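A measurement along these lines could be reproduced with a sketch like the one below. The function names, sample strings, and harness are assumptions for illustration; the actual benchmark script and datasets are not included in this PR, so the numbers it prints will not necessarily match the figures reported above.

```python
import timeit


def has_non_ascii_loop(text: str) -> bool:
    """Per-character Python loop, in the style the PR says it replaced."""
    for ch in text:
        if ord(ch) > 127:
            return True
    return False


def has_non_ascii_fast(text: str) -> bool:
    """C-implemented check via str.isascii()."""
    return not text.isascii()


def benchmark(iterations: int = 10_000) -> dict:
    """Time both checks on an ASCII and a non-ASCII sample (hypothetical data)."""
    samples = {
        "ascii": "aspirin " * 32,
        "non_ascii": "aspirin " * 31 + "na\u00efve",
    }
    results = {}
    for label, text in samples.items():
        for func in (has_non_ascii_loop, has_non_ascii_fast):
            elapsed = timeit.timeit(lambda: func(text), number=iterations)
            results[f"{func.__name__}/{label}"] = elapsed
    return results


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds:.4f}s")
```

The non-ASCII sample puts the non-ASCII character near the end of the string, so the loop version has to scan almost the whole input before it can return.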