Skip to content

Fix MAG L1C gap-fill bugs for validation tests T015 and T016#2899

Open
sapols wants to merge 3 commits intoIMAP-Science-Operations-Center:devfrom
sapols:mag-l1c-fix
Open

Fix MAG L1C gap-fill bugs for validation tests T015 and T016#2899
sapols wants to merge 3 commits intoIMAP-Science-Operations-Center:devfrom
sapols:mag-l1c-fix

Conversation

@sapols
Copy link
Copy Markdown

@sapols sapols commented Apr 1, 2026

Summary

Six localized bug fixes to the existing MAG L1C gap-filling code. These fixes un-skip and pass validation tests T015 and T016 (both mago and magi) while preserving the existing passes on T013, T014, and T024. All 10 L1C validation cases now pass with exact row-count matches against the checked-in expected outputs.

This PR intentionally keeps the current production algorithm shape. It does not rewrite the gap-fill pipeline, add new dataclasses, or replace the simplified L1C flow that was developed collaboratively between SDC and the MAG team. It only patches the specific correctness bugs that were causing T015 and T016 to fail.

Background

What was failing and why

Current dev skips T015 and T016 because they fail. The root causes are six bugs in the gap-fill path, each described below. T013, T014, and T024 were already passing on dev.

The validation tests exercise increasingly complex mode transitions:

  • T013/T014: Simple NM-to-BM-to-NM transitions, no Config mode. Already passing.
  • T015: NM-to-BM-to-NM with a Config mode transition at the start of burst. NM rate changes from 2 Hz to 4 Hz (mago) or 2 Hz to 1 Hz (magi).
  • T016: NM-to-BM-to-NM with a Config mode transition at the end of burst. NM rate changes from 4 Hz to 2 Hz (mago) or 1 Hz to 2 Hz (magi).
  • T024: Burst-only, no normal data at all.

T015 and T016 were the failing cases: they test mixed-rate transitions where the instrument changes its normal-mode sample rate across a burst window, passing through Config mode. These are the cases that exposed all six bugs.

How this was investigated

This issue was investigated using two AI coding agents (Claude Code and Codex) in parallel, guided by manual review. The process was:

  1. Initial exploration: Both agents independently identified the same set of bugs from the failing tests. Their proposed fixes overlapped significantly but differed in approach.
  2. Spec review: The MAG Science Algorithm Document (IMAP-MAG-SW-009-01B, i5r1) and the SDC Data Validation Document (IMAP-OPS-TP-ICL-001) were consulted. Both agents initially proposed full spec-aligned rewrites of section 7.3.4 step 3, but this was scaled back after MAG team feedback (see below).
  3. MAG team feedback: @alastairtree and @maxinelasp clarified that the current code was developed in collaboration with the MAG team and that deviations from the algorithm document may be intentional simplifications. The team requested a conservative fix rather than a full rewrite.
  4. Minimal patches: Both agents produced minimal draft PRs that preserved the existing algorithm shape. These were iteratively compared and refined until both converged on identical code.
  5. CDF validation on flight data: The code was run on real commissioning data (2025-09-27, 2025-09-28, 2025-10-03, 2025-12-05, 2025-12-06) to compare outputs between dev and the patched code. An exhaustive search of the MAG L1B archive through 2026-03-31 found only 8 same-day norm+burst sensor-day cases. Unfortunately, none of these real-flight cases separated current dev from the patched code (they both produced identical CDFs; the available real data does not exercise the exact broken transition patterns).
  6. Semi-synthetic validation: Because real flight data was insufficient to demonstrate the fix, semi-synthetic input data was derived from real 2025-09-27 MAG transitions to create cases that specifically exercise the T015/T016 failure modes. These cases demonstrated clear semantic differences between dev and the patched code, with dev producing incorrect row counts and the patched code matching expectations.

Relationship to the algorithm document

The MAG Science Algorithm Document (section 7.3.4) describes a six-step L1C process. The current code implements a simplified version of these steps that was developed collaboratively between SDC and the MAG team. This PR preserves that simplified implementation. Specifically:

Step 3 (Generate a new NM time sequence) is the area with the largest gap between the document and the code. The document describes a burst-driven approach: find the BM anchor time tC closest to the gap start tA, decimate BM timestamps, shift by (tA - tC), and trim. The current code instead generates gap timestamps arithmetically from the gap boundaries at the appropriate NM cadence, then interpolates burst data onto those timestamps.

Both approaches produce a regular NM time sequence that bridges the gap. The current approach is simpler and was explicitly approved by the MAG team. The key fix in this PR is that the cadence used for gap-fill timestamps now respects the actual NM rate declared in the gap metadata, rather than always using 0.5s (2 Hz). This aligns with the document's intent that "the rate depends on the rate of the NM data before the gap."

Step 5 (Identify remaining gaps with >1.1s threshold) is not changed by this PR. The current implementation uses a per-rate tolerance check rather than a fixed 1.1s threshold, which is arguably more correct for non-2 Hz rates.

The six bugs and their fixes

Fix 1: Rate-aware gap cadence (generate_missing_timestamps)

Bug: Gap-fill timestamps were always generated at 0.5s intervals (2 Hz) regardless of the actual normal mode rate. For non-2 Hz gaps, this was wrong by construction.

Fix: Read the rate from gap[2] (the third element of each gap tuple) and compute the cadence as int(1e9 / rate). Fall back to 0.5s if no rate is provided. Use integer arithmetic throughout to avoid float precision loss at TTJ2000-scale nanosecond timestamps.

Which tests this fixed: Covered directly by test_generate_missing_timestamps_uses_gap_rate() and the updated rate-aware case in test_generate_timeline(). In isolated validation replay, reverting this fix alone did not reintroduce a row-count delta once the other transition fixes were present, so it is best understood as a direct correctness fix for non-2 Hz gaps rather than the sole driver of a specific T015/T016 failure.

Spec alignment: This directly implements the document's instruction that the interpolated rate should depend on "the rate of the NM data before the gap."

Fix 2: 2D array indexing in interpolate_gaps

Bug: filled_norm_timeline is an (n, 8) array, but the gap-timeline extraction compared the entire 2D array against scalar timestamps without specifying a column. The boolean mask broadcasted across all 8 columns instead of just the epoch column (index 0).

Fix: Extract norm_epochs = filled_norm_timeline[:, 0] once at the top of the function and filter against it explicitly.

Which tests this fixed: No isolated validation delta reproduced when this line was reverted in the current patch set. This is still a real correctness fix: it makes the gap-timeline mask operate on the epoch column by construction instead of relying on non-epoch columns being numerically unrelated to TTJ2000 timestamps.

Fix 3: Off-by-one burst slice endpoint

Bug: burst_end was computed as an inclusive index (via min(len-1, ...)), but then used in a Python slice burst_epochs[burst_start:burst_end] which is end-exclusive, silently dropping the last intended burst sample from the interpolation window.

Fix: Make burst_end exclusive from the start: burst_end = min(len(burst_epochs), burst_gap_end + burst_buffer + 1).

Which tests this fixed: T016. In isolated replay, reverting this fix preserved row counts but changed the first mismatched interpolated vector at the burst-resume boundary for both mago and magi.

Fix 4: CIC filter edge clipping (linear_filtered)

Bug: After the CIC filter compensates for group delay, the filtered timestamp range is shorter than the input by delay samples at the tail. linear() then calls remove_invalid_output_timestamps(), which clips output timestamps to this shortened filtered range, dropping valid gap timestamps near the burst boundary.

Fix: Pass extrapolate=True to linear() inside linear_filtered(). This matches the behavior of the spec's reference CIC implementation, which uses InterpolatedUnivariateSpline (extrapolates by default).

Which tests this fixed: T016. In isolated replay, reverting this fix alone causes both T016-mago and T016-magi to lose the final interpolated row.

Interaction with T024: Adding extrapolate=True alone would cause a +1 row regression on T024 (burst-only). Fix 6 below compensates for this.

Fix 5: Spurious micro-gaps at Config mode transitions (_find_rate_segments)

Bug: When MAG transitions between modes via Config mode, the vectors_per_second metadata records a transition timestamp that does not perfectly align with where the cadence actually changes in the data. The existing find_all_gaps() split the timeline at the declared metadata timestamps, and find_gaps() misidentified the boundary misalignment as 1-sample phantom gaps. These caused extra rows in the T015 output.

Fix: New _find_rate_segments() function that walks each configured transition backward while the observed cadence already matches the new rate. This aligns rate segment boundaries to the actual cadence in the data, preventing spurious micro-gaps from being generated in the first place. Also extracts the existing 0.075 tolerance magic number into a shared GAP_TOLERANCE constant used by find_gaps().

Which tests this fixed: T015-magi (had 5 extra rows due to phantom gaps at the Config boundary). Reverting the walkback logic alone reintroduces that exact +5 row failure.

Why not a post-hoc filter: An earlier approach filtered out gaps shorter than 2 cadence periods after detection. This is simpler (around 5 lines vs around 35 lines) but cannot distinguish Config-transition artifacts from genuine single-sample data loss. The root-cause fix via _find_rate_segments() preserves detection of legitimate short gaps while eliminating only the artifacts.

Fix 6: Burst-only trailing-edge trim for T024

Bug: Fix 4 (extrapolate=True) allows interpolation slightly past the CIC-filtered range, which is correct for T016 but causes T024 (burst-only, no normal data) to produce one extra trailing sample beyond the expected output.

Fix: Detect the burst-only case by checking whether any samples in the filled timeline have ModeFlags.NORM status (has_norm_context). When there is no normal-mode context, shorten the burst coverage upper bound by one output cadence to compensate for the CIC filter delay. This removes the extra trailing sample without affecting the normal+burst path.

Which tests this fixed: T024-mago and T024-magi (regressed to 687 rows with Fix 4 alone, expected 686). Reverting this trim alone reproduces that exact +1 row failure.

Additional cleanup: Dead np.delete removal

The existing code called np.delete(filled_norm_timeline, timeline_index) without assigning the result. Since np.delete() returns a new array, this was a no-op. Removed.

Defensive additions

  • Guard clause in find_gaps(): Returns empty when timeline_data has fewer than 2 elements, preventing crashes on single-timestamp rate segments.
  • Integer dtype consistency: Gap arrays use dtype=np.int64 throughout to prevent silent float precision loss on nanosecond timestamps.
  • generate_timeline() bug fix: Update last_index when generated timestamps are empty, preventing duplicate epoch data copies with adjacent gaps.

Test changes

  • Un-skip T015 and T016: Removed the pytest.skip() guard in test_mag_l1c_validation.
  • Exact row-count assertion: Added an assert checking len(expected_output.index) == len(l1c["epoch"].data) before the per-row value checks. This catches off-by-one regressions immediately rather than silently passing when the output has extra rows.
  • Updated test_process_mag_l1c expectations: The existing unit test now expects rate-aware cadence (17 timestamps instead of 15 in the 4 Hz gap region, with updated flag indices).
  • New: test_find_all_gaps_uses_observed_transition_boundary(): Regression test for the _find_rate_segments() micro-gap fix.
  • New: test_generate_missing_timestamps_uses_gap_rate(): Regression test for rate-aware cadence, including the 2 Hz fallback.
  • New: Rate-aware gap fill case in test_generate_timeline(): Verifies 4 Hz gap timestamps are spaced at 0.25s.
  • New: Adjacent-boundary case in test_generate_timeline(): Verifies no duplicate timestamps when adjacent gaps share a real epoch boundary, with a mixed-rate transition.

Files changed

File Lines What
imap_processing/mag/l1c/mag_l1c.py +131/-32 Fixes 1-3, 5-6, guard clauses, cleanup, and helper docstrings required by numpydoc
imap_processing/mag/l1c/interpolation_methods.py +1/-1 Fix 4 (extrapolate=True)
imap_processing/tests/mag/test_mag_l1c.py +73/-6 Updated expectations + 4 new regression tests
imap_processing/tests/mag/test_mag_validation.py +1/-4 Remove skip guard + add exact row-count assertion
poetry.lock +10/-0 Required poetry-lock hook refresh

What this PR does NOT change

  • No rewrite of the gap-fill pipeline architecture
  • No GapFillPlan dataclass or BM-anchor/shift/trim step-3 implementation
  • No day-boundary / inherited-timeline work (T017-T019 scope)
  • No changes to public API or CLI interfaces
  • No timestamp precision fixes outside the gap-fill path

Validation results

Rerun locally on mag-l1c-fix:

imap_processing/tests/mag/test_mag_l1c.py: 18 passed in 0.84s
compare_mag_l1c_validation.py on T013/T014/T015/T016/T024 (mago,magi): all 10 cases PASS
pre-commit run --all-files: PASS

Exact row-count validation (previously skipped T015/T016 now pass):

Test Sensor Expected Got Status
T013 mago 1856 1856 MATCH
T013 magi 1856 1856 MATCH
T014 mago 1792 1792 MATCH
T014 magi 1792 1792 MATCH
T015 mago 2498 2498 MATCH
T015 magi 1538 1538 MATCH
T016 mago 3052 3052 MATCH
T016 magi 1195 1195 MATCH
T024 mago 686 686 MATCH
T024 magi 686 686 MATCH

sapols and others added 3 commits April 1, 2026 01:12
Six localized bug fixes to the existing MAG L1C gap-filling algorithm,
plus targeted regression tests. Keeps the current production algorithm
shape — no architectural rewrite.

Fixes:
- Rate-aware gap cadence in generate_missing_timestamps()
- 2D array indexing in interpolate_gaps() (column 0 explicitly)
- Off-by-one burst slice endpoint
- CIC filter edge clipping via extrapolate=True in linear_filtered()
- Spurious micro-gaps at Config transitions via _find_rate_segments()
- Burst-only trailing-edge trim for T024 exact row count
- Remove dead np.delete() no-op

Refs IMAP-Science-Operations-Center#1993

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Codex <noreply@openai.com>
@sapols
Copy link
Copy Markdown
Author

sapols commented Apr 1, 2026

One scope clarification: My read is that this PR is enough to close the issue, but we haven't completed every historical MAG L1C TODO that came up in the discussion.

Relative to @maxinelasp’s earlier checklist:

  • done: one-input-file handling
  • done/effectively already present: CIC delay compensation and 2x buffer sizing
  • partial: beginning/end-of-day handling (synthetic day bounds are handled, but previous/next-day files are still not loaded)
  • not done here: using FILLVAL instead of zero placeholders

Relative to @alastairtree’s broader “L1C should” list:

  • this PR improves the validated transition/gap-fill behavior substantially
  • but it does not implement true previous/next-day context stitching
  • it does not add broader proof for out-of-order data handling
  • it does not claim comprehensive MAG OFF / no-science coverage

@sapols sapols marked this pull request as ready for review April 1, 2026 09:37
@sapols sapols requested a review from maxinelasp April 1, 2026 09:38
@sapols
Copy link
Copy Markdown
Author

sapols commented Apr 1, 2026

Another clarification: we're probably sweeping a lot under the rug by not counting day-boundary / inherited-timeline work (T017-T019) as in-scope for this ticket. E.g. Alastair's comment about gaps in the Dec 5-6 data appears to be directly related to L1C not treating leading/trailing burst-only intervals on mixed norm+burst days as fillable gaps. In those windows the parent burst L1B data does exist, but the current mixed-input L1C path collapses the output window to the existing normal-mode coverage instead of extending it across those burst-only spans.

@maxinelasp
Copy link
Copy Markdown
Contributor

@sapols Agreed, would you be able to open a follow-up ticket for the day boundary coverage if one does not exist?

@maxinelasp
Copy link
Copy Markdown
Contributor

Initial thoughts from reading through the PR description only:

  1. We should get a thumbs up from the MAG team around using extrapolation in the CIC filter only. Am I correct in understanding that these extrapolated data points are later clipped?
  2. for _find_rate_segements - sounds like a nice change. Does this still use the vecsec transitions attribute in any way?
  3. I'm a little confused by step 6 - what is the root of this extra timestamp getting produced? I worry this solution is a little too closely aligned with the validation test, will there ever be more than one extra timestamps?

Nice work! I'll review the code more thoroughly in a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants