Bug report: MultiDBD.get() returns fewer data points when a second MultiDBD instance (different file type) is created first in the same Python session
Library: dbdreader v0.5.8
Python: 3.14.2
OS: Linux
Summary
Creating a MultiDBD instance for .dbd (flight-computer) files and calling .get() on it before creating a separate MultiDBD instance for .ebd (science-computer) files causes the .ebd instance to return significantly fewer data points than when the .ebd instance is created first. The shortfall appears on every run.
This leads to silently incomplete data and, because the size of the shortfall varies between Python sessions, to non-reproducible processing pipelines.
Minimal reproducible example
import dbdreader
DATA = "/path/to/glider/hd/" # contains echo*.dbd and echo*.ebd
# Case A: load EBD first, then DBD
gl_ebd = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
t_a, _ = gl_ebd.get("sci_ctd41cp_timestamp")
gl_dbd = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
_ = gl_dbd.get("m_gps_lat")
print(f"Case A (EBD first): {len(t_a):,} points") # → 1,038,686
# Case B: load DBD first, then EBD (typical script order)
gl_dbd2 = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
_ = gl_dbd2.get("m_gps_lat")
gl_ebd2 = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
t_b, _ = gl_ebd2.get("sci_ctd41cp_timestamp")
print(f"Case B (DBD first): {len(t_b):,} points") # → 1,028,677 (≈ 10k fewer)
Observed behaviour
| Scenario | sci_ctd41cp_timestamp length | Notes |
| --- | --- | --- |
| EBD only (no DBD in session) | 1,038,686 | consistent across runs |
| EBD after DBD loaded | 1,028,677 | consistent within a single Python session, but the exact count varies between separate Python sessions (observed range: ~964k – ~1,036k) |
Both MultiDBD instances are created from the same 281 .ebd files; len(gl.filenames) reports 281 in all cases.
Impact
A data-processing script that (naturally) loads GPS positions from .dbd files before reading CTD data from .ebd files will receive up to ~74,000 fewer data points than if the loading order is reversed. In practice we observed:
- The downstream dataset produced by the "DBD-first" script had ~964k time steps
- The same script with "EBD-first" order produced ~1,025k time steps
- The extra ~61k points recovered by reordering were not QC failures; they were valid science data
Because the magnitude of the shortfall varies between Python sessions (likely depending on whether certain .ccc cache files have already been decompressed to .cac in a prior run), the pipeline is non-reproducible: re-running the same script on the same input files can yield different output files.
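To see where the missing points fall in time, the two timestamp arrays can be compared directly. The sketch below is only an illustration: `missing_runs` and its `max_gap` threshold are hypothetical names introduced here, not part of dbdreader. It groups timestamps that are present in the longer array but absent from the shorter one into contiguous runs.

```python
def missing_runs(t_full, t_short, max_gap=5.0):
    """Group timestamps found in t_full but not in t_short into runs.

    Missing timestamps that are at most max_gap seconds apart are merged
    into a single (start, end, count) tuple. max_gap is an arbitrary
    illustrative threshold, not a value taken from dbdreader.
    """
    present = {float(x) for x in t_short}
    missing = sorted({float(x) for x in t_full} - present)
    runs = []
    for t in missing:
        if runs and t - runs[-1][1] <= max_gap:
            # Extend the current run to include this timestamp.
            start, _, count = runs[-1]
            runs[-1] = (start, t, count + 1)
        else:
            # Start a new run at this timestamp.
            runs.append((t, t, 1))
    return runs
```

With the arrays from the minimal example, `missing_runs(t_a, t_b)` would list the time ranges that are lost in the DBD-first case. A small number of long runs would be consistent with whole files being dropped, while many short runs would point elsewhere. This is a guess at how to read the output, not a confirmed diagnosis.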
Suspected cause
The issue appears to involve shared state between MultiDBD instances. Candidate locations in the source:
- DBDCache.CACHEDIR (class-level attribute): This is shared across all MultiDBD instances. Reading .dbd files first triggers decompress_file() calls that convert .ccc → .cac files. On subsequent .ebd reads, the newly-present .cac files change which files pass the _safely_open_dbd_file logic, potentially altering the set of files classified as "ok" vs "failed".
- DBDPatternSelect.cache = {} (class-level dict): This timestamp-keyed cache is shared across all instances and could mix up file-open-time metadata between DBD and EBD instances.
- DBDCache decompression race / state: .ccc → .cac decompression during one instance's __init__ modifies the filesystem in a way that changes what the next instance finds.
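One way to test the filesystem-related candidates is to snapshot the cache directory before and after the .dbd load and compare the two listings. The sketch below uses only the standard library. `snapshot_dir` and `diff_snapshots` are hypothetical helpers introduced here, and the commented usage assumes that `dbdreader.DBDCache.CACHEDIR` (the attribute named above) holds the cache directory path, which has not been verified.

```python
from pathlib import Path


def snapshot_dir(path):
    """Return {filename: (size_bytes, mtime_ns)} for regular files in path."""
    snapshot = {}
    for entry in Path(path).iterdir():
        if entry.is_file():
            info = entry.stat()
            snapshot[entry.name] = (info.st_size, info.st_mtime_ns)
    return snapshot


def diff_snapshots(before, after):
    """Return (added, removed, changed) filename lists between snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(
        name for name in set(before) & set(after) if before[name] != after[name]
    )
    return added, removed, changed


# Hypothetical usage (assumes CACHEDIR is the cache directory path):
#   cache_dir = dbdreader.DBDCache.CACHEDIR
#   before = snapshot_dir(cache_dir)
#   ... create the .dbd MultiDBD instance and call .get() ...
#   after = snapshot_dir(cache_dir)
#   print(diff_snapshots(before, after))
```

If the .dbd load adds or rewrites .cac files, they would appear in the `added` or `changed` lists. Empty lists would make the in-memory candidate (the class-level dict) more likely than the on-disk ones.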
Workaround
Load .ebd (science) files before .dbd (flight) files in the same Python session. After this reordering, MultiDBD.get() gives consistent, reproducible results across repeated runs.
Steps to confirm
# Verify consistency when EBD is always first
# (reuses dbdreader and DATA from the minimal example above):
for _ in range(5):
    gl = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
    t, _ = gl.get("sci_ctd41cp_timestamp")
    print(len(t))  # prints 1038686 every time

# Verify inconsistency when DBD comes first:
gl_dbd = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
gl_dbd.get("m_gps_lat")
for _ in range(3):
    gl = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
    t, _ = gl.get("sci_ctd41cp_timestamp")
    print(len(t))  # same value within a session, but differs between sessions
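Because the count differs between separate Python sessions, it may help to run each load order in its own fresh interpreter from a single driver script. That way in-process shared state cannot carry over between trials, and only on-disk state persists. The helper below is a minimal standard-library sketch; `run_in_fresh_interpreter` is a hypothetical name introduced here.

```python
import subprocess
import sys


def run_in_fresh_interpreter(code, timeout=None):
    """Run a Python snippet in a new interpreter and return its stdout.

    Raises subprocess.CalledProcessError if the snippet exits with a
    non-zero status, so a failing trial is not silently ignored.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout.strip()
```

Passing the EBD-only snippet and the DBD-first snippet as strings, each ending in `print(len(t))`, and repeating each call several times would give the per-session counts side by side. If the DBD-first count still changes between these fresh processes, that would suggest state persisting on disk rather than in memory, though other explanations are not ruled out.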