perf: O(N) disk I/O performance bottleneck in DICOM header extraction #28

@chinmayy777

Description

There is a significant performance bottleneck when processing large medical datasets. The get_dicom_header() function in pyaslreport/main.py parses every file in the target directory to build an array of valid DICOMs, and only then returns the header of the first one. For clinical datasets containing thousands of DICOM slices, this causes unnecessary disk I/O and memory overhead that grow with the number of files.

Steps to Reproduce

  1. Trigger a report generation using a directory containing a large number of DICOM files (e.g., 3,000+ slices).
  2. Monitor the execution time and memory usage.
  3. Observe the long delay caused by pydicom.dcmread being called on every file in the directory.
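
For step 2, a small timing helper along these lines may be enough to put numbers on the delay. It is only a sketch: the helper name `measure` is made up for illustration, and the commented-out call assumes that get_dicom_header() can be imported from pyaslreport.main and takes a directory path, which should be checked against the actual signature. Note that tracemalloc only reports memory allocated through Python's allocator, so it will understate the total process footprint.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float, int]:
    """Run func once and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Hypothetical usage; the import path and the single directory argument
# are assumptions, not confirmed by the issue text:
# from pyaslreport.main import get_dicom_header
# header, seconds, peak = measure(get_dicom_header, "/path/to/large_dicom_dir")
# print(f"{seconds:.2f} s, peak {peak / 1e6:.1f} MB traced")
```

Running it once against a small directory and once against a 3,000+ slice directory should show whether the cost scales with the number of files.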

Expected Behavior
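The function should return dcm_header as soon as it successfully parses the first valid DICOM file. In the best case (the first file is valid) this reduces the work from O(N) file reads to O(1). The worst case is still O(N), but only when no valid DICOM appears until the end of the directory.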
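
The expected behavior could be sketched roughly as follows. This is not the code from pyaslreport or from the linked PR. The function name `first_dicom_header`, the injectable `reader` parameter, the sorted iteration order, and the broad `except Exception` are all assumptions made for illustration. The only pydicom call used is `pydicom.dcmread` with `stop_before_pixels=True`, which skips the pixel data that a header lookup does not need.

```python
import os
from typing import Any, Callable, Optional


def _read_with_pydicom(path: str) -> Any:
    # Imported lazily so the sketch can be exercised with a stand-in reader.
    import pydicom

    # stop_before_pixels=True avoids loading pixel data for a header-only read.
    return pydicom.dcmread(path, stop_before_pixels=True)


def first_dicom_header(
    directory: str,
    reader: Callable[[str], Any] = _read_with_pydicom,
) -> Optional[Any]:
    """Return the header of the first file that parses, or None if none do."""
    # Assumption: sorted order is used to make the result deterministic.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            # Early exit: stop at the first file that reads successfully.
            return reader(path)
        except Exception:
            # Assumption: any read failure means "not a valid DICOM".
            # Real code would likely narrow this, for example to
            # pydicom.errors.InvalidDicomError.
            continue
    return None
```

With this shape, the number of dcmread calls equals the position of the first valid file in the directory instead of the total file count.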

Actual Behavior
The loop iterates through and parses every single file in the directory before returning.

Environment

  • OS: Ubuntu 24.04.4 LTS
  • Python Version: 3.11
  • Package: pyaslreport

Additional Context
I have already written an optimized fix for this that returns the header immediately and stops the loop. I will link a PR shortly. #29
