Modify ThermoDigitalTwin to support reading preacquired datasets #62

@utkarshp1161

Motivation

  1. We want to support running analysis methods on pre-acquired microscopy datasets, without needing a live instrument connection. Use cases include:
    1a. Segmentation of atoms/particles
    1b. Bayesian optimisation on spectrum images
    1c. Inpainting methods
    1d. Running LLMs on pre-acquired data via MCP

  2. Ideally, users should not need to set up any servers. The server can be hosted centrally (e.g., on the Gatan PC), while users interact through a client-side notebook that handles data loading and exposes simple commands for displaying their data. They can then build and test different methods on it.

    • Open question: since the server sits on the Gatan PC, does the user have to transfer the file there? Is there a better way to handle this?
  3. This will be handy for the next Mic-hackathon.

Supported Dataset Types

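| Format | Data Type            | Example |
|--------|----------------------|---------|
| .emd   | HAADF                | example |
| .emd   | SI-EDX               | example |
| .emd   | Single EDX spectrum  | example |
| .dm4   | HAADF                |         |
| .dm4   | SI-EELS              |         |
| .dm4   | Single EELS spectrum |         |
| .mrc   | 4D-STEM              |         |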

Scope for this issue: Start with all .emd cases.

Proposed API

Client-side behaviour

  1. User downloads a .emd file locally
  2. User instantiates ThermoDigitalTwin, passing the file path as a device_attribute
  3. User calls:
mic_proxy.get_preacquired_data(
    file_type: Literal[".emd", ".dm4", ".mrc"],
    data_type: Literal["HAADF", "SI-EDX", "SI-EELS", "Spectrum"],
    file_path: str,  # e.g. "path/to/file"
)

If data_type="HAADF":

  • Single frame → returns image array + metadata (e.g. pixel_size)
  • Multiple frames → user can choose:
    • Get the i-th frame
    • Get the mean of all frames
    • Get all frames as a stack
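
The three multi-frame options above could be handled by one small helper. This is only a sketch: the name `select_frames`, the `mode` values, and the assumed `(n_frames, height, width)` layout are illustrative and not part of the proposed API.

```python
from typing import Literal, Optional

import numpy as np


def select_frames(
    frames: np.ndarray,
    mode: Literal["index", "mean", "stack"] = "stack",
    index: Optional[int] = None,
) -> np.ndarray:
    """Return one frame, the frame average, or the full stack.

    Hypothetical helper. `frames` is assumed to be either a single
    2D image or a 3D stack shaped (n_frames, height, width).
    """
    if frames.ndim == 2:
        # Single-frame dataset: there is nothing to choose from.
        return frames
    if frames.ndim != 3:
        raise ValueError(f"expected a 2D image or 3D stack, got ndim={frames.ndim}")
    if mode == "index":
        if index is None:
            raise ValueError("mode='index' requires an index")
        return frames[index]
    if mode == "mean":
        # Average over the frame axis only.
        return frames.mean(axis=0)
    if mode == "stack":
        return frames
    raise ValueError(f"unknown mode: {mode!r}")
```

Whether this choice belongs on the client or the server is left open here; the helper works the same either way.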

If data_type="SI-EDX":

  • Returns the corresponding HAADF image + metadata (e.g. pixel_size)
  • User can then place the beam and acquire a spectrum:
mic_proxy.place_beam(coordinate: tuple)  # place beam at (x, y)
mic_proxy.get_spectrum()                 # returns spectrum array + metadata (e.g. energy_offset, dispersion)
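
To make the place_beam / get_spectrum flow concrete, here is a minimal sketch of the stateful variant. The class name `SpectrumImageSketch`, the `(y, x, energy_channel)` cube layout, and the use of integer pixel indices for the coordinate are all assumptions for illustration; the real twin may differ.

```python
from typing import Dict, Optional, Tuple

import numpy as np


class SpectrumImageSketch:
    """Hypothetical stand-in for the SI-EDX part of the twin (not the real class)."""

    def __init__(self, cube: np.ndarray, energy_offset: float, dispersion: float):
        # Assumption: cube is laid out as (y, x, energy_channel).
        self._cube = cube
        self._metadata = {"energy_offset": energy_offset, "dispersion": dispersion}
        self._beam: Optional[Tuple[int, int]] = None

    def place_beam(self, coordinate: Tuple[int, int]) -> None:
        """Remember the beam position; coordinate is (x, y) in pixels."""
        x, y = coordinate
        height, width = self._cube.shape[:2]
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"beam position {coordinate} is outside the {width}x{height} scan")
        self._beam = (x, y)

    def get_spectrum(self) -> Tuple[np.ndarray, Dict[str, float]]:
        """Return the spectrum at the stored beam position plus metadata."""
        if self._beam is None:
            raise RuntimeError("call place_beam() before get_spectrum()")
        x, y = self._beam
        return self._cube[y, x, :], dict(self._metadata)
```

A stateless alternative would pass the coordinate directly to the spectrum call; the stateful form is shown because it mirrors how a live instrument is driven.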

Server-side changes

Extend ThermoDigitalTwin with:

device_attributes

  • file_path: str — path to the pre-acquired dataset

New commands

def get_preacquired_data(
    file_type: Literal[".emd", ".dm4", ".mrc"],
    data_type: Literal["HAADF", "SI-EDX", "SI-EELS", "Spectrum"]
) -> ...

Internal helpers

def _load_data(
    file_type: Literal[".emd", ".dm4", ".mrc"],
    data_type: Literal["HAADF", "SI-EDX", "SI-EELS", "Spectrum"]
) -> ...
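
Before dispatching to a reader, the helper will probably need to reject combinations that the dataset table does not list. A rough sketch of that check follows; `SUPPORTED` and `check_request` are hypothetical names, and mapping the "Single EDX/EELS spectrum" rows onto the "Spectrum" literal is an assumption.

```python
from pathlib import Path

# Combinations read off the "Supported Dataset Types" table.
# Note: the table lists 4D-STEM for .mrc, but the data_type Literal has
# no matching value yet, so .mrc accepts nothing in this sketch.
SUPPORTED = {
    ".emd": {"HAADF", "SI-EDX", "Spectrum"},
    ".dm4": {"HAADF", "SI-EELS", "Spectrum"},
    ".mrc": set(),
}


def check_request(file_path: str, file_type: str, data_type: str) -> None:
    """Raise ValueError if the request cannot be served (hypothetical helper)."""
    if file_type not in SUPPORTED:
        raise ValueError(f"unsupported file_type: {file_type!r}")
    suffix = Path(file_path).suffix.lower()
    if suffix != file_type:
        raise ValueError(f"file_type {file_type!r} does not match file suffix {suffix!r}")
    if data_type not in SUPPORTED[file_type]:
        raise ValueError(f"{data_type!r} is not available for {file_type} files")
```

Failing early like this keeps reader errors (which can be cryptic for HDF5-based formats) away from the user.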

Testing

  • Upload a small representative file for each format to SciFiDatasets
  • Write tests asserting:
    • Correct array shape and dtype per data type
    • Expected metadata fields are present and correctly typed (e.g. pixel_size, energy_offset, dispersion)
    • Correct reader behaviour — using either pyTEMlib reader utils or scifireader directly
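
The shape, dtype, and metadata assertions listed above might look roughly like this. The checker names are hypothetical, and the sketch assumes numpy arrays with plain-dict metadata, which is still an open question below.

```python
import numpy as np


def check_haadf_payload(image: np.ndarray, metadata: dict) -> None:
    """Assert the basic contract for a single HAADF frame (hypothetical)."""
    assert isinstance(image, np.ndarray)
    assert image.ndim == 2, "a single HAADF frame should be 2D"
    assert np.issubdtype(image.dtype, np.number)
    assert isinstance(metadata.get("pixel_size"), float)
    assert metadata["pixel_size"] > 0


def check_spectrum_payload(spectrum: np.ndarray, metadata: dict) -> None:
    """Assert the basic contract for a single spectrum (hypothetical)."""
    assert isinstance(spectrum, np.ndarray)
    assert spectrum.ndim == 1, "a single spectrum should be 1D"
    assert np.issubdtype(spectrum.dtype, np.number)
    for key in ("energy_offset", "dispersion"):
        assert isinstance(metadata.get(key), float), f"missing or mistyped {key}"
```

In a real test these would be called on the output of the twin for each small file uploaded to SciFiDatasets.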

Open Questions

  • What should be the standard return type? (numpy vs sidpy)
  • Should beam placement be stateful or stateless?
  • How to handle large datasets (lazy loading vs full load)?
