Motivation
We want to support running analysis methods on pre-acquired microscopy datasets — without needing a live instrument connection. Use cases include:
- Segmentation of atoms/particles
- Bayesian optimisation on spectrum images
- Inpainting methods
- Running LLMs on pre-acquired data via MCP
Ideally, users should not need to set up any servers. The server can be hosted centrally (e.g., on the Gatan PC), while users interact through a client-side notebook that handles data loading and exposes simple commands to display their data, so they can build and test different methods on it.

- Open question: since the server sits on the Gatan PC, does the user need to transfer the file there, or is there a better way to handle this?

This will be handy for the next Mic-hackathon.
Supported Dataset Types
| Format | Data Type | Example |
|--------|-----------|---------|
| `.emd` | HAADF | [example] |
| `.emd` | SI-EDX | example |
| `.emd` | Single EDX spectrum | example |
| `.dm4` | HAADF | — |
| `.dm4` | SI-EELS | — |
| `.dm4` | Single EELS spectrum | — |
| `.mrc` | 4D-STEM | — |
Scope for this issue: Start with all .emd cases.
Proposed API
Client-side behaviour
- User downloads a `.emd` file locally
- User instantiates `ThermoDigitalTwin`, passing the file path as a `device_attribute`
- User calls:

```python
mic_proxy.get_preacquired_data(
    file_type=".emd",        # one of ".emd", ".dm4", ".mrc"
    data_type="HAADF",       # one of "HAADF", "SI-EDX", "SI-EELS", "Spectrum"
    file_path="path/to/file",
)
```
If `data_type="HAADF"`:
- Single frame → returns image array + metadata (e.g. `pixel_size`)
- Multiple frames → user can choose to:
  - Get the i-th frame
  - Get the mean of all frames
  - Get all frames as a stack
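The multi-frame options above can be sketched as a small reducer. This is a hypothetical helper (the function and parameter names are illustrative, not part of the existing API):

```python
import numpy as np

def select_frames(stack, mode="stack", index=0):
    """Reduce a HAADF acquisition per the options above.

    `stack` is either a single (y, x) frame or a (frames, y, x) array.
    `mode` and `index` are illustrative names, not the actual API.
    """
    stack = np.asarray(stack)
    if stack.ndim == 2:        # single frame: nothing to choose
        return stack
    if mode == "index":        # the i-th frame
        return stack[index]
    if mode == "mean":         # mean of all frames
        return stack.mean(axis=0)
    return stack               # full stack
```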
If `data_type="SI-EDX"`:
- Returns the corresponding HAADF image + metadata (e.g. `pixel_size`)
- User can then place the beam and acquire a spectrum:

```python
mic_proxy.place_beam(coordinate=(x, y))  # place beam at (x, y)
mic_proxy.get_spectrum()                 # returns spectrum array + metadata (e.g. energy_offset, dispersion)
```
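One way to back `place_beam`/`get_spectrum` on the server is to keep the last beam position as state over the pre-acquired spectrum-image cube. A minimal sketch, assuming a (y, x, energy) cube layout; the class name, metadata keys, and attribute names are assumptions, not the actual `ThermoDigitalTwin` API:

```python
import numpy as np

class PreacquiredSI:
    """Sketch: stateful beam placement over a pre-acquired spectrum image."""

    def __init__(self, cube, energy_offset=0.0, dispersion=1.0):
        self.cube = np.asarray(cube)        # assumed (y, x, n_channels)
        self.energy_offset = energy_offset  # energy of channel 0
        self.dispersion = dispersion        # energy per channel
        self._beam = None                   # last (y, x) position

    def place_beam(self, coordinate):
        """Remember an (x, y) pixel coordinate for the next acquisition."""
        x, y = coordinate
        self._beam = (int(y), int(x))

    def get_spectrum(self):
        """Return the spectrum at the last beam position, plus metadata."""
        if self._beam is None:
            raise RuntimeError("place_beam() must be called first")
        y, x = self._beam
        meta = {"energy_offset": self.energy_offset,
                "dispersion": self.dispersion}
        return self.cube[y, x], meta
```

Making the position server-side state keeps the client calls as simple as the two lines above, at the cost of the statefulness question raised under Open Questions.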
Server-side changes
Extend `ThermoDigitalTwin` with:
- device_attributes
  - `file_path: str` — path to the pre-acquired dataset
New commands

```python
def get_preacquired_data(
    file_type: Literal[".emd", ".dm4", ".mrc"],
    data_type: Literal["HAADF", "SI-EDX", "SI-EELS", "Spectrum"],
) -> ...
```
Internal helpers

```python
def _load_data(
    file_type: Literal[".emd", ".dm4", ".mrc"],
    data_type: Literal["HAADF", "SI-EDX", "SI-EELS", "Spectrum"],
) -> ...
```
Testing
- Upload a small representative file for each format to SciFiDatasets
- Write tests asserting:
  - Correct array shape and dtype per data type
  - Expected metadata fields are present and correctly typed (e.g. `pixel_size`, `energy_offset`, `dispersion`)
  - Correct reader behaviour — using either `pyTEMlib` reader utils or `scifireader` directly
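The assertions could take roughly this shape. Shown here against a stand-in loader so the snippet is self-contained; real tests would call the actual command on the small SciFiDatasets files, and the 512×512 shape and metadata key are assumptions:

```python
import numpy as np

def fake_get_preacquired_data(file_type, data_type, file_path):
    # stand-in for the real command, just to make the assertions runnable
    rng = np.random.default_rng(0)
    data = rng.random((512, 512)).astype(np.float32)
    return data, {"pixel_size": 0.01}

def test_haadf_shape_dtype_and_metadata():
    data, meta = fake_get_preacquired_data(".emd", "HAADF", "test.emd")
    assert data.shape == (512, 512)               # correct array shape
    assert data.dtype == np.float32               # correct dtype
    assert isinstance(meta["pixel_size"], float)  # typed metadata field
```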
Open Questions
- What should be the standard return type? (numpy vs sidpy)
- Should beam placement be stateful or stateless?
- How to handle large datasets (lazy loading vs full load)?