Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON)
#4712
Open
marinegor
wants to merge
139
commits into
MDAnalysis:develop
from
marinegor:feature/mmcif
aa2a88f
Start working on MMCIF parser
marinegor 218cf43
Add first (not working) version of MMCIFReader and MMCIF topology parser
marinegor 7f78e02
Do some squashing
marinegor 6682d6e
Remove inherited docs
marinegor 817f3a0
Try improving the parsing
marinegor 3cc8c80
Try three independent loops over the model
marinegor f1bf325
Merge remote-tracking branch 'upstream/develop' into feature/mmcif
marinegor d21c220
Add gemmi dependency
marinegor 2a1be15
necessary params
marinegor 77645e6
finished sorting atom attrs
marinegor 91e6942
add function for transformation into *idx
marinegor 9a0c086
oh damn seems to finally be working
marinegor 9c731df
remove TODOs
marinegor 8b40ec7
Remove debug prints
marinegor bdcbd73
Merge branch 'develop' into feature/mmcif
marinegor 401a4d3
try to pack things into separate class in utils?
marinegor 9c336bd
remove unnecessary functions
marinegor def88e4
copy all loops into separate functions
marinegor cabfd37
Move loops over structures into functions
marinegor 4c9d930
Move coordinate fetching into function for the coordinate reader as well
marinegor 184491a
Fix imports
marinegor 3de8565
Start adding documentation
marinegor ca6ebbb
Reference MMCIFParser in PDBParser
marinegor 45077ad
Add documentation for trajectory and topology parsers
marinegor 9a1a59a
Add mmcif tests
marinegor 27c10d6
Update format specifications
marinegor 950cfcf
Write simple tests
marinegor 8d1a8b5
Merge remote-tracking branch 'upstream/develop' into feature/mmcif
marinegor ef29338
update github action with gemmi
marinegor caca17e
fix gemmi import errors
marinegor f0e49cc
add mmcif testfiles
marinegor b7ada7c
add mmcif to __all__
marinegor e80632c
add black instead of ruff
marinegor 10f3124
Merge remote-tracking branch 'origin/feature/mmcif' into feature/mmcif
marinegor 98353fe
fix function signature
marinegor 35fa187
Merge remote-tracking branch 'upstream/develop' into feature/mmcif
marinegor e68fcce
Add documentation for mmcif coords
marinegor 263e9f1
expand documentation and type annotations
marinegor ba47d53
add invalid cif and MMCIF rst files
marinegor 9ffb6f2
add mmcif with invalid atom type
marinegor fcfc6c0
add biopython cif and fix invalid cif formatting
marinegor 0de720e
remove weird docs part
marinegor 236b286
fix fstring
marinegor b562115
replace version to 2.9.0
marinegor 816b23f
Merge remote-tracking branch 'upstream/develop' into feature/mmcif
marinegor 92ae164
update changelog
marinegor 88c64a3
move gemmi to optional deps
marinegor 59b7e29
fix issue with accidentally updated datafiles
marinegor f2c23c8
add mmcif to all
marinegor 776676e
Start working on MMCIF parser
marinegor 71e60f4
Add first (not working) version of MMCIFReader and MMCIF topology parser
marinegor 36b7125
Do some squashing
marinegor b058941
Remove inherited docs
marinegor ef30fa7
Try improving the parsing
marinegor 95572c1
Try three independent loops over the model
marinegor a8a9436
Add gemmi dependency
marinegor 6706bbe
necessary params
marinegor 8cf9da4
finished sorting atom attrs
marinegor f13156b
add function for transformation into *idx
marinegor dda981c
oh damn seems to finally be working
marinegor ebdf849
remove TODOs
marinegor 47043f6
Remove debug prints
marinegor 9770d7b
try to pack things into separate class in utils?
marinegor fd7f70d
remove unnecessary functions
marinegor 1493056
copy all loops into separate functions
marinegor 3d7fbb9
Move loops over structures into functions
marinegor 9b9286e
Move coordinate fetching into function for the coordinate reader as well
marinegor b8f3c04
Fix imports
marinegor 0f38a2d
Start adding documentation
marinegor b915aab
Reference MMCIFParser in PDBParser
marinegor 0d61248
Add documentation for trajectory and topology parsers
marinegor 34d76ca
Add mmcif tests
marinegor b242aa5
Update format specifications
marinegor 4fc3a78
Write simple tests
marinegor 14fa756
fix actions
marinegor e3a9a1f
fix gemmi import errors
marinegor d492b4e
add mmcif testfiles
marinegor 1880e4a
add mmcif to __all__
marinegor 927d7a0
add black instead of ruff
marinegor ad0f0be
fix function signature
marinegor e03c3e5
Add documentation for mmcif coords
marinegor 4d79205
expand documentation and type annotations
marinegor 32d7cf9
add invalid cif and MMCIF rst files
marinegor 0df8c3a
add mmcif with invalid atom type
marinegor 05c6ea1
add biopython cif and fix invalid cif formatting
marinegor 88dab79
remove weird docs part
marinegor a82fe52
fix fstring
marinegor e3f1714
replace version to 2.9.0
marinegor db46016
fix actions
marinegor 32cd103
fix datafiles
marinegor 805089e
add mmcif to all
marinegor 55c3dbb
add mmcif to coordinates and topology modules
marinegor cd201d0
update docs following yuxuanzhuang comments
marinegor 81f0b5b
merge remote
marinegor 22d1cca
add linked issues and prs to changelog
marinegor d1ba434
remove mmcif files from black ignore
marinegor a03b56f
add tests for multimodel file warnings
marinegor bd4c255
add tests for cryst1 warnings
marinegor 3d61dc5
black
marinegor aed9b54
add invalid cif file itself
marinegor 53c51f4
format datafiles with black
marinegor 3e0324c
merge develop
marinegor 205c910
add tests for 1BD2 and other files mentioned in discussion
marinegor f9f7912
enable linting of mmcif-related files
marinegor 9a90316
add test files
marinegor 1c5a549
add short version of test files instead
marinegor dfc10e6
fix short file versions that coordinate test passes
marinegor aae46c8
wip
marinegor 9cf4027
wip: short versions work for some reason
marinegor 522e125
* Fix MMCIF topology creation
PardhavMaradani 0b0ec81
add slightly more smoke tests
marinegor b3d7c1c
update changelog
marinegor 801d85f
remove fixmes and move some tests to topology tests
marinegor bdd070e
black formatter
marinegor 6b2c6c6
match formats in toplogy and coordinate parsers for mmcif
marinegor 1a7f607
replace gemmi.get_structure
ljwoods2 157c365
Revert "replace gemmi.get_structure"
ljwoods2 1d01a3f
add input fmt tests
ljwoods2 2020484
topology format kwarg bug
ljwoods2 d362f91
error handling tweaks
ljwoods2 959d78b
remove tmp test files
ljwoods2 15441f0
apply black
marinegor e2e097f
fix changelog
marinegor bc89c0b
add one more issue to changelog
marinegor adcca0b
Merge branch 'develop' into feature/mmcif
BradyAJohnston f17062c
address minor fixes
BradyAJohnston 25424a0
Apply suggestions from code review
BradyAJohnston b2f6abc
formatting and doc cleanup
BradyAJohnston 2b64581
de-duplicate _read_gemmi_structure
BradyAJohnston d1d8467
properly expose parsers in topology/init.py
BradyAJohnston bb857a8
undo unnecessary tpr changes
BradyAJohnston b026179
refactor main zip into simpler loop
BradyAJohnston 547eed7
match case & logging
BradyAJohnston 95600b1
fix gemmi docs linking
BradyAJohnston e053cd6
import Structure behind TYPE_CHECKING
BradyAJohnston a7ea152
fix type imports
BradyAJohnston 40d7747
cleanup / fix docs
BradyAJohnston 6d4d9a4
black format
BradyAJohnston 4879d6b
Apply suggestions from code review
BradyAJohnston
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,175 @@ | ||
| # -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*- | ||
| # vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 | ||
| # | ||
| """ | ||
| MMCIF structure files in MDAnalysis --- :mod:`MDAnalysis.coordinates.MMCIF` | ||
| =========================================================================== | ||
|
|
||
| .. versionadded:: 2.11.0 | ||
|
|
||
| MDAnalysis reads coordinates from MMCIF (macromolecular Crystallographic | ||
| Information File) files, also known as PDBx/mmCIF format, using the | ||
| `gemmi <https://gemmi.readthedocs.io>`_ library as a backend. MMCIF is a | ||
| more modern and flexible alternative to the PDB format, capable of storing | ||
| detailed structural and experimental data about biological macromolecules. | ||
|
|
||
| MMCIF files use a structured, tabular format with key-value pairs to store | ||
| both coordinate and atom information. The format supports multiple | ||
| models/frames, though this implementation currently reads only the first | ||
| model and issues a warning for multi-model files. | ||
|
|
||
| The reader automatically detects if the structure contains placeholder unit | ||
| cell information (usually the case for cryoEM structures, where cell | ||
| parameters are (1, 1, 1, 90, 90, 90)) and sets dimensions to ``None`` | ||
| in that case. | ||
|
|
||
| Basic usage | ||
| ----------- | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| import MDAnalysis as mda | ||
|
|
||
| u = mda.Universe("structure.cif") | ||
|
|
||
| # or from a compressed file | ||
| u = mda.Universe("structure.cif.gz") | ||
|
|
||
| See Also | ||
| -------- | ||
| * `wwPDB MMCIF Resources <http://mmcif.wwpdb.org>`_ | ||
| * `Gemmi library documentation <https://gemmi.readthedocs.io>`_ | ||
|
|
||
| Classes | ||
| ------- | ||
|
|
||
| .. autoclass:: MMCIFReader | ||
| :members: | ||
| :inherited-members: | ||
|
|
||
| """ | ||
|
|
||
| import logging | ||
| import warnings | ||
| from pathlib import Path | ||
| from typing import TYPE_CHECKING | ||
|
|
||
| import numpy as np | ||
|
|
||
| from ..lib import util | ||
| from . import base | ||
|
|
||
| if TYPE_CHECKING: | ||
| from gemmi import Model, Structure | ||
|
|
||
| try: | ||
| import gemmi | ||
|
|
||
| HAS_GEMMI = True | ||
| except ImportError: | ||
| HAS_GEMMI = False | ||
|
|
||
| logger = logging.getLogger("MDAnalysis.coordinates.MMCIF") | ||
|
|
||
|
|
||
| def _read_gemmi_structure(filename: str | Path) -> "Structure": | ||
| # This function exists because the gemmi Python API lacks some methods. | ||
| # Within gemmi in C++, one can call `read_structure` and in-memory, string, and filepath | ||
| # arguments will all be accepted: | ||
| # https://github.com/project-gemmi/gemmi/blob/4416e298f204b7b57bf5b3051d7efd4fe02957cf/include/gemmi/mmread.hpp#L86 | ||
|
|
||
| # However, for MDA to similarly accept common input types like streams (open File-like objs and StringIO objs) | ||
| # as well as pathlib.Path() objects, we have to use the Python API methods available currently (as of 0.7.3) | ||
| # with a string as a common target for all input types. | ||
| # For this, we call gemmi.cif.read_string (https://gemmi.readthedocs.io/en/latest/cif.html#reading) to handle CIF | ||
| # strings and gemmi.read_pdb_string to handle PDB strings (no one method can handle both formats currently Py-side) | ||
|
|
||
| # openany() is used for both file paths and streams rather than handling them separately; | ||
| # reading the whole file into a string is less efficient, but it is easier to maintain. | ||
|
|
||
| # If the gemmi Python API is extended, this function can be simplified/removed and replaced with something like | ||
| # gemmi.read_structure | ||
| with util.openany(filename) as f: | ||
| content_as_str = f.read() | ||
| try: | ||
| # String -> Doc -> Block -> Structure | ||
| # making Structure from first Block in Document as is done internally in gemmi: | ||
| # https://github.com/project-gemmi/gemmi/blob/4416e298f204b7b57bf5b3051d7efd4fe02957cf/include/gemmi/mmcif.hpp#L32 | ||
| return gemmi.make_structure_from_block( | ||
| gemmi.cif.read_string(content_as_str)[0] | ||
| ) | ||
| except ValueError as e: | ||
| try: | ||
| return gemmi.read_pdb_string(content_as_str) | ||
| except ValueError: | ||
| raise e | ||
|
|
||
|
|
||
| def _get_coordinates(model: "Model") -> np.ndarray: | ||
| """Get coordinates of all atoms in the `gemmi.Model` object. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| model | ||
| input ``gemmi.Model``, e.g. ``gemmi.read_structure('file.cif')[0]`` | ||
|
|
||
| Returns | ||
| ------- | ||
| np.ndarray, shape [n, 3], where ``n`` is the number of atoms in the structure. | ||
| """ | ||
| return np.array( | ||
| [[*at.pos.tolist()] for chain in model for res in chain for at in res] | ||
| ) | ||
|
|
||
|
|
||
| class MMCIFReader(base.SingleFrameReaderBase): | ||
| """Reads from an MMCIF file using :mod:`gemmi` as a backend. | ||
|
|
||
| Notes | ||
| ----- | ||
|
|
||
| If the structure represents an ensemble, only the first structure in the ensemble | ||
| is read here (and a warning is issued). Also, if the structure has a placeholder "CRYST1" | ||
| record (1, 1, 1, 90, 90, 90), the unit cell dimensions are set to ``None`` instead. | ||
|
|
||
| .. versionadded:: 2.11.0 | ||
| """ | ||
|
|
||
| format = ["cif", "cif.gz", "mmcif", "mmcif.gz"] | ||
| units = {"time": None, "length": "Angstrom"} | ||
|
|
||
| def _read_first_frame(self): | ||
| structure = self._get_structure() | ||
| cell_dims = np.array( | ||
| [ | ||
| getattr(structure.cell, name) | ||
| for name in ("a", "b", "c", "alpha", "beta", "gamma") | ||
| ] | ||
| ) | ||
| if len(structure) > 1: | ||
| wmsg = ( | ||
| f"File {self.filename} has {len(structure)} models, " | ||
| "but only the first one will be read" | ||
| ) | ||
| warnings.warn(wmsg) | ||
| logger.warning(wmsg) | ||
|
|
||
| model = structure[0] | ||
| coords = _get_coordinates(model) | ||
| self.n_atoms = len(coords) | ||
| self.ts = self._Timestep.from_coordinates(coords, **self._ts_kwargs) | ||
| if np.allclose(cell_dims, np.array([1.0, 1.0, 1.0, 90.0, 90.0, 90.0])): | ||
| wmsg = ( | ||
| "1 A^3 CRYST1 record," | ||
| " this is usually a placeholder." | ||
| " Unit cell dimensions will be set to None." | ||
| ) | ||
| warnings.warn(wmsg) | ||
| logger.warning(wmsg) | ||
| self.ts.dimensions = None | ||
| else: | ||
| self.ts.dimensions = cell_dims | ||
| self.ts.frame = 0 | ||
|
|
||
| def _get_structure(self): | ||
| return _read_gemmi_structure(self.filename) | ||
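
The placeholder-cell handling in `_read_first_frame` can be illustrated in isolation. The following is a minimal sketch, not the code from this PR: the helper names `is_placeholder_cell` and `dimensions_or_none`, the logger name, and the warning text are hypothetical. It also assumes the standard library's `math.isclose` as a dependency-free stand-in for the `np.allclose` call in the diff; the tolerance semantics of the two differ slightly.

```python
import logging
import math
import warnings
from typing import List, Optional, Sequence

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("example.mmcif")

# Placeholder cell often written when no real unit cell exists
# (a = b = c = 1 Angstrom, alpha = beta = gamma = 90 degrees).
PLACEHOLDER_CELL = (1.0, 1.0, 1.0, 90.0, 90.0, 90.0)


def is_placeholder_cell(cell_dims: Sequence[float]) -> bool:
    """Return True if all six cell parameters match the placeholder."""
    return len(cell_dims) == 6 and all(
        math.isclose(value, ref, rel_tol=1e-5, abs_tol=1e-8)
        for value, ref in zip(cell_dims, PLACEHOLDER_CELL)
    )


def dimensions_or_none(cell_dims: Sequence[float]) -> Optional[List[float]]:
    """Return the cell parameters, or None (with a warning) for a placeholder."""
    if is_placeholder_cell(cell_dims):
        msg = "Placeholder unit cell found; dimensions will be set to None."
        # Emit through both channels, mirroring the pattern used in the diff.
        warnings.warn(msg)
        logger.warning(msg)
        return None
    return list(cell_dims)


if __name__ == "__main__":
    print(dimensions_or_none([1, 1, 1, 90, 90, 90]))
    print(dimensions_or_none([50.0, 60.0, 70.0, 90.0, 90.0, 90.0]))
```

Returning `None` rather than a 1 Å cell keeps downstream periodic-boundary code from treating the placeholder as a real box, which appears to be the motivation for the check in the reader.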