4.8.1.4. Secondary structure assignment (helix, sheet and loop) — MDAnalysis.analysis.dssp
- Author:
- Egor Marin 
- Year:
- 2024 
- Copyright:
- LGPL v2.1+ 
New in version 2.8.0.
The module contains code to build hydrogend bond contact map,
and use it to assign protein secondary structure (DSSP).
This module uses the python version of the original algorithm [1], re-implemented by @ShintaroMinami and available under MIT license from ShintaroMinami/PyDSSP.
Note
This implementation does not discriminate different types of beta-sheets, as well as different types of helices, meaning you will get \(3_{10}\) helices and π-helices labelled as “helix” too.
Using original pydssp
The default implementation uses the original pydssp (v.0.9.0) code, rewritten
without usage of the einops library and hence having no dependencies. If you want
to explicitly use pydssp (or its particular version), install it to your
current environment with python -m pip install pydssp. Please note that the
way MDAnalysis uses pydssp does not support pydssp ‘s capability for batch
processing or its use of the pytorch library.
When using this module in published work please cite [1].
References
4.8.1.4.1. Example applications
4.8.1.4.1.1. Assigning secondary structure of a PDB file
In this example we will simply print a string representing a protein’s secondary structure.
from MDAnalysis.tests.datafiles import PDB
from MDAnalysis.analysis.dssp import DSSP
u = mda.Universe(PDB)
s = ''.join(DSSP(u).run().results.dssp[0])
print(s)
4.8.1.4.1.2. Calculating average secondary structure of a trajectory
Here we take a trajectory and calculate its average secondary structure, i.e. assign a secondary structure label ‘X’ to a residue if most of the frames in the trajectory got assigned the ‘X’ label.
from MDAnalysis.analysis.dssp import DSSP, translate
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
long_run = DSSP(u).run()
mean_secondary_structure = translate(long_run.results.dssp_ndarray.mean(axis=0))
print(''.join(mean_secondary_structure))
Running this code produces
'--EEEE----------HHHHHHH----EE----HHHHH------HHHHHHHHHH------HHHHHHHHHHH---------EEEE-----HHHHHHHHH------EEEEEE--HHHHHH----EE--------EE---E----------------------HHHHHHHHHHHHHHHHHHHHHHHHHHHH----EEEEE------HHHHHHHHH--'
4.8.1.4.1.3. Find parts of the protein that maintain their secondary structure during simulation
In this example, we will find residue groups that maintain their secondary structure along the simulation, and have some meaningful (‘E’ or ‘H’) secondary structure during more than set threshold fraction of frames. We will call these residues “persistent”, for clarity, and label them according to the structure that they maintain during the run:
from MDAnalysis.analysis.dssp import DSSP, translate
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
threshold = 0.8
long_run = DSSP(u).run()
persistent_residues = translate(
    long_run
    .results
    .dssp_ndarray
    .mean(axis=0) > threshold
)
print(''.join(persistent_residues)[:20])
Running this code produces
'--EEEE----------HHHH'
4.8.1.4.2. Analysis classes
- class MDAnalysis.analysis.dssp.dssp.DSSP(atoms: Universe | AtomGroup, guess_hydrogens: bool = True, *, heavyatom_names: tuple[str] = ('N', 'CA', 'C', 'O O1 OT1'), hydrogen_name: str = 'H HN HT1 HT2 HT3')[source]
- Assign secondary structure using the DSSP algorithm. - Analyze a selection containing a protein and assign secondary structure using the Kabsch-Sander algorithm [1]. Only a subset of secondary structure categories are implemented: - ‘H’ represents a generic helix (α-helix, π-helix or \(3_{10}\) helix) 
- ‘E’ represents ‘extended strand’, participating in beta-ladder (parallel or antiparallel) 
- ‘-’ represents unordered part (“loop”) 
 - The implementation was taken from the pydssp package (v. 0.9.0) https://github.com/ShintaroMinami/PyDSSP by Shintaro Minami under the MIT license. - Warning - For DSSP to work properly, your atoms must represent a protein. The hydrogen atom bound to the backbone nitrogen atom is matched by name as given by the keyword argument hydrogen_atom. There may only be a single backbone nitrogen hydrogen atom per residue; the one exception is proline, for which there should not exist any such hydrogens. The default value of hydrogen_atom should handle the common naming conventions in the PDB and in force fields but if you encounter an error or unusual results during your run, try to figure out how to select the correct hydrogen atoms and report an issue in the MDAnalysis issue tracker. - Parameters:
- atoms (Union[Universe, AtomGroup]) – input Universe or AtomGroup. In both cases, only protein residues will be chosen prior to the analysis via select_atoms(‘protein’). Heavy atoms of the protein are then selected by name heavyatom_names, and hydrogens are selected by name hydrogen_name. 
- guess_hydrogens (bool, optional) – whether you want to guess hydrogens positions, by default - True. Guessing is made assuming perfect 120 degrees for all bonds that N atom makes, and a N-H bond length of 1.01 A. If- guess_hydrogensis False, hydrogen atom positions on N atoms will be parsed from the trajectory, except for the “hydrogen” atom positions on PRO residues, and an N-terminal residue.
- heavyatom_names (tuple[str], default ("N", "CA", "C", "O O1 OT1")) – selection names that will be used to select “N”, “CA”, “C” and “O” atom coordinates for the secondary structure determination. The last string contains multiple values for “O” to account for C-term residues. 
- hydrogen_name (str, default "H HN HT1 HT2 HT3") – - This selection should only select a single hydrogen atom in each residue (except proline), namely the one bound to the backbone nitrogen. - Note - To work with different hydrogen-naming conventions by default, the default selection is broad but if hydrogens are incorrectly selected (e.g., a - ValueErroris raised) you must customize hydrogen_name for your specific case.
 
- Raises:
- ValueError – if - guess_hydrogensis True but some non-PRO hydrogens are missing.
 - Examples - For example, you can assign secondary structure for a single PDB file: - >>> from MDAnalysis.analysis.dssp import DSSP >>> from MDAnalysisTests.datafiles import PDB >>> import MDAnalysis as mda >>> u = mda.Universe(PDB) >>> run = DSSP(u).run() >>> print("".join(run.results.dssp[0, :20])) --EEEEE-----HHHHHHHH - The - results.dsspholds the time series of assigned secondary structure, with one character for each residue.- (Note that for displaying purposes we only print the first 20 residues of frame 0 with - run.results.dssp[0, :20]but one would typically look at all residues- run.results.dssp[0].)- The - results.dssp_ndarrayattribute holds a- (n_frames, n_residues, 3)shape ndarray with a one-hot encoding of loop ‘-’ (index 0), helix ‘H’ (index 1), and sheet ‘E’ (index 2), respectively for each frame of the trajectory. It can be used to compute, for instance, the average secondary structure:- >>> from MDAnalysis.analysis.dssp import translate, DSSP >>> from MDAnalysisTests.datafiles import TPR, XTC >>> u = mda.Universe(TPR, XTC) >>> run = DSSP(u).run() >>> mean_secondary_structure = translate(run.results.dssp_ndarray.mean(axis=0)) >>> print(''.join(mean_secondary_structure)[:20]) -EEEEEE------HHHHHHH - New in version 2.8.0: Enabled parallel execution with the - multiprocessingand- daskbackends; use the new method- get_supported_backends()to see all supported backends.- results.dssp
- Contains the time series of the DSSP assignment as a - numpy.ndarrayarray of shape- (n_frames, n_residues)where each row contains the assigned secondary structure character for each residue (whose corresponding resid is stored in- results.resids). The three characters are [‘H’, ‘E’, ‘-’] and representi alpha-helix, sheet and loop, respectively.
 - results.dssp_ndarray
- Contains the one-hot encoding of the time series of the DSSP assignment as a - numpy.ndarrayBoolean array of shape- (n_frames, n_residues, 3)where for each residue the encoding is stored as- (3,)shape- numpy.ndarrayof Booleans so that- Trueat index 0 represents loop (‘-‘),- Trueat index 1 represents helix (‘H’), and- Trueat index 2 represents sheet ‘E’.- See also 
 - results.resids
- A - numpy.ndarrayof length- n_residuesthat contains the residue IDs (resids) for the protein residues that were assigned a secondary structure.
 - classmethod get_supported_backends()[source]
- Tuple with backends supported by the core library for a given class. User can pass either one of these values as - backend=...to- run()method, or a custom object that has- applymethod (see documentation for- run()):- ‘serial’: no parallelization 
- ‘multiprocessing’: parallelization using multiprocessing.Pool 
- ‘dask’: parallelization using dask.delayed.compute(). Requires installation of mdanalysis[dask] 
 - If you want to add your own backend to an existing class, pass a - backends.BackendBasesubclass (see its documentation to learn how to implement it properly), and specify- unsupported_backend=True.- Returns:
- names of built-in backends that can be used in - run(backend=...)()
- Return type:
 - New in version 2.8.0. 
 - property parallelizable
- Boolean mark showing that a given class can be parallelizable with split-apply-combine procedure. Namely, if we can safely distribute - _single_frame()to multiple workers and then combine them with a proper- _conclude()call. If set to- False, no backends except for- serialare supported.- Note - If you want to check parallelizability of the whole class, without explicitly creating an instance of the class, see - _analysis_algorithm_is_parallelizable. Note that you setting it to other value will break things if the algorithm behind the analysis is not trivially parallelizable.- Returns:
- if a given - AnalysisBasesubclass instance is parallelizable with split-apply-combine, or not
- Return type:
 - New in version 2.8.0. 
 - run(start: int | None = None, stop: int | None = None, step: int | None = None, frames: Iterable | None = None, verbose: bool | None = None, n_workers: int | None = None, n_parts: int | None = None, backend: str | BackendBase | None = None, *, unsupported_backend: bool = False, progressbar_kwargs=None)
- Perform the calculation - Parameters:
- start (int, optional) – start frame of analysis 
- stop (int, optional) – stop frame of analysis 
- step (int, optional) – number of frames to skip between each analysed frame 
- frames (array_like, optional) – - array of integers or booleans to slice trajectory; - framescan only be used instead of- start,- stop, and- step. Setting both- framesand at least one of- start,- stop,- stepto a non-default value will raise a- ValueError.- New in version 2.2.0. 
- verbose (bool, optional) – Turn on verbosity 
- progressbar_kwargs (dict, optional) – ProgressBar keywords with custom parameters regarding progress bar position, etc; see - MDAnalysis.lib.log.ProgressBarfor full list. Available only for- backend='serial'
- backend (Union[str, BackendBase], optional) – - By default, performs calculations in a serial fashion. Otherwise, user can choose a backend: - stris matched to a builtin backend (one of- serial,- multiprocessingand- dask), or a- MDAnalysis.analysis.results.BackendBasesubclass.- New in version 2.8.0. 
- n_workers (int) – - positive integer with number of workers (processes, in case of built-in backends) to split the work between - New in version 2.8.0. 
- n_parts (int, optional) – - number of parts to split computations across. Can be more than number of workers. - New in version 2.8.0. 
- unsupported_backend (bool, optional) – - if you want to run your custom backend on a parallelizable class that has not been tested by developers, by default False - New in version 2.8.0. 
 
 - Changed in version 2.2.0: Added ability to analyze arbitrary frames by passing a list of frame indices in the frames keyword argument. - Changed in version 2.5.0: Add progressbar_kwargs parameter, allowing to modify description, position etc of tqdm progressbars - Changed in version 2.8.0: Introduced - backend,- n_workers,- n_partsand- unsupported_backendkeywords, and refactored the method logic to support parallelizable execution.
 
4.8.1.4.3. Functions
- MDAnalysis.analysis.dssp.dssp.assign(coord: ndarray) ndarray[source]
- Assigns secondary structure for a given coordinate array, either with or without assigned hydrogens - Parameters:
- coord (np.ndarray) – input coordinates in either (n, 4, 3) or (n, 5, 3) shape, without or with hydrogens, respectively. Second dimension k represents (N, CA, C, O) atoms coordinates (if k=4), or (N, CA, C, O, H) coordinates (when k=5). 
- Returns:
- output (n,) array with one-hot labels in C3 notation (‘-’, ‘H’, ‘E’), representing loop, helix and sheet, respectively. 
- Return type:
- np.ndarray 
 - New in version 2.8.0. 
- MDAnalysis.analysis.dssp.dssp.translate(onehot: ndarray) ndarray[source]
- Translate a one-hot encoding summary into char-based secondary structure assignment. - One-hot encoding corresponds to C3 notation: ‘-’, ‘H’, ‘E’ are loop, helix and sheet, respectively. Input array must have its last axis of shape 3: - (n_residues, 3)or- (n_frames, n_residues, 3)- Examples - from MDAnalysis.analysis.dssp import translate import numpy as np # encoding 'HE-' onehot = np.array([[False, True, False], # 'H' [False, False, True], # 'E' [True, False, False]]) # '-' ''.join(translate(onehot)) print(''.join(translate(onehot))) - Running this code produces - HE- - Parameters:
- onehot (np.ndarray) – input array of one-hot encoding in (‘-’, ‘H’, ‘E’) order 
- Returns:
- array of ‘-’, ‘H’ and ‘E’ symbols with secondary structure 
- Return type:
- np.ndarray 
 - New in version 2.8.0.