Prop3D.common package

Submodules

Prop3D.common.AbstractStructure module

class Prop3D.common.AbstractStructure.AbstractStructure(name: str, file_mode: str = 'r', coarse_grained: bool = False)

Bases: object

A base structure class that holds default methods for dealing with structures. Subclasses can be created to use different protein libraries, e.g. BioPython, Biotite, or our own HDF/HSDS distributed structure.

Parameters:

name (str) – Name of protein, e.g. PDB or CATH ID
file_mode (str (r, w, r+, w+, etc)) – Open file for reading or writing. Default is read-only ('r'), in which case no methods will affect the underlying file
coarse_grained (boolean) – Use a residue-only model instead of an all-atom model. Default False. Warning: not fully implemented.

copy(empty: bool = False) → _Self

Create a deep copy of current structure.

Parameters: empty (deprecated) – Don’t copy features

deep_copy_feature(feature_name: str) → Any

Subclass this method to handle custom copying of specific features

Parameters: feature_name (str) – Feature name to copy
Raises: NotImplementedError – if no method exists to handle the feature
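The override pattern described for deep_copy_feature can be sketched as follows. The class names `BaseStructure` and `ArrayStructure`, the `features` dictionary and the `charges` feature are hypothetical stand-ins, not Prop3D code; the sketch only illustrates a base method that raises `NotImplementedError` and a subclass that handles one feature.

```python
import copy
from typing import Any


class BaseStructure:
    """Hypothetical stand-in for a base structure class (not Prop3D's code)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.features: dict[str, Any] = {}

    def deep_copy_feature(self, feature_name: str) -> Any:
        # The base class has no feature-specific copy logic.
        raise NotImplementedError(f"No copy handler for feature {feature_name!r}")


class ArrayStructure(BaseStructure):
    """Hypothetical subclass that knows how to copy one feature."""

    def deep_copy_feature(self, feature_name: str) -> Any:
        if feature_name == "charges":
            return copy.deepcopy(self.features["charges"])
        # Defer to the base class for anything this subclass does not handle.
        return super().deep_copy_feature(feature_name)


s = ArrayStructure("example")
s.features["charges"] = [0.1, -0.3]
charges_copy = s.deep_copy_feature("charges")
```

Calling `s.deep_copy_feature("unknown")` in this sketch falls through to the base class and raises `NotImplementedError`.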

normalize_features(columns: str | list[str] | None = None) → _Self

Normalize features using min-max scaling

Parameters: columns (str or list of strs) – Names of feature columns to normalize
Return type: A copy of this AbstractStructure with normalized features in the dataframe
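Min-max scaling maps each value v in a column to (v - min) / (max - min), so the column ends up in the range [0, 1]. A minimal sketch on a plain list, assuming nothing about how Prop3D stores its features; the function name `min_max_scale` and the handling of constant columns are illustrative choices:

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero (this choice is an assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2.0, 4.0, 6.0])
```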

get_atoms(include_hetatms: bool = False, exclude_atoms: bool | None = None, include_atoms: bool | None = None) → Iterator[Any]

Subclass to enumerate protein model for all atoms with options to filter

Parameters:

include_hetatms (boolean) – Include hetero atoms or not. Default is False.
exclude_atoms (list) – List of atoms to skip during enumeration (depends on model if id or python object)
include_atoms (list) – List of atoms to include during enumeration (depends on model if id or python object)

filter_atoms(include_hetatms: bool = False, exclude_atoms: list[Any] | None = None, include_atoms: list[Any] | None = None) → Iterator[Any]

Subclass to enumerate protein model for all atoms with options to filter

Parameters:

include_hetatms (boolean) – Include hetero atoms or not. Default is False.
exclude_atoms (list) – List of atoms to skip during enumeration (depends on model if id or python object)
include_atoms (list) – List of atoms to include during enumeration (depends on model if id or python object)
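The three filters can be combined in a single generator. The sketch below assumes atoms are plain dictionaries with `serial` and `hetatm` keys and that the include and exclude lists hold serial numbers; both are illustrative assumptions, since the actual atom representation depends on the subclass.

```python
from typing import Any, Iterator, Optional


def iter_filtered_atoms(
    atoms: list[dict[str, Any]],
    include_hetatms: bool = False,
    exclude_atoms: Optional[list[int]] = None,
    include_atoms: Optional[list[int]] = None,
) -> Iterator[dict[str, Any]]:
    """Yield atoms that pass the heteroatom, exclude and include filters."""
    excluded = set(exclude_atoms or [])
    included = set(include_atoms) if include_atoms is not None else None
    for atom in atoms:
        if atom["hetatm"] and not include_hetatms:
            continue  # Skip hetero atoms unless explicitly requested.
        if atom["serial"] in excluded:
            continue
        if included is not None and atom["serial"] not in included:
            continue
        yield atom


atoms = [
    {"serial": 1, "name": "N", "hetatm": False},
    {"serial": 2, "name": "CA", "hetatm": False},
    {"serial": 3, "name": "ZN", "hetatm": True},
]
kept = [a["serial"] for a in iter_filtered_atoms(atoms, exclude_atoms=[1])]
with_het = [a["serial"] for a in iter_filtered_atoms(atoms, include_hetatms=True)]
```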

get_surface() → Any: Returns all surface atoms, using DSSP accessible surface value

get_bfactors() → Any: Get bfactors for all atoms

save_pdb(path: str | None = None, header: str | list[str] | None = None, file_like: bool = False, rewind: bool = True) → str | IO

Write PDB to file

Parameters:

path (None or str) – Path to save PDB file. If None, file_like needs to be True.
header (str or list of strs) – Header string to write to the beginning of each PDB file
file_like (boolean) – Return a StringIO object of the PDB file, do not write to disk. Default False.
rewind (boolean) – If returning a file-like object, rewind to the beginning of the file

Return type:

None or file-like object of PDB file data
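The interaction between path, file_like and rewind can be sketched with `io.StringIO`. The helper `write_pdb_lines` is hypothetical and takes ready-made text lines; it does not reproduce how Prop3D formats PDB records.

```python
import io
from typing import Optional, Union


def write_pdb_lines(
    lines: list[str],
    path: Optional[str] = None,
    file_like: bool = False,
    rewind: bool = True,
) -> Union[str, io.StringIO]:
    """Write lines to disk, or to an in-memory buffer when file_like is True."""
    text = "\n".join(lines) + "\n"
    if file_like:
        buffer = io.StringIO()
        buffer.write(text)
        if rewind:
            buffer.seek(0)  # Let the caller read from the start.
        return buffer
    if path is None:
        raise ValueError("path is required unless file_like is True")
    with open(path, "w") as handle:
        handle.write(text)
    return path


rewound = write_pdb_lines(["REMARK example", "END"], file_like=True)
not_rewound = write_pdb_lines(["REMARK example", "END"], file_like=True, rewind=False)
```

Without rewinding, the buffer position stays at the end of the written text, so an immediate `read()` returns an empty string.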

write_features(features: str | list[str] | None, coarse_grained: bool = False, name: str | None = None, work_dir: str | None = None) → None

Subclass to write features to a specific file, depending on protein loading class, e.g. HDF

Parameters:

features (str or list of strs) – Features to write
coarse_grained (boolean) – Include features only at the residue level. Default False
name (str) – File name to write file
work_dir (None or str) – Directory to save file

Write features to PDB files in the bfactor column. One feature per PDB file.

Parameters:

features_to_use (str or list of strs) – Features to write
name (None or str) – File name to write file
coarse_grained (boolean) – Include features only at the residue level. Default False
work_dir (None or str) – Directory to save files.
other (obj) – Ignored. May be useful for subclasses.

Return type:

File names of each written PDB file for each feature

add_features(coarse_grained: bool = False, **features): Subclass to add a feature column to dataset

get_coords(include_hetatms: bool = False, exclude_atoms: list[Any] | None = None) → Any

Subclass to return all XYZ coordinates for each atom

Parameters:

include_hetatms (bool) – Include heteroatoms or not. Default is False.
exclude_atoms (list of atoms) – Atoms to exclude while getting coordinates

Return type:

XYZ coordinates from each atom in a specified format of subclass

get_sequence() → str: Get amino acid sequence from structure

update_coords(coords: array) → None: Subclass to create method to update XYZ coordinates with a new set of coordinates for the same atoms

get_mean_coord() → array: Get the mean XYZ coordinate or center of mass.

get_max_coord() → array: Get the maximum coordinate in each dimension

get_min_coord() → array: Get the minimum coordinate in each dimension
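The three summary coordinates are per-dimension reductions over the atom coordinates. A minimal sketch on plain tuples; the function names are illustrative, and the unweighted mean used here equals the center of mass only when every atom is given the same mass:

```python
Coord = tuple[float, float, float]


def mean_coord(coords: list[Coord]) -> Coord:
    """Unweighted mean of each dimension (the geometric centroid)."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def min_coord(coords: list[Coord]) -> Coord:
    """Smallest value seen in each dimension."""
    return tuple(min(c[i] for c in coords) for i in range(3))


def max_coord(coords: list[Coord]) -> Coord:
    """Largest value seen in each dimension."""
    return tuple(max(c[i] for c in coords) for i in range(3))


coords = [(0.0, 4.0, 6.0), (2.0, 0.0, 0.0)]
center = mean_coord(coords)
lower = min_coord(coords)
upper = max_coord(coords)
```

Note that the minimum and maximum are taken independently per dimension, so neither result has to coincide with an actual atom position.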

get_max_length(buffer: float = 0.0, pct_buffer: float = 0.0) → float

Get the length of the protein to create a volume around

Parameters:

buffer (float) – Amount of space to include around the volume, in Angstroms. Default 0
pct_buffer (float) – Amount of space to include around the volume, as a percentage of the total length in Angstroms. Default 0

Returns:

length – Max length of protein

Return type:

float

shift_coords(new_center: array | None = None, from_origin: array | None = True) → array

Shift coordinates by setting a new center of mass value or shift to the origin

Parameters:

new_center (3-tuple of floats or None) – XYZ coordinate of new center. If new_center is None, it will shift to the origin. Default is None.
from_origin (bool) – Start shift from the origin by first subtracting center of mass. Default is True.

Return type:

The new center coordinate
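One reading of these parameters, sketched on plain tuples: with from_origin set, the centroid is first moved to the origin and the structure is then translated to new_center; otherwise new_center is applied as a plain translation. The function `shift_coords_sketch` and this interpretation are assumptions, not Prop3D's implementation, and the unweighted centroid stands in for the center of mass.

```python
from typing import Optional

Coord = tuple[float, float, float]


def shift_coords_sketch(
    coords: list[Coord],
    new_center: Optional[Coord] = None,
    from_origin: bool = True,
) -> tuple[list[Coord], Coord]:
    """Translate coords; return the shifted coords and their new centroid."""
    n = len(coords)
    center = tuple(sum(c[i] for c in coords) / n for i in range(3))
    target = new_center if new_center is not None else (0.0, 0.0, 0.0)
    if from_origin:
        # Move the centroid to the origin first, then on to the target.
        offset = tuple(target[i] - center[i] for i in range(3))
    else:
        # Apply the target as a plain translation vector.
        offset = target
    shifted = [tuple(c[i] + offset[i] for i in range(3)) for c in coords]
    new_centroid = tuple(sum(c[i] for c in shifted) / n for i in range(3))
    return shifted, new_centroid


coords = [(1.0, 1.0, 1.0), (3.0, 3.0, 3.0)]
shifted, new_centroid = shift_coords_sketch(coords, new_center=(10.0, 0.0, 0.0))
centered, origin = shift_coords_sketch(coords)
```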

shift_coords_to_origin() → float

Center structure at the origin

Return type: The new center coordinate

orient_to_pai(random_flip: bool = False, flip_axis: list[float] | array = (0.2, 0.2, 0.2)) → None

Orient structure to the principal axes of inertia and optionally flip. Modified from EnzyNet

Parameters:

random_flip (bool) – Randomly flip around axis. Default is False.
flip_axis (3-tuple of floats) – New axis to flip.

rotate(rvs: array | None = None, num: int = 1, return_to: tuple[float] | array | None = None) → Iterator[tuple[int, array]]

Rotate the structure in place, either randomly or with a given rotation matrix. Random rotation matrices are drawn with SciPy from the Haar distribution (the only uniform distribution on SO(3)).

Parameters:

rvs (np.array (3x3)) – A rotation matrix. If None, a random rotation matrix is used. Default is None.
num (int) – Number of rotations to perform
return_to (XYZ coordinate) – When finished rotating, move structure to this coordinate. Default is the center of mass

Yields:

r (int) – Rotation number
M (np.array (3x3)) – Rotation matrix
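Applying a fixed rotation matrix about the structure's center can be sketched without any dependencies. For the random case, SciPy offers `scipy.stats.special_ortho_group` to draw Haar-distributed rotation matrices; the sketch below avoids that dependency and takes the matrix as input. The function name `rotate_about_center` and the use of the unweighted centroid are assumptions.

```python
import math

Coord = tuple[float, float, float]


def rotate_about_center(coords: list[Coord], matrix: list[list[float]]) -> list[Coord]:
    """Rotate coords with a 3x3 matrix about their centroid, leaving it in place."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    rotated = []
    for c in coords:
        # Express the atom relative to the centroid, rotate, then move back.
        d = [c[i] - center[i] for i in range(3)]
        r = [sum(matrix[i][j] * d[j] for j in range(3)) + center[i] for i in range(3)]
        rotated.append((r[0], r[1], r[2]))
    return rotated


theta = math.pi / 2  # 90 degrees about the z axis
rot_z = [
    [math.cos(theta), -math.sin(theta), 0.0],
    [math.sin(theta), math.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
]
coords = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
rotated = rotate_about_center(coords, rot_z)
```

Because the rotation is applied about the centroid, the centroid itself does not move, which matches the documented default of returning the structure to its center of mass.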

update_bfactors(b_factors: list[Any]) → None: Subclass to create method to update bfactors with a new set of bfactors for the same atoms

calculate_neighbors(d_cutoff: float = 100.0) → Any

Subclass to find all nearest neighbors within a given radius.

Parameters: d_cutoff (float) – Distance cutoff to find neighbors. Default is 100 Angstroms
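A distance-cutoff neighbor search can be illustrated with a brute-force scan over all atom pairs. This is a sketch for clarity only: the name `neighbor_pairs` is hypothetical, and a spatial index such as a k-d tree is the usual choice for real structures because the scan below is quadratic in the number of atoms.

```python
import math
from itertools import combinations

Coord = tuple[float, float, float]


def neighbor_pairs(coords: list[Coord], d_cutoff: float) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose atoms lie within d_cutoff."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= d_cutoff:
            pairs.append((i, j))
    return pairs


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
pairs = neighbor_pairs(coords, d_cutoff=2.0)
```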

get_vdw(element_name: str, residue: bool = False) → float: Get the van der Waals radius for an atom or residue

get_dihedral_angles(atom_or_residue: Any) → float: Get dihedral angle for atom (mapped up to residue) or the residue

get_secondary_structures_groups(verbose: bool = False, ss_min_len: int = 3, is_ob: bool = False, assume_correct: Series | None = None) → tuple[list[DataFrame], dict[tuple[str], DataFrame], dict[tuple[str], int], dict[tuple[str], str], dict[int, list[Any]], int]

Get groups of adjacent atom rows that belong to the same secondary structure. DSSP is used to assign secondary structures to each residue, mapped down to atoms. Any segment of fewer than 4 residues is merged with the previous and next groups.

Returns:

ss_groups (list of pd.DataFrames of atoms in each group)
loop_for_ss (dict of pd.DataFrames of atoms in the loop following each ss group)
original_order (dict)
ss_type (dict)
leading_trailing_residues (dict)
number_ss (int)
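The first step of such grouping, finding runs of adjacent residues that share a secondary structure label, can be sketched with `itertools.groupby`. The per-residue label string and the helper `ss_runs` are illustrative assumptions; the merging of short segments and the mapping down to atom rows performed by the actual method are not reproduced here.

```python
from itertools import groupby


def ss_runs(labels: list[str]) -> list[tuple[str, int, int]]:
    """Return (label, start, end) for each run of identical labels; end is exclusive."""
    runs = []
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, index, index + length))
        index += length
    return runs


# Simplified per-residue labels: H = helix, E = strand, - = loop.
labels = list("HHHH--EEEE")
runs = ss_runs(labels)

# Runs shorter than 4 residues are the candidates for merging.
short_runs = [run for run in runs if run[2] - run[1] < 4]
```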

remove_loops(verbose: bool = False) → None

Prop3D.common.DistributedStructure module

class Prop3D.common.DistributedStructure.DistributedStructure(path: str, key: str, cath_domain_dataset: str | None = None, coarse_grained: bool = False)

Bases: AbstractStructure

A structure class to deal with structures originating from a distributed HSDS instance.

Parameters:

path (str) – path to h5 file in HSDS endpoint to access structures
key (str) – Key to access specific protein inside the HDF file
cath_domain_dataset (str) – The CATH superfamily if endpoint is set up to use CATH (use ‘/’ instead of ‘.’)
coarse_grained (boolean) – Use a residue-only model instead of an all-atom model. Default False. Warning: not fully implemented.

deep_copy_feature(feature_name: str, memo: Any) → Any

Deep copy a specific feature

Parameters:

feature_name (str) – Feature name to copy
memo – objects to pass to deepcopy

Raises:

NotImplementedError – if no method exists to handle the feature

get_atoms(atoms: array | None = None, include_hetatms: bool = False, exclude_atoms: list[int] | None = None, include_atoms: list[int] | None = None) → Iterator[array]

Enumerate over all atoms with options to filter

Parameters:

include_hetatms (boolean) – Include hetero atoms or not. Default is False.
exclude_atoms (list) – List of atoms to skip during enumeration (depends on model if id or python object)
include_atoms (list) – List of atoms to include during enumeration (depends on model if id or python object)

get_residues() → Iterator[array]: Yields slices of the data for all atoms in single residue

unfold_entities(entity_list: array, target_level: str = 'A') → Iterator[array]

Map lower-level entities such as atoms (single row) into higher-level entities such as residues (multiple rows). Only works for atoms and chains. Adapted from BioPython

Parameters:

entity_list (list) – List of entities to unfold
target_level (str) – level to map to, eg: ‘A’ for atom, ‘R’ for residue

Yields:

Either single row atoms or multiple rows for residues
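Mapping a selection of atoms up to whole residues amounts to grouping atom rows by a residue key. The sketch below assumes each atom row is a dictionary with `chain` and `resnum` keys; the helper `unfold_to_residues` and this row layout are hypothetical and do not reflect the array layout used by DistributedStructure.

```python
from typing import Any, Iterator

Row = dict[str, Any]


def unfold_to_residues(selected: list[Row], all_rows: list[Row]) -> Iterator[list[Row]]:
    """Yield, once per residue, all atom rows of residues containing a selected atom."""
    # Group every atom row by its (chain, residue number) key.
    by_residue: dict[tuple[str, int], list[Row]] = {}
    for row in all_rows:
        by_residue.setdefault((row["chain"], row["resnum"]), []).append(row)

    seen = set()
    for row in selected:
        key = (row["chain"], row["resnum"])
        if key not in seen:
            seen.add(key)
            yield by_residue[key]


all_rows = [
    {"serial": 1, "chain": "A", "resnum": 1, "name": "N"},
    {"serial": 2, "chain": "A", "resnum": 1, "name": "CA"},
    {"serial": 3, "chain": "A", "resnum": 2, "name": "N"},
]
residues = list(unfold_to_residues([all_rows[1]], all_rows))
serials = [[row["serial"] for row in residue] for residue in residues]
```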

save_pdb(path: str | None = None, header: str | None = None, file_like: str | None = False, rewind: bool = True) → str | IO

Write PDB to file

Parameters:

path (None or str) – Path to save PDB file. If None, file_like needs to be True.
header (str or list of strs) – Header string to write to the beginning of each PDB file
file_like (boolean) – Return a StringIO object of the PDB file, do not write to disk. Default False.
rewind (boolean) – If returning a file-like object, rewind to the beginning of the file

Return type:

None or file-like object of PDB file data

get_bfactors() → array: Get bfactors for all atoms

Write features to an hdf file

Parameters:

path (str) – Path to save HDF file
key (str) – Key to save dataset inside HDF file
features (str or list of strs) – Features to write
coarse_grained (boolean) – Include features only at the residue level. Default False
name (str) – File name to write file
work_dir (None or str) – Directory to save file
force (bool) – Not used
multiple (bool) – Not used

add_features(coarse_grained: bool = False, **features: Any): Add a feature column to dataset

get_coords() → array: Get XYZ coordinates for all atoms as numpy array

get_coord(atom: int) → array

Get XYZ coordinates for an atom

Parameters: atom (int) – Serial number of atom
Return type: xyz coordinate of atom

get_elem(atom: int) → str

Get element type for an atom

Parameters: atom (int) – Serial number of atom
Return type: element name of atom

update_bfactors(b_factors: array) → None: Reset bfactors for all atoms. New numpy array must be same length as the atom array

update_coords(coords: array) → None: Subclass to create method to update XYZ coordinates with a new set of coordinates for the same atoms

calculate_neighbors(d_cutoff: float = 100.0) → Iterator[tuple[array, array]]

Calculates intermolecular contacts in a parsed struct object.

Parameters: d_cutoff (float) – Distance to find neighbors
Returns: A list of lists of nearby elements at the specified level
Return type: [(a1,b2),]

get_vdw(atom_or_residue: array) → float: Get the van der Waals radius for an atom or, if it is a residue, return an approximate volume as a sphere around all atoms in the residue

remove_loops(verbose: bool = False) → None: Remove atoms present in loop regions

Prop3D.common.DistributedVoxelizedStructure module

Prop3D.common.LocalStructure module

Prop3D.common.ProteinTables module

Useful tables of protein properties

Prop3D.common.ProteinTables.three_to_one(aa_name)

Prop3D.common.ProteinTables.to_int(row)

Prop3D.common.ProteinTables.atoms_to_aa(atoms, raise_unknown=True)
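These helpers are listed without descriptions. As an illustration of what a three-letter to one-letter conversion typically involves, the sketch below uses the standard codes for the 20 canonical amino acids. The function `three_to_one_sketch` is hypothetical, and returning `X` for unknown names is an assumption rather than documented Prop3D behavior.

```python
# Standard three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one_sketch(aa_name: str) -> str:
    """Convert a three-letter residue name to its one-letter code."""
    # Assumption: unknown or non-standard residues map to "X".
    return THREE_TO_ONE.get(aa_name.strip().upper(), "X")


sequence = "".join(three_to_one_sketch(name) for name in ["MET", "ala", "TRP", "MSE"])
```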

Prop3D.common.features module

Prop3D.common.featurizer module

Module contents

Meadowlark

A collection of scripts to process individual protein structures for use in machine learning tasks. Proteins can be:

‘Cleaned’ by adding missing residues and atoms;
Featurized with atom- and residue-based biophysical properties calculated using known structural bioinformatics tools that have been Dockerized (see Prop3D.ml);
Converted, along with their features, into sparse 3D volumes for use in sparse 3D CNNs.
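The idea of a sparse 3D volume can be sketched as a dictionary that stores only occupied voxels. The function `voxelize_sparse`, the one-voxel-per-atom mapping and the element-wise maximum used when atoms share a voxel are all illustrative assumptions; Prop3D's own voxelization is not reproduced here.

```python
import math

Coord = tuple[float, float, float]
Voxel = tuple[int, int, int]


def voxelize_sparse(
    coords: list[Coord],
    features: list[list[float]],
    voxel_size: float = 1.0,
) -> dict[Voxel, list[float]]:
    """Map each atom to an integer voxel index and keep only occupied voxels."""
    volume: dict[Voxel, list[float]] = {}
    for xyz, feat in zip(coords, features):
        index = tuple(math.floor(v / voxel_size) for v in xyz)
        if index in volume:
            # Assumption: atoms sharing a voxel keep the element-wise maximum.
            volume[index] = [max(a, b) for a, b in zip(volume[index], feat)]
        else:
            volume[index] = list(feat)
    return volume


coords = [(0.2, 0.7, 1.1), (0.9, 0.1, 1.8), (5.5, 5.5, 5.5)]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
volume = voxelize_sparse(coords, features)
```

Only two voxels are stored for the three atoms above, which is the point of a sparse representation: memory scales with the number of occupied voxels rather than with the size of the bounding grid.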