Prop3D.common package
Submodules
Prop3D.common.AbstractStructure module
- class Prop3D.common.AbstractStructure.AbstractStructure(name: str, file_mode: str = 'r', coarse_grained: bool = False)
Bases:
object
A base structure class that holds default methods for dealing with structures. Subclasses can be created to use different protein libraries, e.g. BioPython, Biotite, or our own HDF/HSDS distributed structure.
- Parameters:
name (str) – Name of protein, e.g. PDB or CATH id
file_mode (str (r, w, r+, w+, etc)) – Open file for reading or writing. Default is read-only, so no methods will affect the underlying file
coarse_grained (boolean) – Use a residue-only model instead of an all-atom model. Default False. Warning: not fully implemented.
- copy(empty: bool = False) _Self
Create a deep copy of current structure.
- Parameters:
empty (bool, deprecated) – Don’t copy features
- deep_copy_feature(feature_name: str) Any
Subclass this method to handle custom copying of specific features
- Parameters:
feature_name (str) – Feature name to copy
- Raises:
NotImplementedError – if no method handles the feature
- normalize_features(columns: str | list[str] | None = None) _Self
Normalize features using min-max scaling
- Parameters:
columns (str or list of strs) – Names of feature columns to normalize
- Return type:
A copy of this AbstractStructure with normalized features in the dataframe
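Min-max scaling maps each feature column onto the range [0, 1]. The following is a minimal sketch in plain Python rather than Prop3D's implementation: the helper name `min_max_scale` and the example columns are hypothetical, and mapping a constant column to 0.0 is an assumption, since the documentation does not say how that case is handled.

```python
def min_max_scale(values):
    """Scale a sequence of numbers onto [0, 1] (hypothetical helper, not Prop3D API)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative feature columns keyed by feature name.
features = {"charge": [-1.0, 0.0, 1.0], "hydrophobicity": [2.0, 4.0, 10.0]}
normalized = {name: min_max_scale(col) for name, col in features.items()}
```

Scaling a copy, as `normalize_features` is documented to do, leaves the original feature values available for later use.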
- get_atoms(include_hetatms: bool = False, exclude_atoms: bool | None = None, include_atoms: bool | None = None) Iterator[Any]
Subclass to enumerate protein model for all atoms with options to filter
- Parameters:
include_hetatms (boolean) – Include hetero atoms or not. Default is False.
exclude_atoms (list) – List of atoms to skip during enumeration (id or Python object, depending on the model)
include_atoms (list) – List of atoms to include during enumeration (id or Python object, depending on the model)
- filter_atoms(include_hetatms: bool = False, exclude_atoms: list[Any] | None = None, include_atoms: list[Any] | None = None) Iterator[Any]
Subclass to enumerate protein model for all atoms with options to filter
- Parameters:
include_hetatms (boolean) – Include hetero atoms or not. Default is False.
exclude_atoms (list) – List of atoms to skip during enumeration (id or Python object, depending on the model)
include_atoms (list) – List of atoms to include during enumeration (id or Python object, depending on the model)
- get_surface() Any
Returns all surface atoms, using the DSSP accessible surface value
- get_bfactors() Any
Get bfactors for all atoms
- save_pdb(path: str | None = None, header: str | list[str] | None = None, file_like: bool = False, rewind: bool = True) str | IO
Write PDB to file
- Parameters:
path (None or str) – Path to save PDB file. If None, file_like needs to be True.
header (str or list of strs) – Header string to write to the beginning of each PDB file
file_like (boolean) – Return a StringIO object of the PDB file, do not write to disk. Default False.
rewind (boolean) – If returning a file-like object, rewind to the beginning of the file
- Return type:
None or file-like object of PDB file data
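The `file_like` and `rewind` options describe returning an in-memory buffer instead of writing to disk. A minimal sketch of that pattern with the standard library follows; `lines_to_file_like` is a hypothetical helper, not Prop3D's `save_pdb`, and the lines written are placeholders rather than real structure data.

```python
import io


def lines_to_file_like(lines, header=None, rewind=True):
    """Collect PDB text in memory instead of on disk (hypothetical helper)."""
    buffer = io.StringIO()
    if header is not None:
        # Accept a single string or a list of strings, as the parameter docs describe.
        header_lines = [header] if isinstance(header, str) else header
        for h in header_lines:
            buffer.write(h.rstrip("\n") + "\n")
    for line in lines:
        buffer.write(line.rstrip("\n") + "\n")
    if rewind:
        buffer.seek(0)  # move back to the start so the caller can read immediately
    return buffer


buf = lines_to_file_like(["END"], header="REMARK example")
text = buf.read()
```

Without the rewind, the buffer's position sits at the end of the written text and a plain `read()` returns an empty string.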
- write_features(features: str | list[str] | None, coarse_grained: bool = False, name: str | None = None, work_dir: str | None = None) None
Subclass to write features to a specific file, depending on the protein loading class, e.g. HDF
- Parameters:
features (str or list of strs) – Features to write
coarse_grained (boolean) – Include features only at the residue level. Default False
name (str) – File name to write file
work_dir (None or str) – Directory to save file
- write_features_to_pdb(features_to_use: str | list[str] | None, name: str | None = None, coarse_grain: bool = False, work_dir: str | None = None, other: Any | None = None)
Write features to PDB files in the bfactor column. One feature per PDB file.
- Parameters:
features_to_use (str or list of strs) – Features to write
name (None or str) – File name to write file
coarse_grain (boolean) – Include features only at the residue level. Default False
work_dir (None or str) – Directory to save files.
other (obj) – Ignored. May be useful for subclasses.
- Return type:
File names of each written PDB file for each feature
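Writing a feature into the bfactor column relies on the fixed-width PDB layout, where the temperature factor occupies columns 61 to 66 of an ATOM or HETATM record. The sketch below shows only that column replacement; `set_bfactor` and the example values are hypothetical and do not reproduce how Prop3D writes its files.

```python
def set_bfactor(atom_line, value):
    """Return a PDB ATOM/HETATM line with columns 61-66 replaced by `value`."""
    padded = atom_line.rstrip("\n").ljust(66)  # make sure the column exists
    return padded[:60] + f"{value:6.2f}" + padded[66:]


# Build a fixed-width ATOM record field by field so the columns are exact.
atom_line = (
    "ATOM  " + f"{1:5d}" + " " + " N  " + " " + "MET" + " " + "A" + f"{1:4d}"
    + " " + "   " + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
    + f"{1.00:6.2f}{9.67:6.2f}"
)
feature_values = {"charge": 0.5}  # illustrative per-atom feature value
updated = set_bfactor(atom_line, feature_values["charge"])
```

The field is six characters wide with two decimals, so values outside roughly -99.99 to 999.99 would overflow the column and need rescaling first.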
- add_features(coarse_grained: bool = False, **features)
Subclass to add a feature column to dataset
- get_coords(include_hetatms: bool = False, exclude_atoms: list[Any] | None = None) Any
Subclass to return all XYZ coordinates for each atom
- Parameters:
include_hetatms (bool) – Include heteroatoms or not. Default is False.
exclude_atoms (list of atoms) – Atoms to exclude while getting coordinates
- Return type:
XYZ coordinates from each atom in a specified format of subclass
- get_sequence() str
Get amino acid sequence from structure
- update_coords(coords: array) None
Subclass to create a method that updates XYZ coordinates with a new set of coordinates for the same atoms
- get_mean_coord() array
Get the mean XYZ coordinate or center of mass.
- get_max_coord() array
Get the maximum coordinate in each dimension
- get_min_coord() array
Get the minimum coordinate in each dimension
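The mean, maximum and minimum coordinates are all per-axis reductions over the atom coordinates. A small sketch in plain Python, using made-up coordinates, follows. It takes an unweighted mean, which coincides with the center of mass only if every atom is given the same mass; whether Prop3D weights by mass is not stated here.

```python
coords = [(0.0, 1.0, 2.0), (2.0, 3.0, 4.0), (4.0, 5.0, 0.0)]  # illustrative XYZ rows

# zip(*coords) transposes rows of (x, y, z) into one tuple per axis.
axes = list(zip(*coords))
mean_coord = tuple(sum(axis) / len(axis) for axis in axes)
max_coord = tuple(max(axis) for axis in axes)
min_coord = tuple(min(axis) for axis in axes)
```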
- get_max_length(buffer: float = 0.0, pct_buffer: float = 0.0) float
Get the length of the protein to create a volume around
- Parameters:
buffer (float) – Amount of space to include around the volume, in Angstroms. Default 0
pct_buffer (float) – Amount of space to include around the volume, as a percentage of the total length in Angstroms. Default 0
- Returns:
length – Max length of protein
- Return type:
float
- shift_coords(new_center: array | None = None, from_origin: array | None = True) array
Shift coordinates by setting a new center of mass value or shift to the origin
- Parameters:
new_center (3-tuple of floats or None) – XYZ coordinate of the new center. If new_center is None, the structure is shifted to the origin. Default is None.
from_origin (bool) – Start the shift from the origin by first subtracting the center of mass. Default is True.
- Return type:
The new center coordinate
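The shift described above amounts to a single translation. The sketch below is one reading of the documented parameters, not Prop3D's code: `shift_coords_sketch` is a hypothetical name, the unweighted mean stands in for the center of mass, and treating `new_center` as a plain offset when `from_origin` is False is an assumption.

```python
def shift_coords_sketch(coords, new_center=None, from_origin=True):
    """Translate coordinates and return (shifted coords, resulting mean)."""
    n = len(coords)
    center = tuple(sum(c[i] for c in coords) / n for i in range(3))
    target = (0.0, 0.0, 0.0) if new_center is None else tuple(new_center)
    # Assumption: with from_origin=True the mean is subtracted first, so the
    # structure ends up centred on `target`; otherwise `target` is used as an offset.
    offset = tuple(t - c for t, c in zip(target, center)) if from_origin else target
    shifted = [tuple(x + o for x, o in zip(c, offset)) for c in coords]
    new_mean = tuple(sum(c[i] for c in shifted) / n for i in range(3))
    return shifted, new_mean


coords = [(1.0, 1.0, 1.0), (3.0, 3.0, 3.0)]
at_origin, origin_mean = shift_coords_sketch(coords)
moved, moved_mean = shift_coords_sketch(coords, new_center=(10.0, 0.0, 0.0))
```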
- shift_coords_to_origin() float
Center structure at the origin
- Return type:
The new center coordinate
- orient_to_pai(random_flip: bool = False, flip_axis: list[float] | array = (0.2, 0.2, 0.2)) None
Orient structure to the Principal Axes of Inertia and optionally flip. Modified from EnzyNet.
- Parameters:
random_flip (bool) – Randomly flip around axis. Default is False.
flip_axis (3-tuple of floats) – New axis to flip.
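Orienting to the principal axes of inertia means rotating the centered coordinates into the eigenbasis of the inertia tensor. The sketch below uses NumPy and assumes unit masses for every atom; `orient_to_principal_axes` is a hypothetical function, it omits the optional flip, and it is not the EnzyNet-derived code that Prop3D uses.

```python
import numpy as np


def orient_to_principal_axes(coords):
    """Rotate centred coordinates onto their principal axes of inertia (unit masses)."""
    coords = np.asarray(coords, dtype=float)
    centred = coords - coords.mean(axis=0)
    # Inertia tensor for unit masses: sum over atoms of (|r|^2 * I - r r^T).
    inertia = (centred ** 2).sum() * np.eye(3) - centred.T @ centred
    _, axes = np.linalg.eigh(inertia)  # columns are orthonormal eigenvectors
    if np.linalg.det(axes) < 0:
        axes[:, 0] *= -1  # keep a proper rotation rather than a reflection
    return centred @ axes


rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))  # illustrative coordinates
oriented = orient_to_principal_axes(pts)
```

After the rotation the inertia tensor of `oriented` is diagonal, which is a convenient check that the axes are aligned.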
- rotate(rvs: array | None = None, num: int = 1, return_to: tuple[float] | array | None = None) Iterator[tuple[int, array]]
Rotate the structure in place, either randomly or with a given rotation matrix. Random rotation matrices are drawn with scipy from the Haar distribution (the only uniform distribution on SO(3)).
- Parameters:
rvs (np.array (3x3)) – A rotation matrix. If None, a random rotation matrix is used. Default is None.
num (int) – Number of rotations to perform
return_to (XYZ coordinate) – When finished rotating, move structure to this coordinate. Default is the center of mass
- Yields:
r (int) – Rotation number
M (np.array (3x3)) – Rotation matrix
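SciPy exposes Haar-distributed rotation matrices through `scipy.stats.special_ortho_group`. The sketch below shows how such matrices could be drawn and applied about the centroid. The generator `random_rotations`, its `seed` argument, and yielding the rotated coordinates alongside the matrix are assumptions for illustration; the documented `rotate` method yields only the rotation number and matrix.

```python
import numpy as np
from scipy.stats import special_ortho_group


def random_rotations(coords, num=1, seed=None):
    """Yield (index, matrix, rotated coords) for Haar-distributed rotations."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)
    rng = np.random.RandomState(seed)
    for r in range(num):
        M = special_ortho_group.rvs(3, random_state=rng)  # 3x3 matrix, det = +1
        # Rotate about the centroid so the structure stays where it is.
        rotated = (coords - center) @ M.T + center
        yield r, M, rotated


pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
results = list(random_rotations(pts, num=2, seed=0))
```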
- update_bfactors(b_factors: list[Any]) None
Subclass to create a method that updates bfactors with a new set of bfactors for the same atoms
- calculate_neighbors(d_cutoff: float = 100.0) Any
Subclass to find all nearest neighbors within a given radius.
- Parameters:
d_cutoff (float) – Distance cutoff to find neighbors. Default is 100 Angstroms
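A radius-based neighbor search returns every pair of atoms closer than the cutoff. The brute-force sketch below illustrates the idea in plain Python; `neighbor_pairs` is a hypothetical helper, and subclasses may use a different method and return format.

```python
import math
from itertools import combinations


def neighbor_pairs(coords, d_cutoff=100.0):
    """Yield index pairs (i, j), i < j, whose distance is within d_cutoff."""
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= d_cutoff:
            yield i, j


coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (50.0, 0.0, 0.0)]  # illustrative atoms
pairs = list(neighbor_pairs(coords, d_cutoff=6.0))
```

This loop is quadratic in the number of atoms, so a spatial index such as a KD-tree is the usual choice for whole proteins.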
- get_vdw(element_name: str, residue: bool = False) float
Get the van der Waals radius for an atom or residue
- get_dihedral_angles(atom_or_residue: Any) float
Get dihedral angle for atom (mapped up to residue) or the residue
- get_secondary_structures_groups(verbose: bool = False, ss_min_len: int = 3, is_ob: bool = False, assume_correct: Series | None = None) tuple[list[DataFrame], dict[tuple[str], DataFrame], dict[tuple[str], int], dict[tuple[str], str], dict[int, list[Any]], int]
Get groups of adjacent atom rows that belong to the same secondary structure. DSSP is used to assign a secondary structure to each residue, mapped down to atoms. Any segment shorter than 4 residues is merged with the previous and next groups.
- Returns:
ss_groups (list of pd.DataFrames of atoms in each group)
loop_for_ss (dict of pd.DataFrames of atoms in the loop following each ss group)
original_order (dict)
ss_type (dict)
leading_trailing_residues (dict)
number_ss (int)
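Grouping by secondary structure comes down to finding runs of consecutive residues with the same label and folding short runs into their neighbors. The sketch below works on a plain list of per-residue labels rather than DSSP output. `group_secondary_structure` is a hypothetical helper, and its merge rule (a short run is absorbed into the preceding run only) is a simplification of the behaviour described above.

```python
from itertools import groupby


def group_secondary_structure(labels, ss_min_len=3):
    """Group consecutive identical labels into (label, start, end) runs, end exclusive."""
    runs, start = [], 0
    for label, items in groupby(labels):
        length = len(list(items))
        runs.append([label, start, start + length])
        start += length
    merged = []
    for label, begin, end in runs:
        if merged and (end - begin < ss_min_len or merged[-1][0] == label):
            merged[-1][2] = end  # absorb this run into the previous one
        else:
            merged.append([label, begin, end])
    return [tuple(run) for run in merged]


labels = list("HHHHHTEEEEE")  # illustrative labels: H = helix, T = turn, E = strand
groups = group_secondary_structure(labels)
```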
- remove_loops(verbose: bool = False) None
Prop3D.common.DistributedStructure module
- class Prop3D.common.DistributedStructure.DistributedStructure(path: str, key: str, cath_domain_dataset: str | None = None, coarse_grained: bool = False)
Bases:
AbstractStructure
A structure class to deal with structures originating from a distributed HSDS instance.
- Parameters:
path (str) – path to h5 file in HSDS endpoint to access structures
key (str) – Key to access a specific protein inside the HDF file
cath_domain_dataset (str) – The CATH superfamily if endpoint is setup to use CATH (use ‘/’ instead of ‘.’)
coarse_grained (boolean) – Use a residue-only model instead of an all-atom model. Default False. Warning: not fully implemented.
- deep_copy_feature(feature_name: str, memo: Any) Any
Deep copy a specific feature
- Parameters:
feature_name (str) – Feature name to copy
memo – objects to pass to deepcopy
- Raises:
NotImplementedError – if no method handles the feature
- get_atoms(atoms: array | None = None, include_hetatms: bool = False, exclude_atoms: list[int] | None = None, include_atoms: list[int] | None = None) Iterator[array]
Enumerate over all atoms with options to filter
- Parameters:
include_hetatms (boolean) – Include hetero atoms or not. Default is False.
exclude_atoms (list) – List of atoms to skip during enumeration (id or Python object, depending on the model)
include_atoms (list) – List of atoms to include during enumeration (id or Python object, depending on the model)
- get_residues() Iterator[array]
Yields slices of the data for all atoms in a single residue
- unfold_entities(entity_list: array, target_level: str = 'A') Iterator[array]
Map lower-level entities such as atoms (single row) into higher-level entities such as residues (multiple rows). Only works for atoms and chains. Adapted from BioPython.
- Parameters:
entity_list (list) – List of entities to unfold
target_level (str) – Level to map to, e.g. ‘A’ for atom, ‘R’ for residue
- Yields:
Either single row atoms or multiple rows for residues
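Mapping atoms up to residues can be pictured as collecting adjacent atom rows that share a residue id. The sketch below uses plain tuples in place of the array rows held by the structure; the row layout and the helper `unfold_to_residues` are hypothetical and only illustrate the one-row versus many-rows distinction.

```python
from itertools import groupby

# Illustrative atom rows: (serial_number, atom_name, residue_id).
atoms = [
    (1, "N", 10), (2, "CA", 10), (3, "C", 10),
    (4, "N", 11), (5, "CA", 11),
]


def unfold_to_residues(atom_rows):
    """Yield (residue_id, rows) for each run of atoms sharing a residue id."""
    # groupby only merges adjacent rows, so rows must already be ordered by residue.
    for residue_id, rows in groupby(atom_rows, key=lambda row: row[2]):
        yield residue_id, list(rows)


residues = list(unfold_to_residues(atoms))
```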
- save_pdb(path: str | None = None, header: str | None = None, file_like: str | None = False, rewind: bool = True) str | IO
Write PDB to file
- Parameters:
path (None or str) – Path to save PDB file. If None, file_like needs to be True.
header (str or list of strs) – Header string to write to the beginning of each PDB file
file_like (boolean) – Return a StringIO object of the PDB file, do not write to disk. Default False.
rewind (boolean) – If returning a file-like object, rewind to the beginning of the file
- Return type:
None or file-like object of PDB file data
- get_bfactors() array
Get bfactors for all atoms
- write_features(path: str | None = None, key: str | None = None, features=typing.Union[str, list[str], NoneType], coarse_grained: bool = False, name: str | None = None, work_dir: str | None = None, force: bool | int | None = None, multiple: bool = False) None
Write features to an hdf file
- Parameters:
path (str) – Path to save HDF file
key (str) – Key to save dataset inside HDF file
features (str or list of strs) – Features to write
coarse_grained (boolean) – Include features only at the residue level. Default False
name (str) – File name to write file
work_dir (None or str) – Directory to save file
force (bool) – Not used
multiple (bool) – Not used
- add_features(coarse_grained: bool = False, **features: Any)
Add a feature column to dataset
- get_coords() array
Get XYZ coordinates for all atoms as numpy array
- get_coord(atom: int) array
Get XYZ coordinates for an atom
- Parameters:
atom (int) – Serial number of atom
- Return type:
xyz coordinate of atom
- get_elem(atom: int) str
Get element type for an atom
- Parameters:
atom (int) – Serial number of atom
- Return type:
element name of atom
- update_bfactors(b_factors: array) None
Reset bfactors for all atoms. New numpy array must be same length as the atom array
- update_coords(coords: array) None
Subclass to create a method that updates XYZ coordinates with a new set of coordinates for the same atoms
- calculate_neighbors(d_cutoff: float = 100.0) Iterator[tuple[array, array]]
Calculates intermolecular contacts in a parsed struct object.
- Parameters:
d_cutoff (float) – Distance to find neighbors
- Returns:
A list of lists of nearby elements at the specified level
- Return type:
[(a1,b2),]
- get_vdw(atom_or_residue: array) float
Get the van der Waals radius for an atom or, if it is a residue, return an approximate volume as a sphere around all atoms in the residue
- remove_loops(verbose: bool = False) None
Remove atoms present in loop regions
Prop3D.common.DistributedVoxelizedStructure module
Prop3D.common.LocalStructure module
Prop3D.common.ProteinTables module
Useful tables of protein properties
- Prop3D.common.ProteinTables.three_to_one(aa_name)
- Prop3D.common.ProteinTables.to_int(row)
- Prop3D.common.ProteinTables.atoms_to_aa(atoms, raise_unknown=True)
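These helpers are undocumented here, but `three_to_one` by its name converts three-letter residue names to one-letter codes. A sketch of such a lookup for the 20 standard amino acids follows. The table is standard; the function `three_to_one_sketch` and its choice of returning "X" for unknown names are assumptions, since the behaviour of the Prop3D function is not described.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one_sketch(aa_name, unknown="X"):
    """Map a three-letter residue name to its one-letter code (hypothetical helper)."""
    # Assumption: names outside the 20 standard residues map to `unknown`.
    return THREE_TO_ONE.get(aa_name.strip().upper(), unknown)


sequence = "".join(three_to_one_sketch(name) for name in ["MET", "Lys", "SEC"])
```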
Prop3D.common.features module
Prop3D.common.featurizer module
Module contents
Meadowlark
A collection of scripts to process individual protein structures for use in machine learning tasks. Proteins can be:
‘Cleaned’ by adding missing residues and atoms;
Featurized with atom- and residue-based biophysical properties calculated using known structural bioinformatics tools that have been Dockerized (see Prop3D.ml);
Converted, along with their features, into sparse 3D volumes for use in sparse 3D CNNs.