Prop3D.common package

Submodules

Prop3D.common.AbstractStructure module

class Prop3D.common.AbstractStructure.AbstractStructure(name: str, file_mode: str = 'r', coarse_grained: bool = False)

Bases: object

A base structure class that holds default methods for dealing with structures. Subclasses can be created to use different protein libraries, e.g. BioPython, Biotite, or our own HDF/HSDS distributed structure.

Parameters:
  • name (str) – Name of protein, e.g. PDB or CATH ID

  • file_mode (str (r, w, r+, w+, etc)) – Mode used to open the file for reading or writing. Default is read-only ('r'), in which case no methods will affect the underlying file.

  • coarse_grained (boolean) – Use a residue-only model instead of an all-atom model. Default False. Warning: not fully implemented.

copy(empty: bool = False) _Self

Create a deep copy of current structure.

Parameters:

empty (bool, deprecated) – Don’t copy features

deep_copy_feature(feature_name: str) Any

Subclass this method to handle custom copying of specific features

Parameters:

feature_name (str) – Feature name to copy

Raises:

NotImplementedError – If no method exists to handle the feature

normalize_features(columns: str | list[str] | None = None) _Self

Normalize features using min-max scaling

Parameters:

columns (str or list of strs) – Names of feature columns to normalize

Return type:

A copy of this AbstractStructure with normalized features in the dataframe
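As a rough illustration of min-max scaling (not Prop3D's own implementation), the sketch below rescales each column of a NumPy array to the range [0, 1]. The helper name min_max_scale and the handling of constant columns are assumptions made for this example.

```python
import numpy as np

def min_max_scale(values: np.ndarray) -> np.ndarray:
    """Rescale each column to [0, 1]. Hypothetical helper, not part of Prop3D."""
    values = np.asarray(values, dtype=float)
    lo = values.min(axis=0)
    span = values.max(axis=0) - lo
    # Assumption: constant columns map to 0 instead of dividing by zero.
    span = np.where(span == 0, 1.0, span)
    return (values - lo) / span

# Two feature columns on very different scales.
features = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
scaled = min_max_scale(features)
```

Each column is scaled independently, so features measured in different units end up on a comparable scale.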

get_atoms(include_hetatms: bool = False, exclude_atoms: bool | None = None, include_atoms: bool | None = None) Iterator[Any]

Subclass to enumerate protein model for all atoms with options to filter

Parameters:
  • include_hetatms (boolean) – Include hetero atoms or not. Default is False.

  • exclude_atoms (list) – List of atoms to skip during enumeration (IDs or Python objects, depending on the model)

  • include_atoms (list) – List of atoms to include during enumeration (IDs or Python objects, depending on the model)

filter_atoms(include_hetatms: bool = False, exclude_atoms: list[Any] | None = None, include_atoms: list[Any] | None = None) Iterator[Any]

Subclass to enumerate protein model for all atoms with options to filter

Parameters:
  • include_hetatms (boolean) – Include hetero atoms or not. Default is False.

  • exclude_atoms (list) – List of atoms to skip during enumeration (IDs or Python objects, depending on the model)

  • include_atoms (list) – List of atoms to include during enumeration (IDs or Python objects, depending on the model)
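A subclass might combine the three filters roughly as follows. This is a minimal sketch that assumes atoms are plain dicts with hypothetical "serial" and "hetatm" keys; real subclasses work with their own atom representation, and iter_filtered_atoms is not a Prop3D function.

```python
from typing import Any, Iterable, Iterator, Optional

def iter_filtered_atoms(
    atoms: Iterable[dict],
    include_hetatms: bool = False,
    exclude_atoms: Optional[list] = None,
    include_atoms: Optional[list] = None,
) -> Iterator[dict]:
    """Yield atoms that pass all filters. Hypothetical helper for illustration."""
    for atom in atoms:
        if atom["hetatm"] and not include_hetatms:
            continue  # skip hetero atoms unless explicitly requested
        if exclude_atoms is not None and atom["serial"] in exclude_atoms:
            continue  # explicitly excluded
        if include_atoms is not None and atom["serial"] not in include_atoms:
            continue  # an include list restricts output to its members
        yield atom

atoms = [
    {"serial": 1, "hetatm": False},
    {"serial": 2, "hetatm": False},
    {"serial": 3, "hetatm": True},
]
kept = [a["serial"] for a in iter_filtered_atoms(atoms, exclude_atoms=[2])]
```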

get_surface() Any

Returns all surface atoms, using the DSSP accessible surface value

get_bfactors() Any

Get bfactors for all atoms

save_pdb(path: str | None = None, header: str | list[str] | None = None, file_like: bool = False, rewind: bool = True) str | IO

Write PDB to file

Parameters:
  • path (None or str) – Path to save PDB file. If None, file_like needs to be True.

  • header (str or list of strs) – Header string to write to the beginning of each PDB file

  • file_like (boolean) – Return a StringIO object of the PDB file, do not write to disk. Default False.

  • rewind (boolean) – If returning a file-like object, rewind to the beginning of the file

Return type:

None or file-like object of PDB file data
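The interplay of path, file_like and rewind can be sketched with the standard library alone. The function save_lines below is a hypothetical stand-in written for this example, not the Prop3D method; it only shows why rewinding matters when a StringIO object is handed back to the caller.

```python
import io
from typing import Iterable, Optional, Union

def save_lines(
    lines: Iterable[str],
    path: Optional[str] = None,
    file_like: bool = False,
    rewind: bool = True,
) -> Union[str, io.StringIO]:
    """Write lines to disk, or return them in an in-memory buffer (sketch only)."""
    text = "\n".join(lines) + "\n"
    if file_like:
        handle = io.StringIO()
        handle.write(text)
        if rewind:
            handle.seek(0)  # move the cursor back so the caller can read from the start
        return handle
    if path is None:
        raise ValueError("path is required unless file_like is True")
    with open(path, "w") as fh:
        fh.write(text)
    return path

buffer = save_lines(["HEADER    EXAMPLE", "END"], file_like=True)
not_rewound = save_lines(["HEADER    EXAMPLE", "END"], file_like=True, rewind=False)
```

Without the rewind, the cursor sits at the end of the buffer, so an immediate read() returns an empty string even though the data is present.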

write_features(features: str | list[str] | None, coarse_grained: bool = False, name: str | None = None, work_dir: str | None = None) None

Subclass to write features to a specific file, depending on the protein loading class, e.g. HDF

Parameters:
  • features (str or list of strs) – Features to write

  • coarse_grained (boolean) – Include features only at the residue level. Default False

  • name (str) – File name to write file

  • work_dir (None or str) – Directory to save file

write_features_to_pdb(features_to_use: str | list[str] | None, name: str | None = None, coarse_grain: bool = False, work_dir: str | None = None, other: Any | None = None)

Write features to PDB files in the bfactor column. One feature per PDB file.

Parameters:
  • features_to_use (str or list of strs) – Features to write

  • name (None or str) – File name to write file

  • coarse_grain (boolean) – Include features only at the residue level. Default False

  • work_dir (None or str) – Directory to save files.

  • other (obj) – Ignored. May be useful for subclasses.

Return type:

File names of each written PDB file for each feature
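For readers unfamiliar with the PDB layout: in ATOM and HETATM records the temperature factor (B-factor) occupies columns 61-66, formatted as a 6.2f float. The sketch below builds one fixed-width ATOM line and overwrites that field with a feature value. Both the sample atom and the helper set_bfactor are illustrative assumptions, not Prop3D code.

```python
# Build a fixed-width ATOM record up to and including the B-factor field (66 columns).
atom_line = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
).format("ATOM", 1, " N", "", "MET", "A", 1, "", 27.340, 24.430, 2.614, 1.00, 9.67)

def set_bfactor(line: str, value: float) -> str:
    """Return the record with columns 61-66 replaced by `value` (hypothetical helper)."""
    return line[:60] + f"{value:6.2f}" + line[66:]

# Store a (for example, normalized) feature value in place of the original B-factor.
feature_line = set_bfactor(atom_line, 0.5)
```

Because the field is only six characters wide, features are easiest to store this way after scaling them to a small range.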

add_features(coarse_grained: bool = False, **features)

Subclass to add a feature column to dataset

get_coords(include_hetatms: bool = False, exclude_atoms: list[Any] | None = None) Any

Subclass to return all XYZ coordinates for each atom

Parameters:
  • include_hetatms (bool) – Include heteroatoms or not. Default is False.

  • exclude_atoms (list of atoms) – Atoms to exclude while getting coordinates

Return type:

XYZ coordinates from each atom in a specified format of subclass

get_sequence() str

Get amino acid sequence from structure

update_coords(coords: array) None

Subclass to create a method that updates XYZ coordinates with a new set of coordinates for the same atoms

get_mean_coord() array

Get the mean XYZ coordinate or center of mass.

get_max_coord() array

Get the maximum coordinate in each dimension

get_min_coord() array

Get the minimum coordinate in each dimension
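With coordinates stored as an N x 3 NumPy array (an assumption about the representation), the three summary values are per-axis reductions. Note that the plain mean is the geometric centroid; it coincides with the center of mass only if all atoms are given equal mass.

```python
import numpy as np

# Three atoms, one row per atom, columns are X, Y, Z.
coords = np.array([[0.0, 0.0, 0.0], [2.0, 4.0, 6.0], [4.0, 2.0, 0.0]])

mean_coord = coords.mean(axis=0)  # centroid: [2, 2, 2]
max_coord = coords.max(axis=0)    # per-axis maximum: [4, 4, 6]
min_coord = coords.min(axis=0)    # per-axis minimum: [0, 0, 0]
```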

get_max_length(buffer: float = 0.0, pct_buffer: float = 0.0) float

Get the length of the protein to create a volume around

Parameters:
  • buffer (float) – Amount of space to include around the volume, in Angstroms. Default 0

  • pct_buffer (float) – Amount of space to include around the volume, as a percentage of the total length in Angstroms. Default 0

Returns:

length – Max length of protein

Return type:

float
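One plausible reading of these parameters is sketched below. It assumes the length is the largest per-axis extent, that buffer is added once, and that pct_buffer is a fraction (0.5 meaning 50%); none of these details are confirmed by the documentation above, and max_length_sketch is a hypothetical name.

```python
import numpy as np

def max_length_sketch(coords: np.ndarray, buffer: float = 0.0, pct_buffer: float = 0.0) -> float:
    """Longest bounding-box edge plus optional padding (assumed semantics)."""
    extent = coords.max(axis=0) - coords.min(axis=0)
    length = float(extent.max())
    # Assumption: both paddings are simply added to the longest extent.
    return length + buffer + length * pct_buffer

coords = np.array([[0.0, 0.0, 0.0], [2.0, 4.0, 6.0], [4.0, 2.0, 0.0]])
length = max_length_sketch(coords)                               # 6.0
padded = max_length_sketch(coords, buffer=2.0, pct_buffer=0.5)   # 6 + 2 + 3 = 11.0
```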

shift_coords(new_center: array | None = None, from_origin: array | None = True) array

Shift coordinates by setting a new center of mass value or shift to the origin

Parameters:
  • new_center (3-tuple of floats or None) – XYZ coordinate of the new center. If new_center is None, the structure is shifted to the origin. Default is None.

  • from_origin (bool) – Start the shift from the origin by first subtracting the center of mass. Default is True.

Return type:

The new center coordinate
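The shift itself is a translation. A minimal sketch, assuming an N x 3 NumPy array and an unweighted centroid as the center (the function name and the choice to return the shifted array rather than the new center are assumptions for illustration):

```python
import numpy as np
from typing import Optional, Sequence

def shifted_coords(
    coords: np.ndarray,
    new_center: Optional[Sequence[float]] = None,
    from_origin: bool = True,
) -> np.ndarray:
    """Translate coordinates so their centroid lands on new_center (sketch only)."""
    coords = np.asarray(coords, dtype=float)
    target = np.zeros(3) if new_center is None else np.asarray(new_center, dtype=float)
    if from_origin:
        # First move the centroid to the origin, then apply the offset.
        coords = coords - coords.mean(axis=0)
    return coords + target

coords = np.array([[0.0, 0.0, 0.0], [2.0, 4.0, 6.0], [4.0, 2.0, 0.0]])
at_origin = shifted_coords(coords)
recentered = shifted_coords(coords, new_center=(10.0, 10.0, 10.0))
```

With from_origin=False the sketch applies new_center as a plain offset without centering first.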

shift_coords_to_origin() float

Center structure at the origin

Return type:

The new center coordinate

orient_to_pai(random_flip: bool = False, flip_axis: list[float] | array = (0.2, 0.2, 0.2)) None

Orient structure to its principal axes of inertia and optionally flip. Modified from EnzyNet.

Parameters:
  • random_flip (bool) – Randomly flip around axis. Default is False.

  • flip_axis (3-tuple of floats) – New axis to flip.
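The core of a principal-axes alignment can be sketched with NumPy as below. This is not the EnzyNet-derived code used by Prop3D: it treats every atom as a unit mass, omits the optional flip, and uses hypothetical helper names.

```python
import numpy as np

def inertia_tensor(coords: np.ndarray) -> np.ndarray:
    """Inertia tensor about the centroid, treating every atom as a unit mass."""
    r = coords - coords.mean(axis=0)
    # sum over atoms of (|r|^2 * identity - r r^T)
    return (r ** 2).sum() * np.eye(3) - r.T @ r

def orient_to_principal_axes(coords: np.ndarray) -> np.ndarray:
    """Center the coordinates and express them in the principal-axes frame."""
    centered = coords - coords.mean(axis=0)
    # eigh returns ascending eigenvalues; columns of `axes` are the principal axes.
    _, axes = np.linalg.eigh(inertia_tensor(coords))
    if np.linalg.det(axes) < 0:
        axes[:, 0] *= -1  # keep a proper rotation (no reflection)
    return centered @ axes

rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 3)) * np.array([10.0, 3.0, 1.0])
oriented = orient_to_principal_axes(coords)
```

After the transform the inertia tensor is diagonal, and because the smallest moment of inertia comes first, the longest direction of the structure lies along X.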

rotate(rvs: array | None = None, num: int = 1, return_to: tuple[float] | array | None = None) Iterator[tuple[int, array]]

Rotate the structure in place, either randomly or with a given rotation matrix. Random rotation matrices are drawn from the Haar distribution (the only uniform distribution on SO(3)) using scipy.

Parameters:
  • rvs (np.array (3x3)) – A rotation matrix. If None, a random rotation matrix is used. Default is None.

  • num (int) – Number of rotations to perform

  • return_to (XYZ coordinate) – When finished rotating, move structure to this coordinate. Default is the center of mass.

Yields:
  • r (int) – Rotation number

  • M (np.array (3x3)) – Rotation matrix
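SciPy exposes Haar-distributed rotations through scipy.stats.special_ortho_group. The generator below is a minimal sketch of the yield pattern described above, assuming an N x 3 NumPy array and rotation about the centroid; random_rotations and its extra yielded value (the rotated coordinates) are assumptions for illustration, not the Prop3D method.

```python
import numpy as np
from scipy.stats import special_ortho_group

def random_rotations(coords: np.ndarray, num: int = 1, seed=None):
    """Yield (rotation number, 3x3 matrix, rotated coordinates). Sketch only."""
    rng = np.random.RandomState(seed)
    center = coords.mean(axis=0)
    for r in range(num):
        # Draw a rotation matrix uniformly (Haar measure) from SO(3).
        M = special_ortho_group.rvs(3, random_state=rng)
        # Rotate about the centroid, then move back to the original center.
        rotated = (coords - center) @ M.T + center
        yield r, M, rotated

coords = np.array([[0.0, 0.0, 0.0], [2.0, 4.0, 6.0], [4.0, 2.0, 0.0]])
results = list(random_rotations(coords, num=3, seed=0))
```

Every matrix drawn this way is orthogonal with determinant +1, so interatomic distances are unchanged by the rotation.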

update_bfactors(b_factors: list[Any]) None

Subclass to create a method that updates bfactors with a new set of bfactors for the same atoms

calculate_neighbors(d_cutoff: float = 100.0) Any

Subclass to find all nearest neighbors within a given radius.

Parameters:

d_cutoff (float) – Distance cutoff to find neighbors. Default is 100 Angstroms
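A radius search of this kind can be sketched by brute force with NumPy. The helper neighbor_pairs is hypothetical and returns index pairs; how a subclass actually represents atoms and neighbors is up to that subclass.

```python
import numpy as np

def neighbor_pairs(coords: np.ndarray, d_cutoff: float) -> list:
    """Return index pairs (i, j), i < j, of atoms within d_cutoff of each other."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep the upper triangle only so each pair appears once and i != j.
    i, j = np.nonzero(np.triu(dist <= d_cutoff, k=1))
    return list(zip(i.tolist(), j.tolist()))

coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
close_pairs = neighbor_pairs(coords, 1.5)
wider_pairs = neighbor_pairs(coords, 4.5)
```

The all-pairs distance matrix needs memory quadratic in the number of atoms; for large structures a k-d tree (for example scipy.spatial.cKDTree.query_pairs) is the usual alternative.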

get_vdw(element_name: str, residue: bool = False) float

Get the van der Waals radius for an atom or residue

get_dihedral_angles(atom_or_residue: Any) float

Get dihedral angle for an atom (mapped up to its residue) or for the residue

get_secondary_structures_groups(verbose: bool = False, ss_min_len: int = 3, is_ob: bool = False, assume_correct: Series | None = None) tuple[list[DataFrame], dict[tuple[str], DataFrame], dict[tuple[str], int], dict[tuple[str], str], dict[int, list[Any]], int]

Get groups of adjacent atom rows that belong to the same secondary structure. DSSP is used to assign a secondary structure to each residue, which is then mapped down to its atoms. Any segment shorter than 4 residues is merged with the previous and next groups.

Returns:

  • ss_groups (list of pd.DataFrames of atoms in each group)

  • loop_for_ss (dict of pd.DataFrames of atoms in the loop following each ss group)

  • original_order (dict)

  • ss_type (dict)

  • leading_trailing_residues (dict)

  • number_ss (int)
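The grouping idea can be illustrated on a plain string of per-residue secondary structure labels. The sketch below is a simplification under stated assumptions: it works on residues rather than atom rows, folds each short run into the preceding group only, and joins neighbors that end up with the same label. The function names are hypothetical and the exact merge rule in Prop3D may differ.

```python
from itertools import groupby

def group_runs(labels):
    """Return (label, start, stop) for each run of identical adjacent labels."""
    runs, start = [], 0
    for label, members in groupby(labels):
        n = len(list(members))
        runs.append((label, start, start + n))
        start += n
    return runs

def merge_short_runs(runs, min_len=4):
    """Fold runs shorter than min_len into the previous group (simplified rule)."""
    merged = []
    for label, start, stop in runs:
        short = stop - start < min_len
        if merged and (short or merged[-1][0] == label):
            prev_label, prev_start, _ = merged[-1]
            merged[-1] = (prev_label, prev_start, stop)  # extend the previous group
        else:
            merged.append((label, start, stop))
    return merged

labels = "HHHHHEEHHHHHTTTTEEEEE"  # H = helix, E = strand, T = turn
runs = group_runs(labels)
groups = merge_short_runs(runs)
```

Here the two-residue strand inside the helix is absorbed, so the two helical runs become a single group spanning residues 0 to 11.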

remove_loops(verbose: bool = False) None

Prop3D.common.DistributedStructure module

class Prop3D.common.DistributedStructure.DistributedStructure(path: str, key: str, cath_domain_dataset: str | None = None, coarse_grained: bool = False)

Bases: AbstractStructure

A structure class to deal with structures originating from a distributed HSDS instance.

Parameters:
  • path (str) – path to h5 file in HSDS endpoint to access structures

  • key (str) – Key to access a specific protein inside the HDF file

  • cath_domain_dataset (str) – The CATH superfamily if endpoint is setup to use CATH (use ‘/’ instead of ‘.’)

  • coarse_grained (boolean) – Use a residue-only model instead of an all-atom model. Default False. Warning: not fully implemented.

deep_copy_feature(feature_name: str, memo: Any) Any

Deep copy a specific feature

Parameters:
  • feature_name (str) – Feature name to copy

  • memo – objects to pass to deepcopy

Raises:

NotImplementedError – If no method exists to handle the feature

get_atoms(atoms: array | None = None, include_hetatms: bool = False, exclude_atoms: list[int] | None = None, include_atoms: list[int] | None = None) Iterator[array]

Enumerate over all atoms with options to filter

Parameters:
  • include_hetatms (boolean) – Include hetero atoms or not. Default is False.

  • exclude_atoms (list) – List of atoms to skip during enumeration (IDs or Python objects, depending on the model)

  • include_atoms (list) – List of atoms to include during enumeration (IDs or Python objects, depending on the model)

get_residues() Iterator[array]

Yields slices of the data for all atoms in single residue

unfold_entities(entity_list: array, target_level: str = 'A') Iterator[array]

Map lower-level entities such as atoms (single row) onto higher-level entities such as residues (multiple rows). Only works for atoms and chains. Adapted from BioPython.

Parameters:
  • entity_list (list) – List of entities to unfold

  • target_level (str) – level to map to, eg: ‘A’ for atom, ‘R’ for residue

Yields:

Either single row atoms or multiple rows for residues
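To make the atom-to-residue mapping concrete, the sketch below uses a small NumPy structured array in which each row is an atom. The field names (serial_number, residue_id, atom_name) and the function unfold are assumptions for illustration and may not match the actual dataset layout.

```python
import numpy as np

# One row per atom; field names are hypothetical.
atoms = np.array(
    [(1, 10, "N"), (2, 10, "CA"), (3, 11, "N"), (4, 11, "CA")],
    dtype=[("serial_number", "i4"), ("residue_id", "i4"), ("atom_name", "U4")],
)

def unfold(rows, all_atoms, target_level="A"):
    """Yield single atom rows ('A') or all rows of each parent residue ('R')."""
    if target_level == "A":
        for row in rows:
            yield row
    elif target_level == "R":
        seen = set()
        for row in rows:
            rid = int(row["residue_id"])
            if rid not in seen:  # yield each residue once
                seen.add(rid)
                yield all_atoms[all_atoms["residue_id"] == rid]
    else:
        raise ValueError(f"unsupported target level: {target_level!r}")

# Atom 2 (row index 1) belongs to residue 10, which has two atom rows.
residues = list(unfold(atoms[[1]], atoms, target_level="R"))
```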

save_pdb(path: str | None = None, header: str | None = None, file_like: str | None = False, rewind: bool = True) str | IO

Write PDB to file

Parameters:
  • path (None or str) – Path to save PDB file. If None, file_like needs to be True.

  • header (str or list of strs) – Header string to write to the beginning of each PDB file

  • file_like (boolean) – Return a StringIO object of the PDB file, do not write to disk. Default False.

  • rewind (boolean) – If returning a file-like object, rewind the beginning of the file

Return type:

None or file-like object of PDB file data

get_bfactors() array

Get bfactors for all atoms

write_features(path: str | None = None, key: str | None = None, features=typing.Union[str, list[str], NoneType], coarse_grained: bool = False, name: str | None = None, work_dir: str | None = None, force: bool | int | None = None, multiple: bool = False) None

Write features to an hdf file

Parameters:
  • path (str) – Path to save HDF file

  • key (str) – Key to save dataset inside HDF file

  • features (str or list of strs) – Features to write

  • coarse_grained (boolean) – Include features only at the residue level. Default False

  • name (str) – File name to write file

  • work_dir (None or str) – Directory to save file

  • force (bool) – Not used

  • multiple (bool) – Not used

add_features(coarse_grained: bool = False, **features: Any)

Add a feature column to dataset

get_coords() array

Get XYZ coordinates for all atoms as numpy array

get_coord(atom: int) array

Get XYZ coordinates for an atom

Parameters:

atom (int) – Serial number of atom

Return type:

xyz coordinate of atom

get_elem(atom: int) str

Get element type for an atom

Parameters:

atom (int) – Serial number of atom

Return type:

element name of atom

update_bfactors(b_factors: array) None

Reset bfactors for all atoms. New numpy array must be same length as the atom array

update_coords(coords: array) None

Update XYZ coordinates with a new set of coordinates for the same atoms

calculate_neighbors(d_cutoff: float = 100.0) Iterator[tuple[array, array]]

Calculates intermolecular contacts in a parsed struct object.

Parameters:

d_cutoff (float) – Distance to find neighbors

Returns:

A list of lists of nearby elements at the specified level

Return type:

[(a1,b2),]

get_vdw(atom_or_residue: array) float

Get the van der Waals radius for an atom or, if it's a residue, return an approximate volume as a sphere around all atoms in the residue
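One way to approximate a residue by a sphere is sketched below, assuming the sphere is centered on the residue centroid and reaches the outer van der Waals surface of the farthest atom. The radii are the commonly quoted Bondi values in Angstroms; the helper name and the exact construction are assumptions, not the documented Prop3D behavior.

```python
import math
import numpy as np

# Bondi van der Waals radii in Angstroms for common protein heavy atoms.
VDW_RADII = {"C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}

def residue_sphere_volume(coords: np.ndarray, elements: list) -> float:
    """Volume of a centroid-centered sphere enclosing all atomic vdW spheres."""
    center = coords.mean(axis=0)
    radii = np.array([VDW_RADII[e] for e in elements])
    # Distance from the centroid to the outer surface of each atom.
    reach = np.linalg.norm(coords - center, axis=1) + radii
    radius = float(reach.max())
    return 4.0 / 3.0 * math.pi * radius ** 3

# Two carbon atoms 2 Angstroms apart: enclosing radius is 1.0 + 1.7 = 2.7.
volume = residue_sphere_volume(np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]), ["C", "C"])
```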

remove_loops(verbose: bool = False) None

Remove atoms present in loop regions

Prop3D.common.DistributedVoxelizedStructure module

Prop3D.common.LocalStructure module

Prop3D.common.ProteinTables module

Useful tables of protein properties

Prop3D.common.ProteinTables.three_to_one(aa_name)
Prop3D.common.ProteinTables.to_int(row)
Prop3D.common.ProteinTables.atoms_to_aa(atoms, raise_unknown=True)
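These functions are listed without descriptions. As an illustration of what a three-letter to one-letter conversion typically involves, here is a minimal sketch for the 20 standard amino acids. The names THREE_TO_ONE and to_one_letter, and the choice of "X" for unknown residues, are assumptions and may not match the behavior of Prop3D.common.ProteinTables.three_to_one.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def to_one_letter(aa_name: str, unknown: str = "X") -> str:
    """Map a three-letter residue name to its one-letter code (sketch only)."""
    return THREE_TO_ONE.get(aa_name.strip().upper(), unknown)

sequence = "".join(to_one_letter(name) for name in ["MET", "Lys", "trp", "SEP"])
```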

Prop3D.common.features module

Prop3D.common.featurizer module

Module contents

Meadowlark

A collection of scripts to process individual protein structures for use in machine learning tasks. Proteins can be:

  1. ‘Cleaned’ by adding missing residues and atoms;

  2. Featurized with atom- and residue-based biophysical properties calculated using known structural bioinformatics tools that have been Dockerized (see Prop3D.ml);

  3. Converted, along with their features, into sparse 3D volumes for use in sparse 3D CNNs.
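The last step can be pictured as binning atoms into voxels and storing only the occupied ones. The sketch below is a simplified illustration with NumPy: the voxel size, the origin at the minimum coordinate, and averaging the features of atoms that share a voxel are all assumptions, and voxelize_sparse is not a Prop3D function.

```python
import numpy as np

def voxelize_sparse(coords: np.ndarray, features: np.ndarray, voxel_size: float = 1.0):
    """Return (occupied voxel indices, mean feature vector per occupied voxel)."""
    # Integer grid index of each atom, measured from the minimum corner.
    idx = np.floor((coords - coords.min(axis=0)) / voxel_size).astype(int)
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flat mapping: atom row -> voxel row
    sums = np.zeros((len(voxels), features.shape[1]))
    np.add.at(sums, inverse, features)  # accumulate features per voxel
    counts = np.bincount(inverse, minlength=len(voxels))
    return voxels, sums / counts[:, None]

coords = np.array([[0.2, 0.2, 0.2], [0.4, 0.1, 0.3], [1.5, 0.2, 0.2]])
features = np.array([[1.0], [3.0], [10.0]])
voxels, voxel_features = voxelize_sparse(coords, features)
```

Only two voxels are occupied in this example, so only two index rows and two feature rows are stored instead of a dense grid.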