Features
Prop3D calculates the features for every atom in every protein/domain in a dataset, saving all results into different HDF datasets for each protein.
For atom-level features, we create one-hot encodings for 23 AutoDock atom names, 16 element names, and 21 residue types (20 standard amino acids and one UNKnown placeholder). We also include van der Waals radii, charges from Pdb2Pqr (Q62856803), electrostatic potentials computed via APBS (Q65072984), concavity values that we calculate via CX (Q114841750), various hydrophobicity features of the residue that an atom belongs to (Kyte-Doolite, Biological and Octanol), and two measures of accessible surface area (per-atom, via FreeSASA (Q114841793), and per-residue via DSSP (Q5206192)). We also include different types of secondary structure information: one-hot encodings for DSSP (Q5206192)) 3- and 7- secondary structure classifications, as well as the backbone torsion angles ($phi$, $psi$; along with embedded sine and cosine transformations of each). We also annotate aromaticity, and hydrogen bond acceptors and donors, based on AutoDock atom-name types. As a gauge of phylogenetic conservation, we include sequence entropy scores from EPPIC (Q114841783). These biophysical, physicochemical, structural, and phylogenetic features are summarized above and are exhaustively enumerated in the tables below.
We also provide functionality to create discretized values of features via the application of Boolean logic operators to the corresponding continuous-valued quantities of a given descriptor, using simple numerical thresholding (Table ref{table:bool-feats}).
All Features
Feature
|
Voxel
Aggregation
Rule
|
Residue
Aggregation
Rule
|
Source Software
or Database
|
|---|---|---|---|
H |
max |
MGLTools (Q114840701) |
|
HD |
max |
MGLTools (Q114840701) |
|
HS |
max |
MGLTools (Q114840701) |
|
C |
max |
MGLTools (Q114840701) |
|
A |
max |
MGLTools (Q114840701) |
|
N |
max |
MGLTools (Q114840701) |
|
NA |
max |
MGLTools (Q114840701) |
|
NS |
max |
MGLTools (Q114840701) |
|
OA |
max |
MGLTools (Q114840701) |
|
OS |
max |
MGLTools (Q114840701) |
|
F |
max |
MGLTools (Q114840701) |
|
MG |
max |
MGLTools (Q114840701) |
|
P |
max |
MGLTools (Q114840701) |
|
SA |
max |
MGLTools (Q114840701) |
|
S |
max |
MGLTools (Q114840701) |
|
CL |
max |
MGLTools (Q114840701) |
|
CA |
max |
MGLTools (Q114840701) |
|
MN |
max |
MGLTools (Q114840701) |
|
FE |
max |
MGLTools (Q114840701) |
|
ZN |
max |
MGLTools (Q114840701) |
|
BR |
max |
MGLTools (Q114840701) |
|
I |
max |
MGLTools (Q114840701) |
|
Unk_atom |
max |
MGLTools (Q114840701) |
|
C_elem |
max |
PDB File |
|
N_elem |
max |
PDB File |
|
O_elem |
max |
PDB File |
|
S_elem |
max |
PDB File |
|
H_elem |
max |
PDB File |
|
F_elem |
max |
PDB File |
|
MG_elem |
max |
PDB File |
|
P_elem |
max |
PDB File |
|
CL_elem |
max |
PDB File |
|
CA_elem |
max |
PDB File |
|
MN_elem |
max |
PDB File |
|
FE_elem |
max |
PDB File |
|
ZN_elem |
max |
PDB File |
|
BR_elem |
max |
PDB File |
|
I_elem |
max |
PDB File |
|
Unk_elem |
max |
PDB File |
|
vdw |
mean |
yes |
cite{Bondi1964} |
partial charge (charge) |
mean |
sum |
Pdb2Pqr (Q62856803) |
electrostatic_potential |
mean |
sum |
APBS (Q65072984) |
concavity (cx) |
mean |
mean |
CX (Q114841750) |
hydrophobicity |
mean |
yes |
Kyte-Doolittle |
biological_hydrophobicity |
mean |
yes |
cite{Hessa2005} |
octanal_hydrophobicity |
mean |
yes |
Wimley-White 1996 |
atom_asa |
mean |
FreeSASA (Q114841793) |
|
residue_rasa |
mean |
yes |
DSSP (Q5206192) |
ALA |
max |
yes |
PDB File |
CYS |
max |
yes |
PDB File |
ASP |
max |
yes |
PDB File |
GLU |
max |
yes |
PDB File |
PHE |
max |
yes |
PDB File |
GLY |
max |
yes |
PDB File |
HIS |
max |
yes |
PDB File |
ILE |
max |
yes |
PDB File |
LYS |
max |
yes |
PDB File |
LEU |
max |
yes |
PDB File |
MET |
max |
yes |
PDB File |
ASN |
max |
yes |
PDB File |
PRO |
max |
yes |
PDB File |
GLN |
max |
yes |
PDB File |
ARG |
max |
yes |
PDB File |
SER |
max |
yes |
PDB File |
THR |
max |
yes |
PDB File |
VAL |
max |
yes |
PDB File |
TRP |
max |
yes |
PDB File |
TYR |
max |
yes |
PDB File |
Unk_residue |
max |
yes |
PDB File |
phi |
mean |
yes |
BioPython (Q4118434) |
phi_sin |
mean |
yes |
NumPy |
phi_cos |
mean |
yes |
NumPy |
psi |
mean |
yes |
BioPython (Q4118434) |
psi_sin |
mean |
yes |
NumPy |
psi_cos |
mean |
yes |
NumPy |
is_helix |
max |
yes |
DSSP (Q5206192) |
is_sheet |
max |
yes |
DSSP (Q5206192) |
Unk_SS |
max |
yes |
DSSP (Q5206192) |
is_regular_helix |
max |
yes |
DSSP (Q5206192) |
is_beta_bridge |
max |
yes |
DSSP (Q5206192) |
is_extended_strand |
max |
yes |
DSSP (Q5206192) |
is_310_helix |
max |
yes |
DSSP (Q5206192) |
is_pi_helix |
max |
yes |
DSSP (Q5206192) |
is_hbond_turn |
max |
yes |
DSSP (Q5206192) |
is_bend |
max |
yes |
DSSP (Q5206192) |
no_ss |
max |
yes |
DSSP (Q5206192) |
hydrophobic_atom |
max |
MGLTools (Q114840701) |
|
aromatic_atom |
max |
MGLTools (Q114840701) |
|
hbond_acceptor |
max |
MGLTools (Q114840701) |
|
hbond_donor |
max |
MGLTools (Q114840701) |
|
metal |
max |
MGLTools (Q114840701) |
|
eppic_entropy |
min |
yes |
EPPIC (Q114841783) |
Boolean Feature |
Source Feature |
Equality |
Threshold |
|---|---|---|---|
neg_charge |
charge |
< |
0 |
pos_charge |
charge |
> |
0 |
is_electronegative |
electrostatic_potential |
< |
0 |
is_concave |
cx |
leq |
2 |
is_hydrophobic |
hydrophobicity |
> |
0 |
residue_buried |
residue_rasa |
<$ |
0.2 |
is_conserved |
eppic_entropy |
< |
0.5 |
Feature Subsets
Many of the features are no doubt correlated, so you should pick a small subset of features that are of interest. We have created a few subsets of data that can be used to train model or use as labels.
standard
‘C’, ‘A’, ‘N’, ‘OA’, ‘OS’, ‘C_elem’, ‘N_elem’, ‘O_elem’, ‘S_elem’, ‘is_helix’, ‘is_sheet’, ‘residue_buried’, ‘is_hydrophobic’, ‘pos_charge’, ‘is_electronegative’
get_atom_type
‘H’, ‘HD’, ‘HS’, ‘C’, ‘A’, ‘N’, ‘NA’, ‘NS’, ‘OA’, ‘OS’, ‘F’, ‘MG’, ‘P’, ‘SA’, ‘S’, ‘CL’, ‘CA’, ‘MN’, ‘FE’, ‘ZN’, ‘BR’, ‘I’, ‘Unk_atom’
get_element_type
‘C_elem’, ‘N_elem’, ‘O_elem’, ‘S_elem’, ‘H_elem’, ‘F_elem’, ‘MG_elem’, ‘P_elem’, ‘CL_elem’, ‘CA_elem’, ‘MN_elem’, ‘FE_elem’, ‘ZN_elem’, ‘BR_elem’, ‘I_elem’, ‘Unk_elem’
get_charge_and_electrostatics
‘charge’, ‘neg_charge’, ‘pos_charge’, ‘electrostatic_potential’, ‘is_electronegative’
get_concavity
‘cx’, ‘is_concave’
get_hydrophobicity
‘hydrophobicity’, ‘is_hydrophobic’, ‘biological_hydrophobicity’, ‘octanal_hydrophobicity’
get_accessible_surface_area
‘atom_asa’, ‘residue_rasa’, ‘residue_buried’
get_residue
‘ALA’, ‘CYS’, ‘ASP’, ‘GLU’, ‘PHE’, ‘GLY’, ‘HIS’, ‘ILE’, ‘LYS’, ‘LEU’, ‘MET’, ‘ASN’, ‘PRO’, ‘GLN’, ‘ARG’, ‘SER’, ‘THR’, ‘VAL’, ‘TRP’, ‘TYR’, ‘Unk_residue’
get_ss
‘phi’, ‘phi_sin’, ‘phi_cos’, ‘psi’, ‘psi_sin’, ‘psi_cos’, ‘is_helix’, ‘is_sheet’, ‘Unk_SS’, ‘is_regular_helix’, ‘is_beta_bridge’, ‘is_extended_strand’, ‘is_310_helix’, ‘is_pi_helix’, ‘is_hbond_turn’, ‘is_bend’, ‘no_ss’
get_deepsite_features
‘hydrophobic_atom’, ‘aromatic_atom’, ‘hbond_acceptor’, ‘hbond_donor’, ‘metal’
get_evolutionary_conservation_score
‘eppic_entropy’, ‘is_conserved’
get_frustration
‘density_res’, ‘native_energy’, ‘decoy_energy’, ‘sd_energy’, ‘frustration_index’, ‘is_highly_frustrated’, ‘is_minimally_frustrated’, ‘has_neutral_frustration’